Posted on May 15, 2026
Lead Kernel Engineer/Architect (m/f/d)
EPAM
München, Bayern 80331, Germany
Full-time
Reference: 87604099
We're looking for a Lead Kernel Engineer/Architect to join our team in Germany in a hybrid working mode. Are you passionate about pushing advanced hardware accelerators to their limits? Join us in shaping the future of AI performance and scalability.

As a Lead Kernel Engineer/Architect, you will drive the optimization of critical machine learning operations for large-scale training and inference, working with cutting-edge hardware like TPUs and GPUs, advanced ML models and performance toolchains. Your work will enable faster AI research and production deployments on cloud platforms and within open-source ecosystems. In this role, you will collaborate with researchers, compiler engineers and framework developers to deliver optimized, high-performance solutions that set the standard for modern AI computation.

Responsibilities
Design and optimize high-performance kernels for TPU and GPU architectures using low-level programming frameworks such as Pallas, Triton or Mosaic
Build and maintain performance infrastructure, including benchmarking suites, autotuning systems, regression testing frameworks and tooling
Collaborate with ML framework developers (e.g., JAX, PyTorch) and compiler teams (XLA/MLIR) to integrate custom kernels and reduce performance bottlenecks
Track advancements in accelerator hardware, compiler technology and AI model design to identify opportunities for kernel-level optimization
Develop clear documentation, APIs and supporting OSS components that improve developer usability and adoption
Analyze and resolve complex performance issues impacting large-scale distributed training and inference systems

Requirements
Bachelor's degree or equivalent practical experience
12+ years of industry experience in software engineering or systems programming
5+ years of experience in software development using C++ or Python
3+ years of experience in testing, maintaining or launching software products and at least 1 year in software design or architecture
Hands-on experience in performance optimization at the kernel level for accelerators or high-performance systems

Nice to have
Proficiency in low-level accelerator programming (CUDA, Triton, Pallas)
Familiarity with ML frameworks such as JAX or PyTorch and optimization techniques for attention layers, Mixture of Experts (MoE) and precision tuning
Strong understanding of modern hardware accelerators, including pipelining, data movement and heterogeneous compute
Knowledge of compiler principles and intermediate representations (e.g., MLIR, OpenXLA)
Experience building OSS developer infrastructure, APIs and performance-critical libraries
Excellent problem-solving skills and ability to collaborate in cross-functional engineering environments