Posted on 15 May 2026
Senior AI Platform Engineer
eMFusion Global
Berlin, Berlin 10115, Germany
Full-time
Reference: 810973689
About the Role
We are working with a leading international consultancy that is building scalable, production-grade AI SaaS products within their dedicated AI Lab. This is a greenfield opportunity - you will combine deep technical expertise with strategic vision to design and build AI-powered platforms that transform enterprise clients' business models.
The AI Lab is developing cutting-edge, large-scale AI products delivering sustained commercial impact. The team operates with a startup mindset: agile, flat hierarchies, and a genuine bias for experimentation and ownership.
The Opportunity
This is a rare full-stack platform engineering role that spans infrastructure architecture through to LLM operationalisation. You will own the platform layer end-to-end - from Kubernetes cluster operations and IaC through to model serving, RAG pipelines, and LLMOps.
Key themes of the role:
- Design and evolve a multi-tenant SaaS architecture with tenant isolation, per-tenant controls, and enterprise security
- Build automated tenant provisioning, safe rollouts (canary/feature flags), and noisy-neighbor protection
- Operationalise LLMs end-to-end - fine-tuning, evaluation, high-performance serving, monitoring, and embeddings workflows
- Drive MLOps foundations: automated training pipelines, experiment tracking, and scalable model deployment
- Manage Kubernetes clusters, GPU-heavy workloads, and autoscaling on AWS
- Build unified CI/CD pipelines shipping ML and application code seamlessly
- Implement comprehensive observability: logs, metrics, traces, model/data drift detection
- Embed enterprise security and compliance - IAM, RBAC, VPC design, secrets management, encryption - at every layer
- Design well-architected ETL/ELT pipelines, streaming systems, feature store integration, and workflow orchestration
Technical Requirements
Platform & Multi-Tenancy
- Proven patterns for tenant isolation (DB-per-tenant, schema-per-tenant, row-level security), tenant-aware caching, noisy-neighbor protection
- OIDC/OAuth2, tenant-aware RBAC/ABAC, SCIM provisioning, and audit logging for B2B SaaS
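As an illustration of what noisy-neighbor protection can look like in practice, the sketch below caps each tenant's request rate with its own token bucket. It is a minimal, single-process example under stated assumptions: the class name `TenantRateLimiter` and its `rate` and `capacity` parameters are hypothetical and do not describe the client's platform.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class _Bucket:
    tokens: float   # tokens currently available to the tenant
    updated: float  # timestamp of the last refill


@dataclass
class TenantRateLimiter:
    """Hypothetical per-tenant token bucket (illustrative only)."""

    rate: float      # tokens refilled per second, per tenant
    capacity: float  # maximum burst size, per tenant
    _buckets: Dict[str, _Bucket] = field(default_factory=dict)

    def allow(self, tenant_id: str, now: Optional[float] = None) -> bool:
        """Return True if the tenant may proceed, consuming one token."""
        now = time.monotonic() if now is None else now
        # Each tenant gets an independent bucket, so one tenant's burst
        # cannot consume another tenant's allowance.
        bucket = self._buckets.setdefault(tenant_id, _Bucket(self.capacity, now))
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - bucket.updated)
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.rate)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

A production setup would more likely enforce such limits in a shared store or at the gateway, so that all replicas see the same per-tenant counters.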
Kubernetes & Infrastructure
- Deep Kubernetes: cluster ops, HPA/VPA, node pools, GPU scheduling, Karpenter, PDBs, network policies, multi-AZ design
- Service mesh (Istio/Linkerd), ingress patterns (ALB/Nginx), secure egress, mTLS
- Infrastructure as Code beyond basics: Terraform modules, Terragrunt, policy-as-code (OPA/Conftest), secrets automation
- GitOps (ArgoCD/Flux), progressive delivery (Argo Rollouts/Flagger), feature flags, canary and blue/green deployments
MLOps & Model Lifecycle
- Model lifecycle tooling: MLflow/W&B, model registry, experiment tracking, reproducible training, dataset versioning (DVC/lakeFS)
- Pipeline orchestration: Airflow, Prefect, or Dagster + artifact stores
- Model serving: KServe, Seldon, BentoML, or Ray Serve - online, async/batch inference, autoscaling, rollback
LLMOps
- Prompt and version management, offline + online evaluation harnesses, RAG evaluation (retrieval metrics, groundedness), guardrails, red-teaming basics
- Streaming inference (SSE/WebSockets), caching, routing, fallback models
- Vector DB experience: pgvector, Pinecone, Weaviate, or Milvus - embedding lifecycle, backfills, re-embedding, indexing strategies
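For context, the retrieval metrics mentioned under RAG evaluation are typically simple rank-based scores. The sketch below shows two common ones, recall@k and reciprocal rank, under the assumption that documents are identified by string IDs and that a labelled set of relevant IDs exists for each query. The function names are illustrative and are not taken from any particular evaluation library.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical query result: ranked IDs from the retriever and labelled truth.
retrieved_ids = ["d3", "d1", "d7"]
relevant_ids = {"d1", "d2"}
print(recall_at_k(retrieved_ids, relevant_ids, k=2))
print(reciprocal_rank(retrieved_ids, relevant_ids))
```

Averaging `reciprocal_rank` over a query set gives mean reciprocal rank (MRR). Groundedness is usually assessed separately, on the generated answer rather than on the retrieved list.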
Observability & Security
- OpenTelemetry, tracing, SLOs - Prometheus/Grafana, Loki/ELK, Datadog/New Relic
- Incident management: postmortems, runbooks, error budgets
- GDPR, encryption at rest/in transit, secrets management (AWS Secrets Manager/Vault), KMS, key rotation
- SOC 2 / ISO 27001 familiarity, vulnerability scanning (Trivy/Grype), SBOMs, SAST/DAST
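To make the SLO and error-budget items above concrete, the sketch below converts an availability target into allowed downtime for a rolling window. It assumes a purely time-based availability SLO; the function names and the 30-day default window are illustrative choices, not requirements of the role.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))
```

Teams often gate risky rollouts on the remaining budget, for example pausing canary promotions once `budget_remaining` approaches zero.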
About You
- You have shipped and operated customer-facing SaaS products at scale with real users
- You have owned end-to-end ML/AI infrastructure - from data ingestion through to production monitoring
- You enable engineers and data scientists to move faster through self-service platforms and automated workflows
- You have a track record of designing systems that scale globally across regions and traffic patterns
- You are comfortable with incident response, on-call rotations, and stabilising critical production systems
- You think with a product mindset - customer value, reliability, and speed-to-market over technology for its own sake
- You have a strong bias for automation and eliminating manual operational toil
- You communicate clearly - async collaboration, documentation, and explaining technical decisions to non-technical audiences
What's on Offer
- Genuine greenfield platform engineering ownership - build it from scratch
- Startup atmosphere with flat hierarchies within a globally established firm
- Hybrid working, international mobility across a wide office network
- Extensive learning and development programmes
- Competitive package including bonus