Posted on 15 May 2026

Senior AI Platform Engineer

eMFusion Global
Berlin, Berlin 10115, Germany
Full-time
Reference: 810973689

About the Role

We are working with a leading international consultancy that is building scalable, production-grade AI SaaS products within its dedicated AI Lab. This is a greenfield opportunity: you will combine deep technical expertise with strategic vision to design and build AI-powered platforms that transform enterprise clients' business models.

The AI Lab is developing cutting-edge, large-scale AI products that deliver sustained commercial impact. The team operates with a startup mindset: agile ways of working, flat hierarchies, and a genuine bias for experimentation and ownership.

The Opportunity

This is a rare full-stack platform engineering role that spans everything from infrastructure architecture to LLM operationalisation. You will own the platform layer end-to-end: from Kubernetes cluster operations and infrastructure as code (IaC) to model serving, RAG pipelines, and LLMOps.

Key themes of the role:
  • Design and evolve a multi-tenant SaaS architecture with tenant isolation, per-tenant controls, and enterprise security
  • Build automated tenant provisioning, safe rollouts (canary/feature flags), and noisy-neighbor protection
  • Operationalise LLMs end-to-end: fine-tuning, evaluation, high-performance serving, monitoring, and embedding workflows
  • Drive MLOps foundations: automated training pipelines, experiment tracking, and scalable model deployment
  • Manage Kubernetes clusters, GPU-heavy workloads, and autoscaling on AWS
  • Build unified CI/CD pipelines that ship both ML and application code
  • Implement comprehensive observability: logs, metrics, traces, model/data drift detection
  • Embed enterprise security and compliance (IAM, RBAC, VPC design, secrets management, encryption) at every layer
  • Design well-architected ETL/ELT pipelines, streaming systems, feature store integration, and workflow orchestration

Technical Requirements

Platform & Multi-Tenancy
  • Proven experience with tenant isolation patterns (DB-per-tenant, schema-per-tenant, row-level security), tenant-aware caching, and noisy-neighbor protection
  • OIDC/OAuth2, tenant-aware RBAC/ABAC, SCIM provisioning, and audit logging for B2B SaaS

Kubernetes & Infrastructure
  • Deep Kubernetes expertise: cluster ops, HPA/VPA, node pools, GPU scheduling, Karpenter, PDBs, network policies, multi-AZ design
  • Service mesh (Istio/Linkerd), ingress patterns (ALB/Nginx), secure egress, mTLS
  • Infrastructure as Code beyond the basics: Terraform modules, Terragrunt, policy-as-code (OPA/Conftest), secrets automation
  • GitOps (ArgoCD/Flux), progressive delivery (Argo Rollouts/Flagger), feature flags, canary and blue/green deployments

MLOps & Model Lifecycle
  • Model lifecycle tooling: MLflow/W&B, model registry, experiment tracking, reproducible training, dataset versioning (DVC/lakeFS)
  • Pipeline orchestration: Airflow, Prefect, or Dagster, plus artifact stores
  • Model serving: KServe, Seldon, BentoML, or Ray Serve, covering online and async/batch inference, autoscaling, and rollback

LLMOps
  • Prompt and version management, offline and online evaluation harnesses, RAG evaluation (retrieval metrics, groundedness), guardrails, red-teaming basics
  • Streaming inference (SSE/WebSockets), caching, routing, fallback models
  • Vector DB experience with pgvector, Pinecone, Weaviate, or Milvus, including embedding lifecycle, backfills, re-embedding, and indexing strategies

Observability & Security
  • OpenTelemetry, distributed tracing, and SLOs, using tools such as Prometheus/Grafana, Loki/ELK, and Datadog/New Relic
  • Incident management: postmortems, runbooks, error budgets
  • GDPR, encryption at rest/in transit, secrets management (AWS Secrets Manager/Vault), KMS, key rotation
  • SOC 2 / ISO 27001 familiarity, vulnerability scanning (Trivy/Grype), SBOMs, SAST/DAST

About You

  • You have shipped and operated customer-facing SaaS products at scale with real users
  • You have owned end-to-end ML/AI infrastructure - from data ingestion through to production monitoring
  • You enable engineers and data scientists to move faster through self-service platforms and automated workflows
  • You have a track record of designing systems that scale globally across regions and traffic patterns
  • You are comfortable with incident response, on-call rotations, and stabilising critical production systems
  • You think with a product mindset - customer value, reliability, and speed-to-market over technology for its own sake
  • You have a strong bias for automation and eliminating manual operational toil
  • You communicate excellently: async collaboration, documentation, and explaining technical decisions to non-technical audiences

What's on Offer

  • Genuine greenfield ownership of the platform: build it from scratch
  • Startup atmosphere with flat hierarchies within a globally established firm
  • Hybrid working and international mobility across a wide office network
  • Extensive learning and development programmes
  • Competitive package including bonus
