- Company Name: Mistplay
- Job Title: Senior ML Platform / ML Infrastructure Engineer II
- Job Description:
Role Summary: Design, build, and maintain end‑to‑end production ML pipelines and infrastructure, ensuring reliable, scalable, and low‑latency model serving across hybrid cloud environments.
Expectations:
- Deliver high‑quality ML solutions that directly impact business metrics.
- Demonstrate ownership from data ingestion to model monitoring, with a focus on automation, observability, and cost control.
- Collaborate cross‑functionally with data science, security, SRE, and product teams.
Key Responsibilities:
- Create and maintain standardized training‑to‑deployment pipelines using Airflow, integrating artifact management, environment provisioning, packaging, and CI/CD for SageMaker endpoints; see the Airflow DAG sketch after this list.
- Engineer real‑time and batch inference workflows on SageMaker (multi‑model endpoints, serverless inference, blue/green and canary strategies); see the canary rollout sketch after this list.
- Optimize low‑latency inference using Redis/Valkey for feature caching, online retrieval, request‑level state, response caching, and rate limiting; see the cache‑aside sketch after this list.
- Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints, ECR/ECS/EKS, VPC/network endpoints, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.
- Develop platform abstractions and "golden paths": Airflow DAGs, a CLI/SDK, cookie‑cutter repos, and CI/CD pipelines that take models from notebook to production predictably.
- Govern model lifecycle: model/feature registries, approval workflows, promotion policies, lineage, and audit trails integrated with Airflow and Terraform state.
- Implement end‑to‑end observability: data/feature freshness checks, drift/quality controls, SLOs for latency and performance, infrastructure health dashboards, traceability, alerts, incident response, and post‑mortem analysis; see the latency‑SLO alarm sketch after this list.
- Work with security, SRE, and data engineering on private networking, IaC security policies, PII handling, least‑privilege IAM, and cost‑optimized architectures.
- Evaluate, integrate, and rationalize platform tools (MLflow registry, feature stores, service gateways); lead migrations with clear change management and minimal downtime.
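
As a concrete illustration of the training‑to‑deployment responsibility above, the sketch below shows a skeleton Airflow DAG. It is a minimal sketch only, assuming Airflow 2.4+ with the TaskFlow API; the schedule, bucket path, registry step, and task bodies are hypothetical placeholders, not Mistplay's actual pipeline.

```python
# Skeleton training-to-deployment DAG (assumes Airflow 2.4+; all names are placeholders).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["ml-platform"])
def train_to_deploy():
    @task
    def train() -> str:
        # Launch a SageMaker training job here and return the model artifact's S3 URI.
        return "s3://example-bucket/artifacts/model.tar.gz"  # placeholder

    @task
    def register(artifact_uri: str) -> str:
        # Register the artifact (e.g. in an MLflow model registry) and return its version.
        return "1"  # placeholder

    @task
    def deploy(model_version: str) -> None:
        # Create or update the SageMaker endpoint config and hand off to CI/CD.
        pass

    deploy(register(train()))


train_to_deploy()
```

Each task would typically delegate to the CLI/SDK from the "golden paths" item so that packaging, artifact management, and promotion policies are shared across teams.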
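
For the blue/green and canary strategies in the inference responsibility above, a canary rollout on an existing SageMaker endpoint can be expressed with boto3's DeploymentConfig. This is a hedged sketch: the endpoint, endpoint‑config, and alarm names are hypothetical.

```python
# Canary rollout sketch: shift 10% of capacity to a new endpoint config, bake for
# 10 minutes, then shift the rest; auto-roll back if the latency alarm fires.
# Endpoint, config, and alarm names are hypothetical.
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="ranker-prod",
    EndpointConfigName="ranker-prod-v2",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,  # bake time before shifting remaining traffic
            },
            "TerminationWaitInSeconds": 300,  # keep the old fleet briefly for fast rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "ranker-prod-p99-latency"}]  # defined in the SLO sketch below
        },
    },
)
```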
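
The Redis/Valkey feature‑caching responsibility usually reduces to a cache‑aside pattern in front of the feature store. Below is a minimal sketch, assuming a reachable Redis/Valkey cluster and a hypothetical fetch_features_from_store() fallback; key names and the TTL are illustrative.

```python
# Cache-aside feature retrieval for low-latency inference (key names and TTL are illustrative).
import json
import os

import redis

r = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379, decode_responses=True)

FEATURE_TTL_SECONDS = 300  # bound staleness of cached features for online scoring


def fetch_features_from_store(user_id: str) -> dict:
    """Hypothetical fallback to the online/offline feature store."""
    return {"sessions_7d": 12, "avg_spend": 3.5}  # placeholder values


def get_features(user_id: str) -> dict:
    key = f"features:user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: stays on the sub-millisecond path
    features = fetch_features_from_store(user_id)  # cache miss: fall back to the feature store
    r.setex(key, FEATURE_TTL_SECONDS, json.dumps(features))  # write back with a TTL
    return features
```

The same client can back response caching and rate limiting (e.g. counters with TTLs), which is why a single Valkey cluster often serves several of the low‑latency concerns listed above.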
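
The latency SLOs under the observability responsibility can be enforced with a CloudWatch alarm on SageMaker's ModelLatency metric (reported in microseconds). This is a sketch under assumptions: the endpoint, variant, and threshold are illustrative, and the alarm name matches the hypothetical one used for auto‑rollback in the canary sketch.

```python
# p99 latency alarm sketch for a hypothetical endpoint variant (ModelLatency is in microseconds).
import boto3

cw = boto3.client("cloudwatch")

cw.put_metric_alarm(
    AlarmName="ranker-prod-p99-latency",  # referenced by the canary rollback config above
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "ranker-prod"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=250_000,  # illustrative 250 ms p99 budget
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```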
Required Skills:
- 5+ years building production ML/data platforms, with a focus on service reliability and developer experience.
- Strong software engineering skills in Python; experience with Go or Java preferred.
- Proficient with SageMaker, Airflow, Terraform, Docker, Kubernetes (EKS), EC2/ECR, ElastiCache/Valkey, and CI/CD pipelines.
- Expertise in real‑time and batch inference strategies and cost‑aware scaling (spot instances, right‑sizing).
- Deep knowledge of observability: monitoring, logging, tracing, SLOs, alerting, incident response.
- Familiarity with ML lifecycle governance: model registries, feature stores, lineage, audit.
- Experience in securing ML services: IAM, network isolation, PII protection, policy‑as‑code.
Required Education & Certifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical field.
- Optional industry certifications: AWS Certified Machine Learning – Specialty, AWS Certified DevOps Engineer – Professional, or equivalent.