**Company:** Rivian
**Job Title:** Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems
**Role Summary:**
Own the reliability, scalability, and security of digital factory systems across compute, network, and application layers. Drive platform engineering, observability, and incident response for hybrid and on‑prem environments, ensuring production readiness, enforcing cost guardrails, and continuously improving MTTR and availability.
**Expectations:**
- Deliver reliable, cost‑effective platform foundations in a 24 × 7 manufacturing setting.
- Lead incident response and post‑mortem processes to eliminate repeat failures.
- Collaborate with Factory IT, Manufacturing Engineering, Security, and Networking to implement pragmatic, operable designs.
- Mentor peers and promote reliability best practices across teams.
**Key Responsibilities:**
- Design and evolve platform infrastructure (Kubernetes/EKS, vSphere/ESXi, Linux/Windows, industrial PCs).
- Define and enforce production‑readiness standards (health checks, SLO/SLI, runbooks, deployment safety).
- Implement IaC and configuration automation (Terraform/Terragrunt, Ansible) for provisioning, secrets, and policy enforcement.
- Build and maintain end‑to‑end telemetry (metrics, logs, traces) using Prometheus/Grafana, Loki/Tempo, Datadog, and Splunk; create dashboards and alerting frameworks.
- Develop internal tooling (CLI/SDKs, operators, remediation bots) to automate detection and remediation.
- Act as a technical incident responder; lead triage, stabilization, and post‑incident reviews.
- Conduct on‑call readiness drills, escalation policy reviews, and reliability simulations.
**Required Skills:**
- Proven SRE/Platform/DevOps experience with ownership of availability, performance, and cost.
- Strong expertise in Kubernetes/EKS, container networking, AWS services, vSphere/ESXi, and Linux and Windows Server administration.
- Deep knowledge of observability stacks (Prometheus, Grafana, Loki, Tempo, Datadog, Splunk) and SLO/error‑budget practices.
- Proficiency in IaC (Terraform/Terragrunt), configuration management (Ansible), scripting (Python/Bash), GitOps, CI/CD (GitLab preferred), and policy‑as‑code.
- Demonstrated incident leadership in 24 × 7 environments with clear communication under pressure.
- Ability to partner with cross‑functional teams and explain technical trade‑offs simply.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field (or equivalent practical experience).
- Relevant certifications (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) are a plus but not mandatory.