- Company Name
- Avenue Code
- Job Title
- Sr. Site Reliability Engineer (SRE)
- Job Description
-
Job Title: Sr. Site Reliability Engineer (SRE)
Role Summary: Design, build, and operate a cloud platform, ensuring reliability, performance, security, and cost efficiency for production systems.
Expectations: Collaborate across product and engineering teams, own incident response, mentor developers in DevOps practices, and continuously improve service health and cloud spend.
Key Responsibilities:
- Automate provisioning and CI/CD with Terraform, GitHub Actions, ArgoCD, and related tooling.
- Define SLIs/SLOs, manage error budgets, and create dashboards and alerts for proactive monitoring.
- Enforce least‑privilege IAM, automate vulnerability scanning, and maintain audit logs for compliance.
- Instrument services with metrics, logs, and distributed tracing; support custom metrics and dashboards.
- Lead on‑call rotations, conduct incident investigations and post‑mortems, and implement the resulting improvements.
- Optimize cloud costs through tagging, right‑sizing, and data‑driven decisions.
- Produce runbooks, standards, and best‑practice guides; coach teams on DevOps, reliability, and security patterns.
Required Skills:
- 5+ years of experience operating production-critical systems.
- Deep expertise in AWS cloud and cloud‑native best practices.
- Experience with container orchestration, including managing Kubernetes (EKS, GKE) at scale.
- Proficiency with Terraform for declarative infrastructure.
- Knowledge of Redis, PostgreSQL, VPC, VPN, load balancing, and cloud networking.
- Strong Git workflow and CI/CD integration skills, and a solid understanding of web/network protocols (HTTP, REST, TLS, DNS).
- Fluent English (written & spoken).
Required Education & Certifications:
- Bachelor’s degree or equivalent in Computer Science, Engineering, or a related field.
Mountain View, United States
Hybrid
Mid level
04-12-2025