- Company Name
- Oxenham Group
- Job Title
- Site Reliability Engineer
- Job Description
-
**Job Title**
Senior Site Reliability Engineer
**Role Summary**
Senior individual contributor responsible for designing, implementing, and operating highly reliable, secure, and cost‑effective AWS‑based infrastructure. Partners with engineering, support, and leadership to embed SRE best practices, improve observability, automate operations, and drive long‑term reliability and scalability of complex applications.
**Expectations**
- Maintain stable, observable, and resilient production systems.
- Lead incident response and root‑cause analysis, and deliver measurable reliability improvements.
- Ensure infrastructure scales predictably while optimizing cloud spend.
- Provide clear standards, automation, and tooling to support engineering teams.
- Embed reliability into the software delivery lifecycle and act as a change agent for technical strategy.
**Key Responsibilities**
- Design and evolve large‑scale AWS environments (EC2, ECS/EKS, Lambda, RDS, S3, IAM, VPC, CloudWatch).
- Build and manage infrastructure as code using Terraform, CloudFormation, or equivalent tools.
- Lead platform implementations and major reliability initiatives.
- Optimize cloud cost, performance, and availability.
- Implement and mature SRE practices: monitoring, alerting, logging, incident response, disaster recovery, capacity planning, and automation.
- Develop automation to reduce operational toil and improve efficiency.
- Provide advanced support for cloud‑hosted and hybrid platforms.
- Mentor junior and mid‑level engineers, providing technical guidance and best‑practice leadership.
- Collaborate with engineering, QA, security, and business teams to integrate reliability throughout the delivery process.
- Ensure compliance with legal, regulatory, and security requirements.
- Participate in on‑call rotation and work to reduce alert fatigue.
**Required Skills**
- 7+ years delivering technical solutions in production environments.
- 3+ years hands‑on SRE experience.
- Deep expertise in AWS architecture, services, and scaling.
- Strong Infrastructure as Code skills (Terraform, CloudFormation).
- Proficiency in automation, scripting, and CI/CD pipelines.
- Ability to diagnose and resolve complex distributed system issues.
- Experience leading incident management and post‑incident improvements.
- Excellent written and verbal communication with technical and non‑technical audiences.
- Ability to work independently, prioritize across multiple concurrent tasks, and apply strong problem‑solving skills.
**Required Education & Certifications**
- Bachelor’s degree in Computer Science, Information Systems, Engineering, or equivalent practical experience.
- Preferred: AWS certifications (Solutions Architect, DevOps Engineer, SysOps Administrator).