- Company Name: ExaTech Inc
- Job Title: Site Reliability Engineer Architect
- Job Description:
**Job Title:** Site Reliability Engineering (SRE) Architect
**Role Summary:**
Strategic technical leader responsible for designing and evolving reliable, scalable, and cost‑effective infrastructure and application patterns. Sets SRE standards, blueprints, and frameworks, drives automation, and embeds reliability principles across development and platform teams.
**Expectations:**
- Define and champion SRE best practices, SLIs/SLOs, and error‑budget management.
- Provide architectural guidance and mentorship to engineering teams.
- Lead incident postmortems and translate findings into systemic improvements.
- Evaluate, prototype, and adopt tools/technologies that enhance reliability and operational efficiency.
**Key Responsibilities:**
- Architect highly available, secure, and cost‑optimized solutions on AWS.
- Create and evangelize SRE standards for service design, deployment, monitoring, and readiness.
- Assess and advance observability maturity (Dynatrace, Prometheus, Grafana, ELK/EFK, Jaeger, OpenTelemetry).
- Design automation to reduce toil (IaC, CI/CD, automated remediation, chaos engineering).
- Advise development and platform teams on reliability, scalability, and performance during design phases.
- Conduct architectural reviews and production readiness assessments.
- Lead blameless postmortems, prioritize root‑cause remediation, and implement resilience patterns (circuit breaking, rate limiting, graceful degradation).
- Mentor SREs and engineers, fostering a culture of reliability and continuous improvement.
**Required Skills:**
- Proven architectural experience in reliability, scalability, and performance engineering.
- Deep knowledge of SRE concepts: SLIs/SLOs, error budgets, toil reduction, incident management.
- Expertise with AWS services (compute, networking, security, storage).
- Strong skills in containers, orchestration, and serverless platforms (Docker, Kubernetes).
- Hands‑on experience building observability stacks (Dynatrace, Prometheus, Grafana, ELK/EFK, Jaeger, OpenTelemetry).
- Proficient in scripting/programming (Python, Go, Bash) for automation and tool development.
- Excellent analytical, problem‑solving, and strategic thinking abilities.
- Strong communication, collaboration, and leadership capabilities.
**Required Education & Certifications:**
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- Preferred: AWS Certified Solutions Architect - Professional, Certified Kubernetes Administrator (CKA), or similar cloud or container certifications.