Job Specifications
About the Role:
We are supporting a Site Reliability Engineering (SRE) organization that functions as a Center
of Excellence for observability and reliability across a large enterprise environment. This team
owns the strategy, standards, and tooling that enable application and infrastructure teams to
operate reliable, scalable systems.
**This is not a traditional DevOps or platform engineering role. The focus is on observability,
reliability outcomes, and enablement, not building new infrastructure or managing security.**
Top 3 Skills:
1. Hands-on Observability & Monitoring: Strong experience building and maintaining
dashboards (Grafana is key), working with logs, metrics, and traces, and supporting
day-to-day observability needs for application and infrastructure teams. Candidates should be
comfortable using monitoring tools to identify and troubleshoot issues.
2. Core SRE Practices & Operational Reliability: Working knowledge of SRE fundamentals
such as SLIs, SLOs, alerting, and incident response. Experience improving alert quality,
supporting on-call teams, and helping reduce MTTR through better instrumentation and
operational practices.
3. Collaboration & Enablement Mindset: Ability to work closely with multiple engineering teams,
understand their systems, and help them adopt better reliability and observability practices.
Clear communication and a team-first attitude are important, as this role supports and
enables others rather than owning a single application.
Key Responsibilities:
Partner with application and infrastructure teams to define and implement SLIs, SLOs, and
error budgets
Design and standardize observability dashboards used by engineering teams and
leadership
Improve alert quality, reduce noise, and drive MTTR reduction
Support incident analysis and reliability improvements using metrics and post-incident
learnings
Lead or contribute to observability strategy using tools such as Grafana (required) and one or
more of Sumo Logic, AppDynamics, New Relic, Dynatrace, or Datadog
Enable teams to effectively use observability and monitoring platforms
Create repeatable standards, templates, and best practices
Support automation initiatives using Ansible Automation Platform
Automate operational tasks and configuration management to reduce manual toil
Collaborate with CI/CD and platform teams where automation intersects with reliability
Required Qualifications:
5+ years of experience in Site Reliability Engineering (SRE) or other reliability-focused roles
Hands-on experience with observability and monitoring tools
Strong experience with Grafana (dashboard creation, metrics visualization)
Practical understanding of SLOs, SLIs, error budgets, and reliability metrics
Ability to clearly communicate technical concepts to engineering and non-engineering
stakeholders
Experience working across multiple teams in an enablement or advisory capacity
Preferred Qualifications:
Experience with Ansible Automation Platform
Exposure to CI/CD pipelines and infrastructure automation
Experience in large-scale or enterprise environments
Prior work in an SRE Center of Excellence or shared services model
About the Company
Astreya is the leading IT solutions provider for some of the world's most recognizable and innovative organizations. Our journey started in 2001 in the heart of Silicon Valley and now reaches thirty-three countries with over 2,200 IT professionals. We enable businesses to make better decisions, achieve operational efficiency, and gain a competitive edge. The Astreya advantage is centered around focus and clear vision, world-class talent, and innovative technology: Creativity is in our DNA. Our dedicated Software and Service Inno...