Skills

Communication, Leadership, Incident Response, CI/CD, DevOps, Monitoring, Configuration Management, Ansible, Organization, CI/CD Pipelines, Grafana

Job Specifications

About the Role:

We are supporting a Site Reliability Engineering (SRE) organization that functions as a Center of Excellence for observability and reliability across a large enterprise environment. This team owns the strategy, standards, and tooling that enable application and infrastructure teams to operate reliable, scalable systems.

**This is not a traditional DevOps or platform engineering role. The focus is on observability, reliability outcomes, and enablement, not building new infrastructure or managing security.**

Top 3 Skills:

1. Hands-on Observability & Monitoring: Strong experience building and maintaining dashboards (Grafana is key), working with logs, metrics, and traces, and supporting day-to-day observability needs for application and infrastructure teams. Candidates should be comfortable using monitoring tools to identify and troubleshoot issues.

2. Core SRE Practices & Operational Reliability: Working knowledge of SRE fundamentals such as SLIs, SLOs, alerting, and incident response. Experience improving alert quality, supporting on-call teams, and helping reduce MTTR through better instrumentation and operational practices.

3. Collaboration & Enablement Mindset: Ability to work closely with multiple engineering teams, understand their systems, and help them adopt better reliability and observability practices. Clear communication and a team-first attitude are important, as this role supports and enables others rather than owning a single application.

Key Responsibilities:

Partner with application and infrastructure teams to define and implement SLIs, SLOs, and error budgets

Design and standardize observability dashboards used by engineering teams and leadership

Improve alert quality, reduce noise, and drive MTTR reduction

Support incident analysis and reliability improvements using metrics and post-incident learnings

Lead or contribute to observability strategy using tools such as Grafana (required) and one or more of Sumo Logic, AppDynamics, New Relic, Dynatrace, or Datadog

Enable teams to effectively use observability and monitoring platforms

Create repeatable standards, templates, and best practices

Support automation initiatives using Ansible Automation Platform

Automate operational tasks and configuration management to reduce manual toil

Collaborate with CI/CD and platform teams where automation intersects with reliability

Required Qualifications:

5+ years of experience in Site Reliability Engineering (SRE) or reliability-focused roles

Hands-on experience with observability and monitoring tools

Strong experience with Grafana (dashboard creation, metrics visualization)

Practical understanding of SLOs, SLIs, error budgets, and reliability metrics

Ability to clearly communicate technical concepts to engineering and non-engineering stakeholders

Experience working across multiple teams in an enablement or advisory capacity

Preferred Qualifications:

Experience with Ansible Automation Platform

Exposure to CI/CD pipelines and infrastructure automation

Experience in large-scale or enterprise environments

Prior work in an SRE Center of Excellence or shared services model

About the Company

Astreya is the leading IT solutions provider for some of the world's most recognizable and innovative organizations. Our journey started in 2001 in the heart of Silicon Valley and now reaches thirty-three countries with over 2,200 IT professionals. We enable businesses to make better decisions, achieve operational efficiency, and gain a competitive edge. The Astreya advantage is centered around focus and clear vision, world-class talent, and innovative technology: Creativity is in our DNA. Our dedicated Software and Service Inno...