LinkedIn

Distinguished Software Engineer, Reliability Infra

Hybrid

Mountain View, United States

$390,000/year

Senior

Full Time

15-10-2025


Skills

Communication, Leadership, Incident Response, CI/CD, Monitoring, Architecture, Prometheus, Grafana

Job Specifications

Company Description

LinkedIn is the world's largest professional network, built to create economic opportunity for every member of the global workforce. Our products help people make powerful connections, discover exciting opportunities, build necessary skills, and gain valuable insights every day. We're also committed to providing transformational opportunities for our own employees by investing in their growth. We aspire to create a culture that's built on trust, care, inclusion, and fun, where everyone can succeed.

Job Description

At LinkedIn, our approach to flexible work is centered on trust and optimized for culture, connection, clarity, and the evolving needs of our business. The work location of this role is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team.

This role will be based in Sunnyvale, CA or San Francisco, CA.

Responsibilities

Serve as a senior technical leader driving the long-term reliability and observability strategy across LinkedIn's infrastructure
Re-architect LinkedIn's backend systems to enable granular failure domains and reduce the blast radius of incidents
Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
Partner closely with engineers across many different disciplines to raise the bar for operational excellence and incident response
Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
Define and own the roadmap for bringing observability to critical user journeys in LinkedIn's products, to help capture and improve the experience of LinkedIn's members and customers
Spearhead a multi-year initiative to transition LinkedIn's infrastructure to a regionalized model with localized failover, enhancing both scalability and availability
Lead technical discussions on the future of Engineering at LinkedIn, including what the function should evolve into over the next 3-5 years
Deliver key insights and executive-level reporting across cross-functional engineering teams to enable the right business decisions about improving the quality and reliability of our services and products
Act as a force multiplier by mentoring engineers, influencing technical direction across orgs, and contributing deeply to culture, hiring, and technical excellence
Lead incident response and post-incident reviews to identify root causes and implement preventive measures
Develop and maintain incident management processes and procedures to ensure timely resolution of issues and minimize impact on customers

Qualifications

Basic Qualifications

15+ years of software engineering experience
8+ years focused on infrastructure, reliability-focused engineering, or distributed systems

Preferred Qualifications

Hands-on experience with large-scale incident response, root cause analysis, and resiliency engineering
Strong communication and cross-functional collaboration skills, with experience influencing across multiple orgs and leadership levels
Proven success designing and leading architectural transformations at internet-scale companies
Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
Experience with multi-region architecture, capacity planning, and failover strategies in large-scale cloud or hybrid environments
Background in CI/CD, platform reliability, and automation of ops-heavy systems
Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana) and service mesh architecture
Track record of setting long-term technical strategy and driving systemic improvements in availability and performance
Previous experience in a Distinguished Engineer or equivalent role at a high-growth or web-scale technology company

Suggested Skills

Site Reliability Engineering (SRE)
Leadership
Large scale infrastructure

LinkedIn is committed to fair and equitable compensation practices. The pay range for this role is $238,000 to $390,000. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. This may be different in other locations due to differences in the cost of labor. The total compensation package for this position may also include annual performance bonus, stock, benefits and/or other applicable incentive compensation plans. For more information, visit https://careers.linkedin.com/benefits

Additional Information

Equal Opportunity Statement

We seek candidates with a wide range of perspectives and backgrounds and we are proud to be an equal opportunity employer. LinkedIn considers qualified applicants without regard to race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or a

About the Company

Founded in 2003, LinkedIn connects the world's professionals to make them more productive and successful. With more than 1 billion members worldwide, including executives from every Fortune 500 company, LinkedIn is the world's largest professional network. The company has a diversified business model with revenue coming from Talent Solutions, Marketing Solutions, Sales Solutions, and Premium Subscriptions products. Headquartered in Silicon Valley, LinkedIn has offices across the globe.