Skills

Communication, Incident Response, CI/CD, DevOps, Kubernetes, Monitoring, Azure Kubernetes Service (AKS), Networking, Architecture, Systems Architecture, Windows, Databases, Organization, Azure, CI/CD Pipelines

Job Specifications

(Systems Architecture & Cloud Runtime Ownership)

The Opportunity

This is not a traditional SRE, DevOps, or infrastructure role.

Our client is a B2B SaaS company building software that powers highly regulated financial, legal, and compliance workflows for enterprise customers. Their platforms support critical processes such as regulatory filings, transactions, and disclosures—often under strict deadlines where availability, correctness, and performance are non-negotiable.

When their systems are running, customers rarely think about them. When they’re not, the consequences are immediate and material. That’s why reliability, scalability, and operational correctness are treated as product features, not operational afterthoughts.

They are looking for a Principal Site Reliability Engineer who operates as a system architect and product partner—someone who designs how the product runs in the cloud, not just how infrastructure is maintained.

If you enjoy thinking deeply about how complex, high-stakes systems behave in production, and you want ownership over the architectural decisions that directly impact customers and the business, this role is likely to be very energizing.

Why This Role Exists

The company already has:

Product and application engineering teams
Cloud infrastructure and CI/CD pipelines
Monitoring and observability tooling
IT and network teams supporting corporate and security needs

What they don’t yet have is a principal-level systems owner who can step back and reason holistically about the platform and answer questions like:

How should this product be architected to behave reliably during peak, deadline-driven usage?
What happens when traffic spikes, dependencies fail, or regions degrade?
Where are the real failure modes, and how do we design around them before customers feel them?
How do architecture decisions affect customer trust, engineering velocity, and long-term cost?

This role exists because the organization has outgrown tactical SRE support and now needs senior architectural ownership of its cloud runtime.

What You’ll Do

As a Principal Site Reliability Engineer, you will:

Own the end-to-end design of how the SaaS platform runs in the cloud
Treat infrastructure, reliability, and scalability as core product concerns
Design and evolve architecture that supports:
    High availability and resilience
    Predictable scaling during critical business windows
    Failure isolation and blast-radius reduction
    Cost-aware growth in a regulated environment
Act as the most senior infrastructure and systems thinker in the organization
Partner closely with:
    Product teams to align reliability expectations with customer impact
    Engineering teams to ensure the platform enables safe, rapid delivery
    Security and IT/network teams while still owning system-level decisions
Lead incident response and post-incident analysis with a strong focus on prevention
Create and maintain clear architecture documentation and diagrams
Proactively identify systemic risks and improvements rather than reacting to tickets

This is a role defined by judgment, ownership, and design, not task execution.

AKS Is Central to This Role

The platform runs on Azure Kubernetes Service (AKS), and AKS is treated as the runtime of the product, not “just infrastructure.”

Strong candidates don’t just say “I’ve used Kubernetes.”

They naturally think about:

How services should be designed assuming AKS is the runtime
How scaling behavior affects downstream systems and databases
What happens during pod, node, or regional failure
How traffic flows into, through, and between services
How deployment strategies affect risk during high-stakes periods

If you enjoy designing systems with Kubernetes as a first-class architectural constraint, this role will play directly to your strengths.

Networking as a System Concern

This role requires comfort reasoning about networking within distributed cloud systems, including:

Traffic ingress and egress
Load balancing and routing
Service-to-service communication
Third-party integrations and dependencies
Network-related failure modes and how they surface in production

You don’t need to be a pure network engineer—but you do need to understand how network behavior directly impacts reliability and customer experience.

What This Role Is Not

This role is not:

A ticket-driven operations role
A Kubernetes operator or tool administrator
A purely IT or corporate-networking position
A role where architecture is handed to you to maintain
A people-management role focused on headcount scaling

Candidates who prefer predefined systems and narrow execution responsibilities tend to struggle here.

What We’re Looking For

We’re looking for someone who brings:

Principal-level systems thinking and architectural judgment
Experience designing and operating cloud-native SaaS platforms, ideally in regulated or high-reliability environments
Deep AKS / Kubernetes architectural fluency
Comfort reasoning about distributed systems an

About the Company

Thrively exists out of a conviction that organizations need strong recruiting and HR partners like never before. We are a team of seasoned professionals who have all held senior leadership positions, sat on executive teams, and worked closely with boards. We've had to navigate scaling up quickly and scaling down. In a post-pandemic era we understand the stress that is placed on business leaders. Talent is demanding more work flexibility and authenticity from their leaders than ever before while leaders are feeling the demand...