- Company Name
- thrively
- Job Title
- Principal Site Reliability Engineer
- Job Description
**Role Summary**
Design, own, and evolve the high‑availability, scalable, and secure cloud runtime for a regulated B2B SaaS platform. Act as the senior systems and architecture authority, ensuring reliability, performance, and cost‑effective growth while aligning with product, security, and IT teams.
**Expectations**
- Lead end‑to‑end architecture decisions for cloud‑native services, treating infrastructure and reliability as core product features.
- Drive proactive risk identification, system resilience, and failure isolation rather than reactive ticket handling.
- Foster cross‑functional collaboration to embed reliability into product roadmaps and delivery cycles.
**Key Responsibilities**
- Own the platform’s cloud‑runtime design, focusing on high availability, predictable scaling, fault isolation, and cost control in regulated environments.
- Develop and maintain detailed architecture documentation, diagrams, and operational runbooks.
- Partner with product, engineering, security, and IT/network teams to translate reliability requirements into concrete system designs.
- Lead incident response, root‑cause analysis, and post‑incident improvements with a preventative focus.
- Define and enforce monitoring, observability, and alerting strategies that meet SLAs and customer expectations.
- Assess and mitigate systemic risks, such as load‑balancing constraints, network failover scenarios, and third‑party integration dependencies.
- Advise on AKS/Kubernetes deployment practices, scaling policies, and resilience patterns suitable for peak, deadline‑driven workloads.
**Required Skills**
- Principal‑level systems architecture and judgment.
- Deep knowledge of AKS/Kubernetes, including cluster design, deployment strategies, scaling, and failure modes.
- Experience designing and operating cloud‑native SaaS platforms in regulated or high‑reliability contexts.
- Proficient in distributed systems concepts: traffic ingress/egress, load balancing, inter‑service communication, and network‑related failure propagation.
- Strong incident‑management, post‑mortem, and preventative improvement skills.
- Expertise in monitoring, observability, and metrics‑driven reliability.
- Ability to translate technical reliability requirements into actionable product and engineering outcomes.
- Excellent documentation, communication, and cross‑team collaboration abilities.
**Required Education & Certifications**
- Bachelor’s (or higher) degree in Computer Science, Engineering, or a related technical field.
- Optional: Relevant cloud or Kubernetes certifications, such as Certified Kubernetes Administrator (CKA), Certified Kubernetes Application Developer (CKAD), an Azure certification covering Azure Kubernetes Service, AWS Certified DevOps Engineer - Professional, or Google Cloud Professional Cloud DevOps Engineer.