thrively

1 Job

7 Employees

About the Company

Thrively was founded on the conviction that organizations need strong recruiting and HR partners more than ever. We are a team of seasoned professionals who have all held senior leadership positions, sat on executive teams, and worked closely with boards. We've had to navigate scaling up quickly and scaling down. In a post-pandemic era, we understand the stress placed on business leaders. Talent is demanding more work flexibility and authenticity from leaders than ever before, while leaders feel pressure from investors to stay lean and to reach and maintain profitability. It's no longer growth at all costs, and everyone is trying to adjust. We know how to partner with leaders in this new dynamic to put the right talent in the right roles at the right time.

Listed Jobs

Company Name: thrively
Job Title: Principal Site Reliability Engineer
Job Description:

**Job Title**
Principal Site Reliability Engineer

**Role Summary**
Design, own, and evolve the high‑availability, scalable, and secure cloud runtime for a regulated B2B SaaS platform. Act as the senior systems and architecture authority, ensuring reliability, performance, and cost‑effective growth while aligning with product, security, and IT teams.

**Expectations**
- Lead end‑to‑end architecture decisions for cloud‑native services, treating infrastructure and reliability as core product features.
- Drive proactive risk identification, system resilience, and failure isolation rather than reactive ticket handling.
- Foster cross‑functional collaboration to embed reliability into product roadmaps and delivery cycles.

**Key Responsibilities**
- Own the platform’s cloud‑runtime design, focusing on high availability, predictable scaling, fault isolation, and cost control in regulated environments.
- Develop and maintain detailed architecture documentation, diagrams, and operational runbooks.
- Partner with product, engineering, security, and IT/network teams to translate reliability requirements into concrete system designs.
- Lead incident response, root‑cause analysis, and post‑incident improvements with a preventative focus.
- Define and enforce monitoring, observability, and alerting strategies that meet SLAs and customer expectations.
- Assess and mitigate systemic risks, such as load‑balancing constraints, network fail‑over scenarios, and third‑party integration dependencies.
- Advise on AKS/Kubernetes deployment practices, scaling policies, and resilience patterns suitable for peak, deadline‑driven workloads.

**Required Skills**
- Principal‑level systems architecture and judgment.
- Deep knowledge of AKS/Kubernetes, including cluster design, deployment strategies, scaling, and failure modes.
- Experience designing and operating cloud‑native SaaS platforms in regulated or high‑reliability contexts.
- Proficient in distributed systems concepts: traffic ingress/egress, load balancing, inter‑service communication, and network‑related failure propagation.
- Strong incident‑management, post‑mortem, and preventative improvement skills.
- Expertise in monitoring, observability, and metrics-driven reliability.
- Ability to translate technical reliability requirements into actionable product and engineering outcomes.
- Excellent documentation, communication, and cross‑team collaboration abilities.

**Required Education & Certifications**
- Bachelor’s (or higher) degree in Computer Science, Engineering, or related technical field.
- Optional: Relevant cloud or Kubernetes certifications (e.g., Kubernetes Administrator / Application Developer, Azure Kubernetes Service, AWS Certified DevOps Engineer, Google Cloud Professional Cloud DevOps Engineer).