- Company Name
- Alibaba Cloud
- Job Title
- Site Reliability Engineer
- Job Description
-
**Role Summary**
The Site Reliability Engineer (SRE) is responsible for designing, deploying, and maintaining a highly available, high‑performance model‑service platform. The role focuses on system reliability, incident response, and automation to ensure SLA adherence and continuous improvement of the platform’s operational excellence.
**Expectations**
- Deliver 99.9% uptime and rapid resolution of platform incidents.
- Participate in on‑call rotation and perform root‑cause analysis.
- Maintain and enhance monitoring, alerting, and logging for full observability.
- Create automated pipelines for deployment, scaling, and fault recovery.
- Communicate effectively with multilingual teams, using both Chinese and English.
**Key Responsibilities**
- Oversee end‑to‑end deployment, operation, maintenance, and continuous improvement of the platform.
- Monitor system health and respond to alerts; diagnose network, service, and hardware failures to meet SLA targets.
- Design and optimize metrics, log collection, and alerting strategies to improve observability.
- Lead emergency response and incident handling, conduct RCA, and implement long‑term fixes.
- Investigate customer‑reported API QoS issues, collaborating with developers to resolve latency, performance, and infrastructure bottlenecks.
- Develop and maintain tools/scripts (Python/Go) for automated deployment, scaling, fault recovery, and operational workflows.
- Build diagnostic toolchains to accelerate issue resolution and enhance customer satisfaction.
**Required Skills**
- 3+ years of SRE, DevOps, or backend development experience in distributed systems.
- Proficiency in at least one modern programming language: Python, Go, Java, or C++.
- Strong knowledge of Linux, TCP/HTTP, and relational and NoSQL databases.
- Experience with cloud computing (Alibaba Cloud preferred) and cloud‑native architecture.
- Proficient with containers and Kubernetes cluster operations.
- Familiarity with monitoring and observability tools (Prometheus, Grafana, Loki) and alerting mechanisms.
- Understanding of service mesh (Istio), container networking (Calico), and fault‑tolerance concepts.
- Excellent incident‑management and problem‑solving skills, with on‑call readiness under pressure.
- Bilingual: fluent in Chinese and English.
**Required Education & Certifications**
- Bachelor’s degree (or higher) in Computer Science, Software Engineering, or a related field.
- Relevant certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD), Cloud Foundation Associate, or equivalent.
---