- Company Name
- NVIDIA
- Job Title
- Senior Software Engineer - Storage
- Job Description
**Job Title**: Senior Software Engineer – Storage
**Role Summary**
Design, develop, and operate exascale distributed systems that manage data, compute, and networking for large‑scale AI workloads. Lead tooling and automation for orchestrating thousands of GPUs and petabytes of storage across multi‑region clusters, ensuring reliability, performance, and compliance within the Managed AI Research Superclusters (MARS) infrastructure.
**Expectations**
- Architect scalable, high‑performance infrastructure that supports frontier AI/ML research.
- Collaborate with AI/ML researchers, security, networking, and platform teams to translate research requirements into robust solutions.
- Drive continuous improvement in system reliability, performance, and observability to meet exascale standards.
- Participate in design reviews, contribute to architecture discussions, and influence NVIDIA’s AI infrastructure stack.
**Key Responsibilities**
- Design and maintain distributed systems for large‑scale AI workloads.
- Build automation for workload orchestration across thousands of GPUs and petabytes of storage in multi‑region clusters.
- Integrate high‑performance storage environments (e.g., Lustre, GPFS, BeeGFS).
- Implement compute scheduling and orchestration using Slurm, Kubernetes, or LSF.
- Develop and operationalize system monitoring, logging, and observability pipelines.
- Ensure infrastructure meets security, compliance, and data‑management standards.
- Stay current on distributed systems research, AI frameworks, and exascale computing advances.
**Required Skills**
- 8+ years of experience developing and operating large‑scale distributed systems or HPC environments.
- Proficiency in C++, Python, or Go; proven track record of building production‑ready software.
- Deep understanding of distributed systems principles, data management, and orchestration frameworks.
- Hands‑on experience with high‑performance storage (Lustre, GPFS, BeeGFS) and compute scheduling (Slurm, Kubernetes, LSF).
- Familiarity with cloud environments (Azure, AWS, GCP) and infrastructure automation tools (Terraform, Ansible, etc.).
- Strong problem‑solving skills, an ownership mindset, and collaborative communication.
**Required Education & Certifications**
- Bachelor’s degree (or equivalent experience) in Computer Science, Computer Engineering, or a related technical field.
- Graduate degree (MS/PhD) in Computer Science, Distributed Systems, or related area is highly desirable.
---
California, United States
Remote
Senior
03-12-2025