Principal Software Architect - Redmond Washington
Company: NVIDIA
Location: Redmond, Washington
Posted On: 04/26/2024
We are now looking for a Principal Software Architect for AI and HPC. At NVIDIA, we are advancing the frontiers of AI capabilities. We seek an expert in high-performance computing and AI to design and develop software resiliency features for training AI models on the world's most powerful and largest supercomputers.

In this role, you will outline mission requirements for ultra large-scale AI supercomputers, thoroughly investigate and evaluate RAS feature designs, establish software requirements and evaluation metrics, and oversee the complete implementation of RAS features in software.

As a leader in HPC and AI software development, you will interact with multiple teams across the organization. Your responsibilities include conducting regular reviews and check-ins with execution teams, ensuring the timely delivery of essential RAS software features such as checkpoint-recovery logic, error detection and attribution, error containment, SDC detection, and other related RAS elements. Leading cross-organizational efforts among various stakeholders and teams, you will coordinate priorities with senior leadership, provide timely updates, and ensure adequate resourcing for the projects.

What You'll Be Doing:
- Collaborate with both internal and external customers and partners to define innovative Reliability, Availability, and Serviceability (RAS) requirements and objectives for present and future AI supercomputing products.
- Oversee and guide the development of RAS features across the entire AI stack, encompassing aspects from job-level scheduling and AI application frameworks (such as PyTorch), down to driver-level and hardware health monitoring on GPUs.
- Develop and maintain comprehensive software roadmaps, ensuring alignment with diverse engineering teams and synchronizing with engineering and product leadership for strategic coherence.
- Drive successful implementation and execution of RAS features in software, with demonstrable improvements in end-to-end metrics such as availability during large-scale training runs.

What We Need to See:
- A Master's or Ph.D. in Computer Science, Electrical or Computer Engineering from a reputed university, or equivalent professional experience.
- 15+ years of industry experience in systems architecture or related fields, demonstrating a deep understanding of system complexities.
- Proven ability to work and communicate effectively in a collaborative environment, bridging multiple engineering disciplines.
- At least 5 years of hands-on experience in software development, preferably in high-complexity projects involving HPC or AI.

Ways to Stand Out From the Crowd: