About the team
NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a 'learning machine' that constantly evolves by adapting to new opportunities that are hard to solve, require parallel computing, and have a big impact. This is our life's work: to amplify human imagination and intelligence.
We are building a new team to drive the architecture and development of systems software and solutions that enable researchers and developers to easily and effectively utilize NVIDIA's state-of-the-art AI infrastructure and products. We need passionate, hardworking, and creative people to help us build a world-class system!
What you'll be doing:
- Designing and implementing distributed systems software for large-scale AI infrastructure.
- Developing software for scheduling and resource management across many GPUs and servers.
- Optimizing performance and ensuring low-latency operation of critical AI workloads.
- Collaborating with hardware and software teams to integrate new technologies and improve overall system efficiency.
- Debugging and resolving complex issues in a distributed environment.
- Participating in architectural discussions, design reviews, and code reviews.
What we need to see:
- BS or MS degree in Computer Science, Computer Engineering, or a related field (or equivalent experience).
- 8+ years of experience in software development, with a focus on system-level programming.
- Expertise in Python and C++.
- Strong background in distributed systems, including experience with scheduling, resource management, and inter-process communication.
- Solid understanding of Linux operating systems, including kernel-level programming concepts.
- Experience with performance optimization and debugging tools.
- Proven ability to work independently and as part of a collaborative team.
Ways to stand out from the crowd:
- Experience with CUDA programming.
- Familiarity with containerization technologies (e.g., Docker, Kubernetes) and workload managers (e.g., SLURM).
- Understanding of machine learning frameworks and workflows.
- Experience with cloud technologies (e.g., AWS, Azure, GCP).
- Strong problem-solving skills and a passion for building high-performance, scalable systems.