About the Role
NVIDIA is seeking a highly skilled and experienced Senior Staff Engineer for our Platform & Infrastructure team. In this role, you will be instrumental in designing, building, and maintaining our next-generation cloud infrastructure and platform services. You will work on highly scalable, reliable, and performant systems that support a wide range of NVIDIA's groundbreaking products and services.
What You'll Be Doing
- Design, develop, and maintain highly scalable, reliable, and secure cloud infrastructure and platform services (IaaS, PaaS, SaaS).
- Lead the adoption and implementation of cloud-native technologies such as Kubernetes, Docker, and serverless architectures across various cloud providers (AWS, Azure, GCP).
- Develop and implement automation tools and frameworks for infrastructure provisioning (e.g., Terraform), configuration management (e.g., Ansible), and application deployment.
- Optimize system performance, scalability, and cost-efficiency through continuous monitoring, analysis, and tuning.
- Collaborate closely with product and engineering teams to understand their requirements and provide robust infrastructure solutions.
- Establish and promote best practices for security, reliability, and operational excellence.
- Mentor junior engineers and contribute to the overall growth and technical excellence of the team.
- Troubleshoot complex issues across the entire stack, from infrastructure to application level.
- Participate in an on-call rotation to ensure the continuous availability and performance of critical systems.
What We Need to See
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
- 10+ years of experience in software development, including at least 5 years in platform engineering, infrastructure development, or site reliability engineering roles.
- Deep expertise in at least one major cloud platform (AWS, Azure, or GCP) and familiarity with others.
- Proficiency in containerization (Docker) and container orchestration (Kubernetes).
- Extensive experience with Infrastructure as Code (Terraform, CloudFormation) and configuration management tools (Ansible, Chef, Puppet).
- Strong programming skills in one or more languages: Python, Go, Rust, C++, Java, JavaScript.
- Solid understanding of distributed systems, microservices architectures, and their challenges.
- Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, GitHub Actions).
- Familiarity with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK stack).
- Excellent problem-solving skills and a strong ability to diagnose and resolve complex technical issues.
- Strong communication and collaboration skills, with the ability to work effectively in a fast-paced, dynamic environment.
- Proven leadership experience, including mentoring engineers and leading technical projects.
Ways to Stand Out From the Crowd
- Experience with database technologies (SQL and NoSQL).
- Knowledge of networking concepts and security best practices in cloud environments.
- Contributions to open-source projects.
- Experience with large-scale data processing or machine learning infrastructure.