Key Responsibilities
- Optimize and deploy ML models for high-performance inference at scale
- Develop low-latency systems for real-time AI applications
- Implement quantization, pruning, and other optimization techniques
- Collaborate with hardware teams to maximize utilization of the target hardware
- Benchmark and profile inference performance across different platforms
- Ensure reliability and efficiency of production inference pipelines
Requirements
- 3+ years of experience in systems programming or ML inference optimization
- Expertise in C++ and Python for performance-critical applications
- Experience with GPU computing and CUDA programming
- Knowledge of model optimization techniques and hardware acceleration
- Strong debugging and profiling skills for performance tuning