Key Responsibilities
- Optimize inference engines for maximum throughput and minimal latency
- Develop GPU-accelerated algorithms for large language models
- Implement model quantization and pruning techniques
- Profile and benchmark inference performance across hardware platforms
- Collaborate with hardware teams to maximize accelerator utilization
- Create tools for automated performance testing and validation
Requirements
- 5+ years of experience in systems programming or performance engineering
- Expertise in C++ and GPU computing (CUDA/OpenCL)
- Experience with model optimization techniques such as quantization and pruning
- Strong debugging and profiling skills
- Knowledge of computer architecture and memory hierarchies