Key Responsibilities
- Optimize inference engines for GPU acceleration
- Implement quantization and pruning techniques for model efficiency
- Develop low-latency inference pipelines
- Profile and benchmark inference workloads to identify and eliminate performance bottlenecks
- Collaborate with ML engineers to integrate optimized models
- Research novel optimization techniques for large language models
Requirements
- 3+ years of experience in systems programming or ML optimization
- Expertise in C++ and GPU computing (CUDA/OpenCL)
- Experience with model quantization and pruning
- Strong understanding of computer architecture
- Familiarity with large language models (LLMs) and transformer architectures