Key Responsibilities
- Design and operate production LLM serving stacks (vLLM, TGI, Triton) and vector databases (Pinecone, Weaviate, Qdrant)
- Build evaluation harnesses for AI features covering accuracy, hallucination, regression, latency, and cost
- Own prompt registries, prompt versioning, model routing, A/B testing, and rollback paths, treating each as a production artifact
- Instrument AI workflows with LangSmith, OpenTelemetry, Prometheus, and Grafana; define SLOs and lead incident response
- Drive cost discipline through batching, prompt caching, smaller-model routing, and inference optimization
- Mentor engineers and set team standards for the use of AI-assisted engineering tools (Claude, Cursor)
Requirements
- 5+ years of engineering experience, including 2+ years in MLOps or production AI infrastructure
- Hands-on production ownership of LLM/ML systems at scale, including on-call duty and scaling decisions
- Proficiency in Python, FastAPI, Docker, Kubernetes, and AWS (EC2, S3, EKS, IAM)
- Experience with inference tooling (vLLM, TGI, Triton) and with evaluation and observability tooling (LangSmith, Prometheus)
- Strong written and verbal communication skills, with the ability to defend architectural decisions and mentor teams