Build a Safer World
TRM Labs provides AI-powered intelligence solutions that help public and private sector agencies investigate and disrupt crime. TRM's platforms enable investigators to trace illicit activity, build cases, and construct operating pictures of threat networks. Leading agencies and businesses worldwide rely on TRM to make the world safer and more secure.
The AI Engineering Team is chartered with enabling next-generation AI applications, with a special focus on Large Language Models (LLMs) and agentic systems. Our mission is to build robust pipelines, high-performance infrastructure, and operational tooling that allow AI systems to be deployed with speed, safety, and scale.
We manage petabyte-scale pipelines, serve models with millisecond-level latency, and provide the observability and governance needed to make AI production-ready. We’re also deeply involved in evaluating and integrating cutting-edge tools in the LLM and agent space — including open-source stacks, vector databases, evaluation frameworks, and orchestration tools that unlock TRM’s ability to innovate faster than the market.
About the Role
As a Senior or Staff ML Systems Engineer – LLM, you’ll be at the core of building and scaling the technical infrastructure for AI/ML systems. You will:
- Build reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools such as Langfuse, GitHub Actions, and experiment tracking.
- Automate model versioning, approval workflows, and compliance checks across environments.
- Build out a modular and scalable AI infrastructure stack — including vector databases, feature stores, model registries, and observability tooling.
- Partner with engineering and data science to embed AI models and agents into real-time applications and workflows.
- Continuously evaluate and integrate state-of-the-art AI tools (e.g. LangChain, LlamaIndex, vLLM, MLflow, BentoML).
- Drive AI reliability and governance, enabling experimentation while ensuring compliance, security, and uptime.
- Build and enhance AI/ML model performance.
- Ensure data accuracy, consistency, and reliability, leading to better model training and inference.
- Deploy infrastructure to support offline and online evaluation of LLMs and agents — including regression testing, cost monitoring, and human-in-the-loop workflows.
- Enable researchers to iterate quickly by providing sandboxes, dashboards, and reproducible environments.
What We’re Looking For
- Write high-quality, maintainable software — primarily in Python, but we value engineering ability over language familiarity.
- Have a strong background in scalable infrastructure, including:
  - Containerization and orchestration (e.g. Docker, Kubernetes)
  - Infrastructure-as-code and deployment (e.g. Terraform, CI/CD pipelines)
  - Monitoring and logging frameworks (e.g. Datadog, Prometheus, OpenTelemetry)
- Understand and implement MLOps best practices, including:
  - Model versioning and rollback strategies
  - Automated evaluation and drift detection
  - Scalable model and agent serving infrastructure (e.g. vLLM, Triton, BentoML)
- Deploy and maintain LLM and agentic workflows in production, including:
  - Monitoring cost, latency, and performance
  - Capturing traces for analysis and debugging
  - Optimizing prompt/response flows with real-time data access
- Demonstrate strong ownership and pragmatism, balancing infrastructure elegance with iterative delivery and measurable impact.
Learn about TRM Speed in this position
- Rapid Issue Resolution. TRM Engineers identify and resolve critical onsite issues in minutes to hours, not weeks. We create virtual war rooms, implement fixes, and share lessons with both customer stakeholders and internal teams within 48 hours.
- Navigating Bureaucracy. We anticipate and address procedural hurdles, build trust with key stakeholders, and find alternative pathways to approvals. This keeps projects moving even in complex environments.
- Efficient Knowledge Transfer. Engineers document and share updates in real time, ensuring the entire team—onsite and remote—has full visibility into plans, blockers, and resolutions. Knowledge sharing sessions and clear documentation reduce friction and accelerate delivery.
About TRM's Engineering Levels
Engineer: Responsible for helping to define project milestones and making small decisions independently, with appropriate tradeoffs between simplicity, readability, and performance. Provides mentorship to junior engineers and enhances operational excellence through tech debt reduction and knowledge sharing.
Senior Engineer: Successfully designs and documents system improvements and features for an OKR/project from the ground up. Consistently delivers efficient and reusable systems, optimizes team throughput with appropriate tradeoffs, mentors team members, and enhances cross-team collaboration through documentation and knowledge sharing.
Staff Engineer: Drives scoping and execution of one or more OKRs/projects that impact multiple teams. Partners with stakeholders to set the team vision and technical roadmaps for one or more products. Is a role model and mentor to the entire engineering organization. Ensures system health and quality with operational reviews, testing strategies, and monitoring rigor.
Life at TRM
We are building a safer world. That promise shows up in how we work every day.
TRM moves quickly. We are a high-velocity, high-ownership team that expects clarity, follow-through, and impact. People who thrive here are energized by hard problems, experimentation, and continuous feedback. If something takes months elsewhere, it will ship here in days.
Our work sits at the intersection of AI, national security, and fighting crime. The problems are complex, the stakes are real, and the environment evolves quickly. The pace and intensity of the work reflect the importance of the mission. As a result, the way we operate requires a high level of ownership, adaptability, collaboration, and creative problem-solving.
At TRM, you should expect:
- Priorities and targets to change quickly as we experiment and iterate
- Work that often requires operating with a high degree of ambiguity
- A high level of personal ownership and accountability
- Close collaboration across teams and functions
- Frequent, high-touch communication
- Creative problem solving and out-of-the-box thinking
- A pace that rewards urgency, adaptability, and outcomes
This environment is energizing for people who enjoy building, solving hard problems, and making progress in situations that are not always fully defined. It also requires comfort navigating ambiguity, adjusting course as new information emerges