dv01

AI/MLOps Platform Engineer

Department: Engineering
Job Type / Location: Remote
Experience Required: 8+ years

About dv01

dv01 is dedicated to bringing transparency to the structured finance market, a $16+ trillion sector crucial for financial freedom through activities like debt consolidation, loan refinancing, home buying, and supporting small businesses. Our data analytics platform provides unparalleled insight into investment performance and risk for lenders and Wall Street investors in structured products. As a data-first company, we manage critical loan data and develop modern analytical tools to enable strategic decision-making and promote responsible lending. Our mission is to help prevent future financial crises by providing the necessary data and tools for smarter, data-driven decisions, contributing to a safer global financial environment.

Over 400 major financial institutions rely on dv01 for comprehensive coverage of more than 100 million loans, spanning mortgages, personal loans, auto loans, buy-now-pay-later programs, small business loans, and student loans. dv01 consistently expands into new markets, adds loans monthly, and develops new technologies for the structured products universe.

About the Role

As an AI/MLOps Platform Engineer, you will be instrumental in designing, building, and operating a cutting-edge AI infrastructure platform that accelerates AI development across the company. You will focus on the operational foundations of AI systems, ensuring efficient, scalable, and secure deployment of AI-powered services.

Responsibilities

  • Build and operate an AI infrastructure platform: Design, build, and operate cloud-native infrastructure and platform tooling to accelerate AI development, enabling teams to develop, deploy, and operate AI-powered services safely and efficiently in production.
  • Own the DevOps and infrastructure side of AI Ops and Agentic Systems: Focus on the operational foundations of AI systems, including CI/CD for AI workloads, scalable inference infrastructure, observability, cost management, and reliability. Establish repeatable patterns and shared services to reduce friction for teams building AI-enabled applications.
  • Enable AI services, agents, and runtime platforms: Build and maintain infrastructure to support AI services such as LLM-backed APIs, Model Context Protocol (MCP) servers, and agentic systems used by production applications. Enable secure tool access, runtime orchestration, and isolation boundaries for AI-driven workloads.
  • Integrate AI Ops capabilities into platform operations: Apply MLOps concepts to improve platform operations, including AI-driven approaches to monitoring, alerting, anomaly detection, and incident response across AI and non-AI systems. Evolve how the platform observes and operates complex AI-enabled systems at scale.
  • Establish governance, security, and operational guardrails: Define and implement infrastructure-level governance for AI systems, including access controls, deployment policies, auditability, and secure-by-default patterns. Partner with security and compliance teams to ensure AI infrastructure aligns with organizational risk and regulatory requirements.
  • Provide technical leadership and enablement: Act as a technical leader, influencing platform architecture and best practices. Mentor engineers and work closely with product, data, and application teams to align AI platform capabilities with business goals.
  • Contribute to dv01’s AI and data roadmap.
  • Establish technical direction and strategy.
  • Mentor and provide technical guidance to junior engineers.

Requirements

  • Senior Cloud and Platform Engineer: 8+ years of experience in cloud infrastructure, DevOps, or platform engineering roles, with deep expertise designing and operating distributed systems in production.
  • Experienced with AI Ops and Agentic Platforms: Direct exposure to ML/GenAIOps practices, such as monitoring, anomaly detection, predictive alerting, or automated remediation, applied to real production systems. 5+ years of MLOps experience is required.
  • Strong in Cloud-Native Infrastructure: Proficient in building and managing cloud environments, Kubernetes, containerized workloads, and infrastructure-as-code tools such as Terraform.
  • Comfortable Supporting AI Workloads: Hands-on experience supporting platforms that host and run deep neural networks, including LLM runtimes (e.g., vLLM, llama.cpp), ML compiler stacks (e.g., LLVM/MLIR), and PyTorch-based production systems.
  • Security- and Operations-Minded: Strong understanding of infrastructure security, IAM, secrets management, and operational risk as it relates to AI-enabled systems.
  • Platform-Focused Technical Leader: Operate effectively as a technical leader, influencing architecture and standards while remaining hands-on. Communicate clearly, collaborate well cross-functionally, and thrive in ambiguous problem spaces.
  • Forward-Thinking and Pragmatic: Proactive and innovative, with the ability to introduce emerging agentic patterns while balancing operational maturity and long-term maintainability. Able to design and operate scalable benchmarking and evaluation frameworks for agentic AI systems that quantitatively measure accuracy, reliability, cost–performance tradeoffs, regressions, and the impact of model, prompt, or architecture changes (including techniques such as LLM-as-a-judge), with tooling that is reusable and accessible across the organization.

Nice To Have

  • Experience with Pulumi.
  • Experience with GCP and Cloudflare.
  • Experience with GitHub Actions (GHA) and Harness.
  • Experience with Go (Golang).
  • Experience supporting Data Engineering Platforms.
  • Exposure to Data Warehousing and ETL/ELT Tools or Operations.
