ABBYY

Senior Machine Learning Engineer, Synthetic Data & Document Understanding

Department
Engineering
Job Type / Location
Bangalore
Experience Required
5+ years

About the Role

We are seeking a Senior Machine Learning Engineer, Synthetic Data & Document Understanding, to own the synthetic data generation track within ABBYY’s Document AI Data team. This role focuses on building generative pipelines that produce high-quality, diverse, and realistic synthetic training data at scale. You will ensure that synthetic data meaningfully improves downstream model performance by keeping it closely aligned with real-world document structures, formats, and statistical properties. The role is ideal for engineers who combine deep generative modeling expertise with rigorous data quality evaluation and production engineering skills.

Key Responsibilities

Technical Development & Innovation

  • Design and implement pipelines that analyze real documents to inform high-fidelity synthetic data generation
  • Build generative systems capable of producing documents across diverse formats, layouts, and domains
  • Develop evaluation frameworks to ensure synthetic data maintains distributional fidelity and diversity
  • Research and apply generative modeling techniques suited for document AI training
  • Identify and mitigate quality issues to ensure synthetic data is effective for downstream model training
  • Partner with Modeling teams to measure the impact of synthetic data on model performance

Project Ownership & Leadership

  • Own the synthetic data generation track end-to-end, from architecture to quality validation
  • Drive architectural decisions balancing quality, diversity, scale, and cost efficiency
  • Define and maintain data quality metrics and generation dashboards
  • Collaborate closely with annotation teams to ensure compatibility with downstream pipelines
  • Contribute to roadmap planning alongside Principal-level leadership

Infrastructure & Scale

  • Build scalable pipelines capable of generating millions of synthetic training examples
  • Implement post-processing, filtering, and validation mechanisms to remove low-quality outputs
  • Design cost-efficient workflows balancing compute, quality, and throughput
  • Develop monitoring systems to detect distribution shifts or quality degradation over time
  • Collaborate with Platform teams on compute orchestration, storage, and scheduling

Qualifications

Education & Experience

  • MS or PhD in Computer Science, Engineering, Mathematics, or related field
  • 5+ years of experience in Machine Learning / AI, with a focus on:
    • Generative models
    • Vision-Language Models (VLMs)
    • Synthetic data systems
  • Proven experience building and evaluating synthetic data pipelines for ML training
  • Strong background in data quality evaluation and statistical analysis

Technical Expertise

  • Deep expertise in Vision-Language Models and document understanding (layout, structure, semantics)
  • Strong knowledge of generative modeling for structured and semi-structured data
  • Understanding of what makes synthetic data valuable:
    • Distributional fidelity
    • Diversity
    • Realistic noise patterns
    • Domain coverage
  • Strong programming skills in Python with experience in PyTorch or similar frameworks
  • Experience evaluating data quality via automated metrics and downstream model impact
  • Familiarity with large-scale data pipelines, cloud environments, and experiment tracking

Leadership & Communication

  • Proven ability to independently own complex technical workstreams
  • Strong collaboration across data, modeling, and platform teams
  • Ability to clearly communicate data quality and generation trade-offs
  • Data-driven mindset with strong attention to coverage gaps and quality signals
