Who we are looking for

We seek an experienced Machine Learning Operations Engineer (Senior) to transform cutting-edge research into robust, production-ready services for synthetic data generation and to optimize both deep learning and classical ML algorithms (e.g. tree-based models) at enterprise scale (billions of rows). You will build and tune model pipelines end-to-end, ensuring high performance, scalability, and reliability across diverse workloads and dataset sizes.

Key Responsibilities:

Algorithm Optimization & Scaling

Identify and remove bottlenecks in deep generative models (e.g. transformers, diffusion models, GANs) to accelerate training and generation.
Implement distributed training of these models across multi-GPU clusters.
Optimize distributed training of traditional ML models (e.g. XGBoost, LightGBM, CatBoost) on billion-row datasets.
Establish memory management best practices that maximize compute and memory utilization, enabling faster training at lower cost.

Data Handling at Scale

Collaborate with data engineers to design ETL/ELT workflows that handle terabyte- to petabyte-scale tabular and unstructured data.
Implement scalable feature engineering pipelines using distributed computing frameworks (e.g. Spark, Dask, or Ray).
Automate data validation (e.g. schema checks, anomaly detection) with rule-based and ML-driven frameworks.

End-to-End Orchestration

Build ML pipelines that turn research prototypes into reliable, production-grade workflows.
Package models into Docker containers and deploy using Kubernetes.
Build automated model and data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle.
Design robust error-handling mechanisms with automatic retries and data recovery for pipeline failures.
Implement logging, monitoring and alerting systems.