Who we are looking for
We seek an experienced Senior Machine Learning Operations Engineer to transform cutting-edge research into robust, production-ready services for synthetic data generation, and to optimize both deep learning and classical ML algorithms (e.g. tree-based models) at enterprise scale (billions of rows). You will build and tune model pipelines end to end, ensuring high performance, scalability, and reliability across diverse workloads and dataset sizes.
Key Responsibilities:
Algorithm Optimization & Scaling
- Identify and remove bottlenecks in deep generative models (e.g. transformers, diffusion models, GANs) to accelerate training and generation.
- Implement distributed training of these models across multi-GPU clusters.
- Optimize distributed training of traditional ML models (e.g. XGBoost, LightGBM, CatBoost) on billion-row datasets.
- Define best practices for memory management that maximize compute and memory utilization, enabling faster training at lower cost.
Data Handling at Scale
- Collaborate with data engineers to design ETL/ELT workflows handling terabyte- to petabyte-scale tabular and unstructured data.
- Implement scalable feature engineering pipelines using distributed computing frameworks (e.g. Spark, Dask, or Ray).
- Automate data validation (e.g. schema checks, anomaly detection) with rule-based and ML-driven frameworks.
End-to-End Orchestration
- Build ML pipelines that turn research prototypes into reliable, production-grade workflows.
- Package models into Docker containers and deploy using Kubernetes.
- Build automated monitoring and validation systems for model and data quality to ensure data integrity throughout the pipeline lifecycle.
- Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures.
- Implement logging, monitoring and alerting systems.