Who We Are Looking For:
We are seeking an experienced Senior Data Engineer to build and maintain the data infrastructure that turns our research into scalable, production-ready solutions for synthetic tabular data generation. You will also architect and operate our large-scale data curation, scraping, and cleaning pipelines to deliver the massive datasets needed for pretraining and fine-tuning large language models on tabular and unstructured domains.
This is an individual contributor (IC) role suited to someone who thrives in a fast-paced, early-stage start-up environment. The ideal candidate has experience scaling data and machine learning systems to datasets with billions of records and can build and optimize complex data pipelines for enterprise applications. You'll work closely with the software, machine learning, and applied research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and more.
Key Responsibilities:
Data Infrastructure and Pipeline Development:
- Build data ingestion pipelines from enterprise relational databases (e.g. Oracle, SQL Server, PostgreSQL, MySQL, Databricks, Snowflake, BigQuery) and files (e.g. Parquet, CSV) for large-scale synthetic data pipelines.
- Design scalable batch-processing data pipelines, choosing between distributed computing tools like Spark, Dask, or Ray for extremely large datasets spread across multiple nodes and single-node tools like Polars and DuckDB for lighter-weight, more efficient operations (see the first sketch after this list).
- Architect and maintain data warehouses and data lakes (e.g. Delta Lake) optimized for synthetic data training and generation workflows.
- Seamlessly transform Pandas-based research code into production-ready pipelines.
- Build automated data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle (see the second sketch after this list).
- Implement comprehensive data lineage tracking and audit capabilities for regulatory compliance and privacy validation.
- Design robust error-handling mechanisms with automatic retries and data recovery in case of pipeline failures (see the third sketch after this list).
- Track performance metrics such as data throughput, latency, and processing times to ensure efficient pipeline operations at scale.
- Implement monitoring and alerting (e.g. Prometheus, Grafana) for pipeline health, throughput, and data quality metrics (see the fourth sketch after this list).
- Optimize resource allocation and cost efficiency for distributed processing at terabyte-to-petabyte scale.
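
To illustrate the single-node vs. distributed choice above, here is a minimal sketch that routes a batch job to Polars or Spark based on estimated input size. The size cutoff, the quality_score column, and the helper names are assumptions for illustration, not an existing internal API.

```python
# A minimal sketch, assuming Parquet inputs on shared storage.
from pathlib import Path

import polars as pl
from pyspark.sql import SparkSession

SINGLE_NODE_LIMIT_BYTES = 200 * 1024**3  # assumed cutoff: ~200 GB fits one node


def dataset_size_bytes(root: str) -> int:
    """Sum the on-disk size of all Parquet files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*.parquet"))


def run_batch_job(root: str, out_path: str) -> None:
    if dataset_size_bytes(root) <= SINGLE_NODE_LIMIT_BYTES:
        # Lightweight path: a lazy Polars scan streams the data on one node.
        (
            pl.scan_parquet(f"{root}/**/*.parquet")
            .filter(pl.col("quality_score") > 0.8)  # assumed column
            .sink_parquet(out_path)
        )
    else:
        # Heavyweight path: hand the same transformation to a Spark cluster.
        spark = SparkSession.builder.appName("batch-job").getOrCreate()
        df = spark.read.parquet(root)
        df.filter(df.quality_score > 0.8).write.mode("overwrite").parquet(out_path)
```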
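For the data quality monitoring bullet, a minimal sketch of batch-level checks; the column names and thresholds are assumptions that would normally come from a schema registry or pipeline config.

```python
# A minimal sketch of per-batch data quality checks; thresholds are assumptions.
import polars as pl


class DataQualityError(Exception):
    pass


def validate_batch(df: pl.DataFrame, required_cols: list[str],
                   max_null_rate: float = 0.05) -> None:
    # Schema check: every expected column must be present.
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise DataQualityError(f"missing columns: {missing}")

    # Volume check: an empty batch usually signals an upstream failure.
    if df.height == 0:
        raise DataQualityError("empty batch")

    # Completeness check: flag columns whose null rate exceeds the threshold.
    for col in required_cols:
        null_rate = df[col].null_count() / df.height
        if null_rate > max_null_rate:
            raise DataQualityError(f"{col} null rate {null_rate:.2%} exceeds limit")
```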
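For the error-handling bullet, a minimal plain-Python sketch of retries with exponential backoff; the attempt counts, delays, and the load_partition example are illustrative.

```python
# A minimal sketch of retry-with-backoff for flaky pipeline steps.
import functools
import logging
import time

logger = logging.getLogger(__name__)


def with_retries(max_attempts: int = 5, base_delay_s: float = 2.0):
    """Retry a step with exponential backoff, re-raising after the last attempt."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        logger.error("%s failed after %d attempts", func.__name__, attempt)
                        raise
                    delay = base_delay_s * 2 ** (attempt - 1)
                    logger.warning("%s failed (%s); retrying in %.1fs", func.__name__, exc, delay)
                    time.sleep(delay)
        return wrapper
    return decorator


@with_retries(max_attempts=3)
def load_partition(path: str):
    ...  # e.g. read a Parquet partition from object storage
```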
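For the monitoring and alerting bullet, a minimal sketch using the prometheus_client library to expose pipeline metrics that Grafana can dashboard and alert on; the metric names and scrape port are assumptions.

```python
# A minimal sketch; metric names and the scrape port are assumptions.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

ROWS_PROCESSED = Counter("pipeline_rows_processed_total", "Rows processed")
FAILED_BATCHES = Counter("pipeline_failed_batches_total", "Batches that failed")
BATCH_SECONDS = Histogram("pipeline_batch_duration_seconds", "Batch processing time")
LAST_QUALITY = Gauge("pipeline_last_batch_quality_score", "Quality score of the last batch")


def process_batch(rows: list[dict]) -> None:
    start = time.monotonic()
    try:
        # ... transform and write the batch here ...
        ROWS_PROCESSED.inc(len(rows))
        LAST_QUALITY.set(0.97)  # placeholder score
    except Exception:
        FAILED_BATCHES.inc()
        raise
    finally:
        BATCH_SECONDS.observe(time.monotonic() - start)


start_http_server(9108)  # endpoint Prometheus scrapes; Grafana alerts on the metrics
```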
Massive-Scale Data Collection & Ingestion:
- Design and build distributed web scraping clusters to extract data from millions of pages.
- Build LLM-aided data filtering systems that use automated model scoring to evaluate and prioritize high-quality content (see the sketch after this list).
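
As a rough illustration of the LLM-aided filtering bullet, a sketch that pre-filters scraped pages with cheap heuristics and keeps only documents the scoring model rates highly. Here score_document is a hypothetical stand-in for whatever LLM or quality classifier is used; the threshold and minimum length are assumptions.

```python
# A minimal sketch; score_document is a hypothetical stand-in for an LLM or
# small quality classifier, and the threshold/length cutoffs are assumptions.
from dataclasses import dataclass


@dataclass
class ScrapedDoc:
    url: str
    text: str


def score_document(doc: ScrapedDoc) -> float:
    """Hypothetical scorer: would call an LLM or a trained quality classifier
    and return a quality score in [0, 1]."""
    raise NotImplementedError


def filter_high_quality(docs: list[ScrapedDoc],
                        min_chars: int = 500,
                        threshold: float = 0.7) -> list[ScrapedDoc]:
    kept = []
    for doc in docs:
        # Cheap heuristic pre-filter avoids spending model calls on obvious junk.
        if len(doc.text) < min_chars:
            continue
        # Model-based score decides what is kept for the training corpus.
        if score_document(doc) >= threshold:
            kept.append(doc)
    return kept
```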
Machine Learning Data Pipelines (strong understanding of ML concepts and algorithms required):
- Construct deep learning data pipelines for training generative models (GANs, VAEs, Transformers, Diffusion Models) on large-scale datasets (a minimal sketch follows).
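
As a rough illustration of that last bullet, a minimal sketch that streams a large Parquet corpus into a PyTorch training loop without loading it into memory; the paths, feature columns, and chunk size are assumptions, and the same pattern can feed GAN, VAE, Transformer, or diffusion training loops.

```python
# A minimal sketch; paths, feature columns, and chunk sizes are assumptions.
from pathlib import Path

import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset


class ParquetStream(IterableDataset):
    """Streams fixed-size chunks of a Parquet dataset as float32 tensors."""

    def __init__(self, root: str, feature_cols: list[str], rows_per_chunk: int = 4096):
        self.files = sorted(Path(root).rglob("*.parquet"))
        self.feature_cols = feature_cols
        self.rows_per_chunk = rows_per_chunk

    def __iter__(self):
        for path in self.files:
            pf = pq.ParquetFile(path)
            for batch in pf.iter_batches(batch_size=self.rows_per_chunk,
                                         columns=self.feature_cols):
                # Each Arrow record batch becomes one training chunk.
                yield torch.from_numpy(batch.to_pandas().to_numpy(dtype="float32"))


# batch_size=None: the dataset already yields chunk-sized tensors.
loader = DataLoader(ParquetStream("/data/tabular", ["f1", "f2", "f3"]), batch_size=None)
for chunk in loader:
    pass  # feed `chunk` to the generative model's training step
```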