Who We Are Looking For:

We are seeking an experienced Data Engineer (Senior) to build and maintain the data infrastructure that converts our research into scalable, production-ready solutions for synthetic tabular data generation. You will also architect and operate our large-scale data curation, scraping, and cleaning pipelines, delivering large volumes of data for pretraining and finetuning large language models on tabular and unstructured domains.

This is an individual contributor (IC) role suited for someone who thrives in a fast-paced, early-stage start-up environment. The ideal candidate has experience scaling data and machine learning systems to handle datasets with billions of records and can build and optimize complex data pipelines for enterprise applications. You'll work closely with software, machine learning, and applied research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and more.

Key Responsibilities:

Data Infrastructure and Pipeline Development:

Massive-Scale Data Collection & Ingestion:

Strong understanding of ML concepts and algorithms: