Who We Are Looking For:
We are seeking a Senior Data & Machine Learning Engineer with hands-on experience transforming academic research into scalable, production-ready solutions for synthetic tabular data generation. This is an individual contributor (IC) role suited to someone who thrives in a fast-paced, early-stage startup environment. The ideal candidate has extensive experience scaling systems to handle datasets with hundreds of millions to billions of records and can build and optimize complex data pipelines for enterprise applications.
This role requires someone familiar with the dynamic nature of a startup, capable of rapidly designing and implementing scalable solutions. You'll work closely with research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and internet companies.
Key Responsibilities:
ML Concepts & Algorithms:
- Apply a strong understanding of ML concepts and algorithms, together with practical production experience, to work alongside AI / data science teams and turn their research code into scalable, production-ready systems.
Data Ingestion & Integration:
- Ingest data from enterprise relational databases such as Oracle, SQL Server, PostgreSQL, and MySQL, as well as enterprise SQL-based data warehouses like Snowflake, BigQuery, Redshift, Azure Synapse, and Teradata for large-scale analytics.
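For illustration, here is a minimal sketch of the kind of ingestion involved, reading from a PostgreSQL source in bounded chunks; the connection string, table, columns, and downstream handler are hypothetical, not a prescribed stack:

```python
# Illustrative sketch only: connection details, table, and columns are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# Connect to an enterprise PostgreSQL source (credentials assumed).
engine = create_engine("postgresql+psycopg2://user:password@db-host:5432/analytics")

def handle_batch(df: pd.DataFrame) -> None:
    # Placeholder for downstream validation / transformation steps.
    print(f"ingested {len(df)} rows")

# Stream a large table in chunks to keep memory usage predictable during ingestion.
chunks = pd.read_sql(
    "SELECT account_id, event_ts, amount FROM transactions WHERE event_ts >= '2024-01-01'",
    con=engine,
    chunksize=500_000,
)

for chunk in chunks:
    # Each chunk is an ordinary DataFrame; validation and transformation steps
    # downstream can consume batches instead of the whole table at once.
    handle_batch(chunk)
```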
Data Validation & Quality Assurance:
- Ensure ingested data conforms to predefined schemas, checking data types, missing values, and field constraints.
- Implement data quality checks for nulls, outliers, and duplicates to ensure data reliability.
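As a hedged sketch of what such checks can look like on a single node, the snippet below validates an ingested batch with Polars; the expected schema, key columns, and 3-sigma outlier rule are illustrative assumptions:

```python
# Illustrative sketch only: the schema, key columns, and thresholds are hypothetical.
import polars as pl

EXPECTED_SCHEMA = {"account_id": pl.Int64, "event_ts": pl.Datetime, "amount": pl.Float64}

def validate(df: pl.DataFrame) -> list[str]:
    """Return a list of data-quality issues found in an ingested batch."""
    issues = []

    # Schema conformance: every expected column must exist with the expected dtype.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif df.schema[col] != dtype:
            issues.append(f"{col}: expected {dtype}, got {df.schema[col]}")

    # Null checks on every field.
    null_counts = df.null_count()
    for col in df.columns:
        n = null_counts[col][0]
        if n > 0:
            issues.append(f"{col}: {n} null values")

    # Duplicate check on the business key.
    if "account_id" in df.columns and "event_ts" in df.columns:
        dupes = df.height - df.unique(subset=["account_id", "event_ts"]).height
        if dupes > 0:
            issues.append(f"{dupes} duplicate (account_id, event_ts) rows")

    # Simple outlier flag: amounts beyond 3 standard deviations from the mean.
    if "amount" in df.columns:
        stats = df.select(pl.col("amount").mean().alias("mu"), pl.col("amount").std().alias("sigma"))
        mu, sigma = stats["mu"][0], stats["sigma"][0]
        if sigma:
            outliers = df.filter((pl.col("amount") - mu).abs() > 3 * sigma).height
            if outliers > 0:
                issues.append(f"{outliers} outlier amounts (>3 sigma)")

    return issues
```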
Data Transformation & Processing:
- Design scalable data pipelines for batch processing, deciding between distributed computing tools like Spark, Dask, or Ray when handling extremely large datasets across multiple nodes, and single-node tools like Polars and DuckDB for more lightweight, efficient operations. The choice will depend on the size of the data, system resources, and performance requirements.
- Leverage Polars for high-speed, in-memory data manipulation when large datasets can still be processed efficiently on a single node.
- Utilize DuckDB for larger-than-memory, on-disk query execution, running SQL over local data with minimal overhead where a balance between memory use and query performance is needed.
- Transform Pandas-based research code into production-ready pipelines, ensuring efficient memory usage and fast data access without adding unnecessary complexity, as in the sketch below.
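As one hedged example of that hand-off, the sketch below contrasts a hypothetical Pandas research aggregation with a lazy Polars equivalent that scans only the columns it needs; the Parquet path and column names are assumptions:

```python
# Illustrative sketch only: paths and column names are hypothetical.
import pandas as pd
import polars as pl

# Research-style Pandas code: loads everything into memory, then aggregates.
def monthly_totals_pandas(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    df["month"] = df["event_ts"].dt.to_period("M").astype(str)
    return df.groupby(["account_id", "month"], as_index=False)["amount"].sum()

# Production-leaning equivalent: a lazy Polars query that reads only the needed
# columns and avoids materializing the full table before aggregating.
def monthly_totals_polars(path: str) -> pl.DataFrame:
    return (
        pl.scan_parquet(path)
        .with_columns(pl.col("event_ts").dt.strftime("%Y-%m").alias("month"))
        .group_by("account_id", "month")
        .agg(pl.col("amount").sum())
        .collect()
    )
```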
Data Storage & Retrieval:
- Work with internal data representations such as Parquet, Arrow, and CSV to support the needs of our generative models, choosing the appropriate format based on data processing and performance needs.
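A small sketch of how these formats typically interoperate (the file name is hypothetical): Arrow as the in-memory interchange layer, Parquet as columnar on-disk storage, and DuckDB querying the Parquet file directly with SQL:

```python
# Illustrative sketch only: the file name and columns are hypothetical.
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# An Arrow table as the in-memory interchange representation.
table = pa.table({"account_id": [1, 2, 2], "amount": [10.0, 5.5, 7.25]})

# Persist to Parquet for columnar, compressed on-disk storage.
pq.write_table(table, "transactions.parquet")

# DuckDB can query the Parquet file directly, in SQL, without first
# converting it into another intermediate format.
con = duckdb.connect()
totals = con.execute(
    "SELECT account_id, SUM(amount) AS total FROM 'transactions.parquet' GROUP BY account_id"
).arrow()  # results come back as an Arrow table again
```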
Distributed Systems & Scalability:
- Ensure that the system can scale efficiently from a single node to multiple nodes, providing graceful scaling for users with varying compute capacities.
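One possible shape for that kind of graceful scaling, sketched under assumptions (the size cutoff, columns, and the Spark fallback are illustrative, not a mandated design):

```python
# Illustrative sketch only: the threshold, columns, and backend choice are hypothetical.
import os
import polars as pl

SINGLE_NODE_LIMIT_BYTES = 50 * 1024**3  # assumed cutoff, tuned per deployment

def aggregate(parquet_path: str):
    """Route the same aggregation to a single-node or distributed backend by input size."""
    if os.path.getsize(parquet_path) <= SINGLE_NODE_LIMIT_BYTES:
        # Small enough for one machine: a lazy Polars query keeps memory bounded.
        return (
            pl.scan_parquet(parquet_path)
            .group_by("account_id")
            .agg(pl.col("amount").sum())
            .collect()
        )
    # Otherwise fall back to Spark so the same logical query spans multiple nodes.
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.appName("aggregate").getOrCreate()
    return (
        spark.read.parquet(parquet_path)
        .groupBy("account_id")
        .agg(F.sum("amount").alias("amount"))
    )
```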