About the role:
You will be responsible for pretraining data research. You may be working on understanding pretraining data trends and scaling laws, optimizing pretraining data mixes, investigating potential new sources of data, building research tools to better understand experimental results, or figuring out how to process and use pretraining data most effectively.
You may be a good fit if you:
- Have significant software engineering experience
- Are results-oriented, with a bias towards flexibility and impact
- Pick up slack, even if it goes outside your job description
- Are comfortable with a very empirical research environment
- Care about the societal impacts of your work
Strong candidates may also have experience with:
- High-performance, large-scale ML systems
- Language modeling with transformers
- Large-scale ETL
- Designing ML experiments and researching ML fundamentals
- Inspecting and iterating on data (e.g. ML competitions, quantitative finance)
Representative projects:
- Comparing the compute efficiency of different datasets
- Making a multimodal dataset in a format models can easily consume
- Scaling a data processing job to thousands of machines
- Designing a research tool to analyze and manage data ablation experiments
- Creating an interactive visualization of semantic clusters in our training data