About the role
Anthropic is at the forefront of AI research, and our mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole, and we work to ensure that transformative AI systems are aligned with human interests. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.
We are seeking a Research Engineer to join our Pretraining team, which is responsible for developing the next generation of large language models. In this role, you will work at the intersection of cutting-edge research and practical engineering, contributing to the development of safe, steerable, and trustworthy AI systems.
Responsibilities:
- Design and implement high-performance ML training infrastructure for large language model research
- Develop and maintain core ML framework primitives in frameworks such as JAX and PyTorch
- Create robust automated evaluation and benchmarking systems for model performance
- Implement comprehensive monitoring and debugging tools for ML workflows
- Design and optimize data loading pipelines that maximize training throughput
- Build MLOps tooling to support reproducible research and experimentation
- Collaborate with research teams to prototype and scale novel training architectures
- Develop infrastructure for efficient hyperparameter sweeps and architecture search
You may be a good fit if you have:
- Strong software engineering skills and experience building distributed systems
- Expertise in Python and experience with distributed computing frameworks
- Deep understanding of cloud computing platforms and distributed systems architecture
- Experience with high-throughput, fault-tolerant system design
- Strong background in performance optimization and system scaling
- Excellent problem-solving skills and attention to detail
- Strong communication skills and ability to work in a collaborative environment
Strong candidates may have:
- Advanced degree (MS or PhD) in Computer Science or a related field
- Experience with language model training infrastructure
- Strong background in distributed systems and parallel computing
- Expertise in tokenization algorithms and techniques
- Experience building high-throughput, fault-tolerant systems
- Deep knowledge of monitoring and observability practices
- Experience with infrastructure-as-code and configuration management
- Background in MLOps or ML infrastructure