About the role
Anthropic’s Compute team is looking for a Principal Capacity Engineer to lead capacity planning, forecasting, and optimization for our global infrastructure fleet. You’ll work closely with research, engineering, and finance teams to ensure we have scalable systems for capacity management, high-quality data and insights for planning, and engineering roadmaps that deliver efficiency wins and increase total effective compute. Experience with capacity management for AI workloads is preferred.
Responsibilities:
- Design, develop, and deliver capacity management systems for AI workloads on heterogeneous infrastructure
- Build and maintain robust attribution of usage and enable in-depth data-driven insights
- Oversee design and implementation of planning tools and systems-level guardrails for capacity planning and quota management
- Build a deep understanding of research and training workloads to accurately model cost-to-serve and cost-to-train
- Proactively identify efficiency opportunities and collaborate with teams across the org to increase total effective compute for Anthropic
- Partner closely with Finance and leadership, providing detailed and clear capacity inputs for financial planning and strategic decision making
You may be a good fit if you:
- Have experience working on capacity at a major cloud provider or hyperscaler company
- Have experience driving cross-functional projects and interfacing with technical and non-technical stakeholders
- Have experience working with LLMs and/or a deep interest in learning about model training and serving efficiency
- Are comfortable leveraging data and have experience building observability for complex systems
- Have strong interpersonal skills that enable you to influence without authority and build cross-organizational support for capacity initiatives
Strong candidates may also have some of the following:
- Past experience as a lead capacity engineer
- Past experience partnering with senior leadership
- Past experience working on model training or model inference
Representative Projects:
- Building a system for capacity planning and optimizing resource allocation for model training, inference, and research
Deadline to apply: None. Applications will be reviewed on a rolling basis.