About the role:
Our Scalability and Capability Inference team is responsible for building and maintaining the critical systems that serve our LLMs to a diverse set of consumers. As the cornerstone of our service delivery, the team focuses on scaling inference systems, ensuring reliability, optimizing compute resource efficiency, and developing new inference capabilities. The team tackles complex distributed systems challenges across our entire inference stack, from optimal request routing to efficient prompt caching.
You may be a good fit if you:
- Have significant software engineering experience
- Are results-oriented, with a bias towards flexibility and impact
- Pick up slack, even when it falls outside your job description
- Enjoy pair programming (we love to pair!)
- Want to learn more about machine learning research
- Care about the societal impacts of your work
Strong candidates may also have experience with:
- High-performance, large-scale distributed systems
- Implementing and deploying machine learning systems at scale
- LLM optimization techniques such as batching and caching strategies
- Kubernetes
- Python
Representative projects:
- Optimizing inference request routing to maximize compute efficiency
- Autoscaling our compute fleet to match compute supply with inference demand
- Contributing to new inference features (e.g., structured sampling, fine-tuning)
- Supporting inference for new model architectures
- Ensuring smooth and regular deployment of inference services
- Analyzing observability data to tune performance based on production workloads
Deadline to apply: None. Applications will be reviewed on a rolling basis.