Scale’s AI Infrastructure team supports both R&D and applied Generative AI initiatives, driving breakthroughs in areas of post-training research such as AI safety, agents, and evaluating state-of-the-art model performance.
As a Data Infrastructure Engineer on the AI Infrastructure team, you will design, build, and scale the data platform that powers all R&D and applied ML initiatives at Scale. Collaborating closely with product engineering, platform engineering, and ML researchers, you will build robust and easy-to-use APIs and data pipelines. Your work will play a critical role in advancing frontier ML research, accelerating the data sales cycle, and improving data quality, all while optimizing infrastructure costs.
You will:
- Design, implement, and maintain scalable data platforms to support diverse R&D and applied ML workloads.
- Partner with ML researchers, product engineers, and operations teams to align data infrastructure with organizational goals.
- Collaborate with ML researchers to build data access tools that help advance the state of frontier post-training research.
- Participate in our team’s on-call process to ensure the availability of our services.
- Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.
Ideally you'd have:
- 2+ years of experience in building and operating large-scale distributed data systems that support ML workloads.
- Expertise in modern data platform technologies (e.g. Snowflake, Delta Lake, Spark, Flink, Ray) and data engineering practices.
- Experience working with standard containerization & deployment technologies like Kubernetes, Helm, Terraform, Docker, etc.
- Strong problem-solving skills and the ability to work effectively in a fast-paced, dynamic environment.
Nice to haves:
- Knowledge of AI/ML frameworks and libraries (e.g. PyTorch, Hugging Face).
- Experience with document (MongoDB), relational (Postgres), blob (S3), and highly distributed (Redis, Elasticsearch) data stores.
- Strong experience with orchestration platforms, such as Temporal and AWS Step Functions.
- Experience scaling products at hyper-growth startups.