About Kumo.ai
Kumo is building the infrastructure layer for the next generation of enterprise AI — a platform that lets organizations turn their data into predictive intelligence instantly, without the heavy lifting of traditional ML pipelines. We have also built our own Relational Foundation Model that can provide predictions in seconds – no training, straight to business value!
Join a dynamic, rapidly expanding team of innovators from top-tier companies and institutions like Airbnb, LinkedIn, Pinterest, and Stanford, supported by the renowned Sequoia Capital. We're on the front lines of AI, solving some of its most challenging and impactful problems, and we've already delivered over $500M in tangible value to industry giants like Reddit, DoorDash, and Databricks. If you thrive in a fast-paced environment, are driven by ambitious goals, and crave an opportunity for massive impact, this is your chance to shape the future of AI.
The Opportunity
Kumo’s platform runs thousands of predictive workloads across multi-tenant Kubernetes clusters that form the backbone of our AI stack. As a Cloud Infrastructure Engineer, you’ll own, scale, and optimize that platform, from real-time inference to large-scale training, with real production impact. You’ll make high-leverage architectural decisions, ship quickly, and collaborate across ML, product, and engineering teams to expand our multi-cloud capabilities. Expect to move fast, iterate often, and see your changes land in production within days, not quarters.
What You’ll Do
- Design, build, and evolve Kumo’s multi-tenant infrastructure to support massive AI and data workloads across AWS, Azure, and GCP.
- Implement and maintain infrastructure-as-code to automate training and deployment pipelines across many environments.
- Operate and scale Kubernetes clusters with a focus on reliability, performance, availability, tenant isolation, and cost efficiency.
- Build observability and alerting into distributed systems using Prometheus, Grafana, OpenTelemetry, and related tooling.
- Partner closely with ML researchers and product teams to deliver production-grade infrastructure for advanced AI workloads.
- Drive security and operational best practices (RBAC, tenant isolation, cloud identity, etc.) across our platform.
What You Bring
- 3–5 years building or operating cloud-native infrastructure in production.
- Hands-on experience with at least one major cloud (AWS / Azure / GCP); multi-cloud exposure is a plus.
- Operational experience with Kubernetes and production-grade clusters.
- Proficiency with Infrastructure-as-Code (Terraform, Pulumi, etc.) and familiarity with GitOps tooling (ArgoCD, Flux, Argo Workflows).
- Strong debugging, systems-thinking, and communication skills — you can drive technical decisions and explain trade-offs to multiple stakeholders.
Nice to Have
- Experience operating multi-tenant Kubernetes for data / AI workloads.
- Experience with (managed) Spark or large-scale data processing systems.
- Familiarity with Kubernetes operators, controllers, and custom resources.
- Deep experience with monitoring, tracing, and logging stacks (Prometheus, OpenTelemetry, etc.).
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.