Scale GP is building the next generation of enterprise-grade Generative AI products. Our platform provides APIs for knowledge retrieval, inference, and evaluation, enabling customers to build and deploy powerful Agentic workflows for Enterprise use cases. We're looking for a Senior Infrastructure Software Engineer to build and scale our core infrastructure in a fast-paced environment. This team is key to our mission, directly enabling the deployment of these agentic flows for our customers.
This is a unique opportunity for an infrastructure leader who is passionate about defining the future of AI deployments. You will be at the forefront of the industry, solving complex, bleeding-edge problems in scalability, security, and developer efficiency. You will architect and implement solutions across multiple cloud providers (GCP, Azure, AWS) for customers in diverse, highly regulated industries like healthcare, telecom, finance, and retail.
What You'll Do:
- Define the architectural patterns for our multi-cloud infrastructure to support secure, reliable, and scalable Agentic workflows for enterprise customers.
- Lead the infrastructure roadmap with a strong focus on compliance, privacy, and security standards, including designing change management and data isolation strategies.
- Own the development and maintenance of our best-in-class Agentic observability platform (logging, metrics, tracing, and analytics) to proactively ensure system health and enable rapid incident response.
- Drive developer efficiency by building automated tooling and championing Infrastructure-as-Code (IaC) paradigms throughout the engineering organization.
- Solve the toughest engineering problems related to multi-tenancy, data isolation, and high-performance inference at massive scale, taking end-to-end ownership across the full product lifecycle.
What We're Looking For:
- Proven experience in a senior role, with 5+ years of full-time software engineering experience.
- Deep understanding of modern infrastructure practices, including CI/CD, IaC (e.g., Terraform, Helm Charts), container orchestration (e.g., Kubernetes), and observability platforms (e.g., Datadog, Prometheus, Grafana).
- Extensive experience with at least one major cloud provider (AWS, Azure, or GCP).
- Strong knowledge of security and compliance in enterprise environments, with a focus on access management, data isolation, and customer-specific VPC setups.
- Proficiency in Python or JavaScript/TypeScript, and SQL.
- Bonus points: Hands-on experience with, and a passion for, Agents, LLMs, vector databases, and other emerging AI technologies.