Director of Engineering – Platform Engineering

Hippocratic AI • Full-time • Palo Alto, California, USA • 20h ago

About Us

Hippocratic AI has developed a safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health.

Why Join Our Team

Innovative Mission: We are developing a safe, healthcare-focused large language model (LLM) designed to revolutionize health outcomes on a global scale.
Visionary Leadership: Hippocratic AI was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from leading institutions, including El Camino Health, Johns Hopkins, Stanford, Microsoft, Google, and NVIDIA.
Strategic Investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA’s NVentures, Premji Invest, SV Angel, and six health systems.
World-Class Team: Our team is composed of leading experts in healthcare and artificial intelligence, ensuring our technology is safe, effective, and capable of delivering meaningful improvements to healthcare delivery and outcomes.

For more information, visit www.HippocraticAI.com.

We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.

About the Role

We are seeking a Director of Engineering – Platform Engineering to lead the design, implementation, and operation of Hippocratic AI’s cloud infrastructure, observability systems, and GPU control plane. This leader will be responsible for scaling our global compute fabric to support cutting-edge LLM workloads while maintaining exceptional reliability, security, and cost efficiency.

You will build and lead a multidisciplinary engineering team spanning cloud operations, SRE, and GPU orchestration, working closely with product development, AI research, and compliance to deliver world-class infrastructure for healthcare AI.

What You'll Do:

Team Building & Leadership

Recruit and develop a world-class platform engineering team.
Foster a culture of innovation, accountability, and technical excellence.
Mentor and coach engineers and managers to achieve high performance and career growth.

Infrastructure Leadership

Build and scale a high-performing team responsible for all infrastructure operations and systems reliability.
Define and execute the long-term infrastructure roadmap for a multi-cloud, multi-region GPU and compute environment.
Drive excellence in cloud cost optimization, capacity planning, and service reliability.

Cloud Operations & Control Plane

Architect and manage Hippocratic AI’s global GPU control plane, enabling dynamic provisioning, scheduling, and monitoring of inference workloads across regions and providers.
Lead the design and automation of deployments (AWS, GCP, Azure, on-prem) using infrastructure-as-code and CI/CD best practices.
Ensure strong security posture and compliance across all environments, aligned with HIPAA, SOC 2, and other healthcare data standards.

Observability & Reliability

Develop and scale comprehensive observability systems (covering telemetry, tracing, logging, and alerting) to ensure full visibility into production systems and AI workloads.
Establish SLOs, SLIs, and SLAs for all mission-critical services and infrastructure.
Implement robust incident management, root cause analysis, and continuous improvement processes.

Technical Strategy & Collaboration

Partner with AI and product teams to anticipate infrastructure needs and design scalable architectures for rapid experimentation and deployment.
Contribute to the design of internal developer platforms that improve productivity and standardization.
Evaluate emerging technologies (e.g., new GPU hardware, orchestration frameworks, data center partnerships) to advance our capabilities.

What You Bring

Must Have:

10+ years of engineering experience, including 5+ years leading infrastructure, SRE, or platform teams at scale.
Proven success in managing large-scale distributed systems and global cloud infrastructure.
Deep experience with high-performance computing or large-scale AI workloads.
Strong background in cloud platforms (AWS, GCP, Azure) and infrastructure-as-code (Terraform, Pulumi, etc.).
Expertise in observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.) and operational excellence.
Experience with security and compliance frameworks relevant to healthcare (HIPAA, SOC 2).
Exceptional communication skills and the ability to partner across product, AI research, and operations.

Nice-to-Have:

Experience designing or operating GPU control planes or schedulers (e.g., Kubernetes, Ray, Slurm, custom orchestration frameworks).
Prior work with ML infrastructure, data pipelines, or model-serving platforms.
Background in cost optimization and sustainability of GPU/compute operations.
Familiarity with edge or hybrid-cloud deployments for low-latency AI systems.

***Be aware of recruitment scams impersonating Hippocratic AI. All recruiting communication will come from @hippocraticai.com email addresses. We will never request payment or sensitive personal information during the hiring process. If anything