Distributed Systems Engineer – AI Inference Platform

Company: Cerebras
Location: Toronto, ON, Canada; Sunnyvale, CA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior

Requirements

  • Bachelor’s or master’s degree in computer science or related field, or equivalent practical experience.
  • 5+ years of software engineering experience, with a strong focus on distributed systems architecture and optimization.
  • Deep understanding of distributed systems principles.
  • Proven experience with container orchestration technologies, particularly Kubernetes (K8s).
  • Strong programming skills in Python. C++ experience is a plus.
  • Experience with distributed messaging systems or RPC frameworks.
  • Experience designing for high availability, fault tolerance, and scalability.
  • Strong debugging and performance analysis skills in distributed environments.
  • Familiarity with cloud-native technologies and microservices architectures.

Responsibilities

  • Design, build, and operate foundational distributed systems components that power the Inference Platform with high availability, scalability, and performance.
  • Architect and implement the core logic for distributed request routing, dynamic load balancing, replica synchronization, and distributed metadata management.
  • Develop and enhance the fault tolerance and auto-recovery mechanisms for platform services and inference replicas.
  • Optimize communication patterns and data flow between microservices to ensure minimal latency and maximal throughput at scale.
  • Contribute to the design and implementation of the distributed orchestration and scheduling system for managing inference workloads and resources.
  • Implement and refine monitoring, tracing, and alerting for distributed system components to ensure operational excellence.
  • Collaborate closely with hardware, ML, and other software teams to ensure seamless integration and end-to-end system performance.
  • Debug complex issues spanning multiple services and systems in a distributed environment.

Preferred Qualifications

    No preferred qualifications provided.