Distributed Systems Engineer – AI Inference Platform
Company | Cerebras |
---|---|
Location | Toronto, ON, Canada; Sunnyvale, CA, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior |
Requirements
- Bachelor’s or master’s degree in computer science or related field, or equivalent practical experience.
- 5+ years of software engineering experience, with a strong focus on distributed systems architecture and optimization.
- Deep understanding of distributed systems principles.
- Proven experience with container orchestration technologies, particularly Kubernetes (K8s).
- Strong programming skills in Python. C++ experience is a plus.
- Experience with distributed messaging systems or RPC frameworks.
- Experience designing for high availability, fault tolerance, and scalability.
- Strong debugging and performance analysis skills in distributed environments.
- Familiarity with cloud-native technologies and microservices architectures.
Responsibilities
- Design, build, and operate foundational distributed systems components that power the Inference Platform with high availability, scalability, and performance.
- Architect and implement the core logic for distributed request routing, dynamic load balancing, replica synchronization, and distributed metadata management.
- Develop and enhance the fault tolerance and auto-recovery mechanisms for platform services and inference replicas.
- Optimize communication patterns and data flow between microservices to ensure minimal latency and maximal throughput at scale.
- Contribute to the design and implementation of the distributed orchestration and scheduling system for managing inference workloads and resources.
- Implement and refine monitoring, tracing, and alerting for distributed system components to ensure operational excellence.
- Collaborate closely with hardware, ML, and other software teams to ensure seamless integration and end-to-end system performance.
- Debug complex issues spanning multiple services and systems in a distributed environment.
Preferred Qualifications
- No preferred qualifications provided.