Senior Staff Machine Learning Engineer – Site Reliability Engineer
Company | ServiceNow |
---|---|
Location | Santa Clara, CA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI’s potential impact on the function or industry.
- 8+ years of experience with infrastructure and platform operations, deployments, SRE, and DevOps, with a continued focus on improving platform health.
- 6+ years of experience operating highly available distributed workloads on Kubernetes following a DevOps approach.
- 6+ years of development experience with Python, Go, Java, or similar languages.
- Experience with DevOps tooling (e.g. Helm / Ansible / Kubernetes / Prometheus / Splunk / GitLab CI).
- Strong working experience operating distributed systems built on Linux and J2EE.
- Experience with software-defined networking, infrastructure as code, and configuration management.
- Experience building software for compliance and security in regulated environments.
- Ability to drive outcomes in projects with material technical risk.
Responsibilities
- Contribute to the design, development and implementation of infrastructure, platform, deployment and observability features that power AI workloads.
- Collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.
- Contribute to the continuous improvement of the SRE practice by turning operational use cases into requirements for software tooling.
- Contribute to the execution of deployment and support activities for AI/ML developers.
- Build high-quality, clean, scalable, and reusable code by enforcing best practices around software engineering architecture and processes (code reviews, unit testing, etc.).
- Work with product owners to understand detailed requirements, and own your code from design through implementation, test automation, and delivery of a high-quality product to our users.
- Experience with operating LLMs on NVIDIA GPUs.
- Be a mentor for colleagues and help promote knowledge-sharing.
Preferred Qualifications
- No preferred qualifications provided.