Senior Staff Machine Learning Engineer – Site Reliability Engineer

Company: ServiceNow
Location: Santa Clara, CA, USA
Salary: Not provided
Type: Full-Time
Experience Level: Senior

Requirements

  • Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI’s potential impact on the function or industry.
  • 8+ years of experience with infrastructure and platform operations, deployments, SRE, and DevOps, with a continued focus on improving platform health.
  • 6+ years of experience operating highly available distributed workloads on Kubernetes following a DevOps approach.
  • 6+ years of development experience with Python, Go, Java, or similar languages.
  • Experience with DevOps tooling (e.g., Helm, Ansible, Kubernetes, Prometheus, Splunk, GitLab CI).
  • Strong working experience operating distributed systems built on Linux and J2EE.
  • Experience with software-defined networking, infrastructure as code, and configuration management.
  • Experience building software for compliance and security in regulated environments.
  • Ability to drive outcomes in projects with material technical risk.

Responsibilities

  • Contribute to the design, development and implementation of infrastructure, platform, deployment and observability features that power AI workloads.
  • Collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.
  • Contribute to the continuous improvement of the SRE practice by turning operational use cases into requirements for software tooling.
  • Contribute to the execution of deployment and support activities for AI/ML developers.
  • Build high-quality, clean, scalable, and reusable code by enforcing best practices around software engineering architecture and processes (code reviews, unit testing, etc.).
  • Work with product owners to understand detailed requirements, and own your code from design and implementation through test automation and delivery of a high-quality product to our users.
  • Operate LLMs on NVIDIA GPUs.
  • Be a mentor for colleagues and help promote knowledge-sharing.
