Principal AIOps Engineer – Enterprise AI Platform

Company: Palo Alto Networks
Location: Santa Clara, CA, USA
Salary: Not Provided
Type: Full-Time
Degrees: Master’s, PhD
Experience Level: Senior, Expert or higher

Requirements

  • 10+ years of experience in software engineering, reliability engineering, or IT operations, including at least 5 years leading the design and implementation of AIOps solutions at scale
  • Proven expertise in applying machine learning algorithms and data analysis techniques to solve complex IT operational challenges
  • Strong hands-on experience in building and maintaining scalable data pipelines and workflows for efficient data collection, processing, and analysis from diverse IT sources
  • Proficiency in programming languages such as Python, Go, Java, or Scala
  • Extensive experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes)
  • Familiarity with data processing frameworks (e.g., Apache Kafka, Apache Spark) and IT monitoring tools (e.g., Prometheus, Grafana, Datadog, Splunk)
  • Deep understanding of distributed systems architecture, microservices, and their operational challenges
  • Demonstrated ability to translate business requirements and operational pain points into technical specifications and deliver robust AIOps solutions
  • Excellent problem-solving skills and the ability to troubleshoot complex platform-related issues
  • Strong communication and interpersonal skills, with a track record of influencing technical and cross-functional stakeholders

Responsibilities

  • Design, develop, and implement advanced AIOps solutions, leveraging machine learning algorithms and data analytics to automate and enhance IT operations
  • Lead the implementation of AI/ML models for proactive anomaly detection, root cause analysis, and predictive insights into system health and performance across applications and infrastructure at enterprise scale
  • Drive the automation of routine operational tasks, incident response, and remediation workflows using AI-driven agents and orchestration tools, minimizing manual intervention and improving operational efficiency
  • Collaborate with observability teams to ensure the efficient collection, processing, and transformation of high-volume, cross-domain data from diverse sources (events, logs, metrics, tickets, monitoring tools) into actionable intelligence for the AIOps platform
  • Integrate AIOps insights with existing incident management systems, providing real-time intelligence to rapidly identify, diagnose, and resolve IT issues, leading to proactive issue resolution and reduced mean time to recovery (MTTR)
  • Utilize AI insights to continuously monitor, analyze, and fine-tune IT systems for peak operational efficiency, capacity planning, and resource optimization
  • Provide technical leadership and mentorship to other engineers, promoting architectural excellence, innovation, and best practices in AIOps development and operations
  • Partner with data scientists, ML engineers, software engineers, SREs, and IT operations teams to integrate AI/ML agents into the platform and ensure AIOps solutions align with business needs and deliver measurable ROI
  • Actively research and evaluate emerging AIOps technologies, generative AI, large language models (LLMs), ChatOps AI, and advanced retrieval-augmented generation (RAG) techniques, bringing promising innovations into production through POCs and long-term architectural evolution

Preferred Qualifications

  • Master’s degree or Ph.D. in Computer Science, Machine Learning, or a related technical field, or equivalent military experience, required
  • Experience with agentic systems and AI agents for automation
  • Experience with DevOps practices and CI/CD pipelines in an AIOps context
  • Prior experience in cybersecurity operations or building AIOps solutions for security threat detection and response