Principal AIOps Engineer - Enterprise AI Platform

Company	Palo Alto Networks
Location	Santa Clara, CA, USA
Salary	Not Provided
Type	Full-Time
Degrees	Master’s, PhD
Experience Level	Senior, Expert or higher

Requirements

10+ years of experience in software engineering, reliability engineering, or IT operations, including at least 5 years leading the design and implementation of AIOps solutions at scale
Proven expertise in applying machine learning algorithms and data analysis techniques to solve complex IT operational challenges
Strong hands-on experience in building and maintaining scalable data pipelines and workflows for efficient data collection, processing, and analysis from diverse IT sources
Proficiency in programming languages such as Python, Go, Java, or Scala
Extensive experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes)
Familiarity with data processing frameworks (e.g., Apache Kafka, Apache Spark) and IT monitoring tools (e.g., Prometheus, Grafana, Datadog, Splunk)
Deep understanding of distributed systems architecture, microservices, and their operational challenges
Demonstrated ability to translate business requirements and operational pain points into technical specifications and deliver robust AIOps solutions
Excellent problem-solving skills and the ability to troubleshoot complex platform-related issues
Strong communication and interpersonal skills, with a track record of influencing technical and cross-functional stakeholders

Responsibilities

Design, develop, and implement advanced AIOps solutions, leveraging machine learning algorithms and data analytics to automate and enhance IT operations
Lead the implementation of AI/ML models for proactive anomaly detection, root cause analysis, and predictive insights into system health and performance across applications and infrastructure at enterprise scale
Drive the automation of routine operational tasks, incident response, and remediation workflows using AI-driven agents and orchestration tools, minimizing manual intervention and improving operational efficiency
Collaborate with observability teams to ensure the efficient collection, processing, and transformation of high-volume, cross-domain data from diverse sources (events, logs, metrics, tickets, monitoring tools) into actionable intelligence for the AIOps platform
Integrate AIOps insights with existing incident management systems, providing real-time intelligence to rapidly identify, diagnose, and resolve IT issues, leading to proactive issue resolution and reduced mean time to recovery (MTTR)
Utilize AI insights to continuously monitor, analyze, and fine-tune IT systems for peak operational efficiency, capacity planning, and resource optimization
Provide technical leadership and mentorship to other engineers, promoting architectural excellence, innovation, and best practices in AIOps development and operations
Partner with data scientists, ML engineers, software engineers, SREs, and IT operations teams to integrate AI/ML agents into the platform and ensure AIOps solutions align with business needs and deliver measurable ROI
Actively research and evaluate emerging AIOps technologies, including generative AI, LLMs, AI-driven ChatOps, and advanced retrieval-augmented generation (RAG) techniques, bringing promising innovations into production through POCs and long-term architectural evolution

Preferred Qualifications

Master’s degree or Ph.D. in Computer Science, Machine Learning, or a related technical field, or equivalent military experience, required
Experience with agentic systems and AI agents for automation
Experience with DevOps practices and CI/CD pipelines in an AIOps context
Prior experience in cybersecurity operations or building AIOps solutions for security threat detection and response