Principal AIOps Engineer – Enterprise AI Platform
Company | Palo Alto Networks |
---|---|
Location | Santa Clara, CA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Master’s, PhD |
Experience Level | Senior, Expert or higher |
Requirements
- 10+ years of experience in software engineering, reliability engineering, or IT operations, including at least 5 years leading the design and implementation of AIOps solutions at scale
- Proven expertise in applying machine learning algorithms and data analysis techniques to solve complex IT operational challenges
- Strong hands-on experience in building and maintaining scalable data pipelines and workflows for efficient data collection, processing, and analysis from diverse IT sources
- Proficiency in programming languages such as Python, Go, Java, or Scala
- Extensive experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes)
- Familiarity with data processing frameworks (e.g., Apache Kafka, Apache Spark) and IT monitoring tools (e.g., Prometheus, Grafana, Datadog, Splunk)
- Deep understanding of distributed systems architecture, microservices, and their operational challenges
- Demonstrated ability to translate business requirements and operational pain points into technical specifications and deliver robust AIOps solutions
- Excellent problem-solving skills and the ability to troubleshoot complex platform-related issues
- Strong communication and interpersonal skills, with a track record of influencing technical and cross-functional stakeholders
Responsibilities
- Design, develop, and implement advanced AIOps solutions, leveraging machine learning algorithms and data analytics to automate and enhance IT operations
- Lead the implementation of AI/ML models for proactive anomaly detection, root cause analysis, and predictive insights into system health and performance across applications and infrastructure at enterprise scale
- Drive the automation of routine operational tasks, incident response, and remediation workflows using AI-driven agents and orchestration tools, minimizing manual intervention and improving operational efficiency
- Collaborate with observability teams to ensure the efficient collection, processing, and transformation of high-volume, cross-domain data from diverse sources (events, logs, metrics, tickets, monitoring tools) into actionable intelligence for the AIOps platform
- Integrate AIOps insights with existing incident management systems, providing real-time intelligence to rapidly identify, diagnose, and resolve IT issues, leading to proactive issue resolution and reduced mean time to recovery (MTTR)
- Utilize AI insights to continuously monitor, analyze, and fine-tune IT systems for peak operational efficiency, capacity planning, and resource optimization
- Provide technical leadership and mentorship to other engineers, promoting architectural excellence, innovation, and best practices in AIOps development and operations
- Partner with data scientists, ML engineers, software engineers, SREs, and IT operations teams to integrate AI/ML agents into the platform and ensure AIOps solutions align with business needs and deliver measurable ROI
- Actively research and evaluate emerging AIOps technologies, generative AI, LLMs, ChatOps AI, and advanced RAG techniques, bringing promising innovations into production through POCs and long-term architectural evolution
Preferred Qualifications
- Master’s degree or Ph.D. in Computer Science, Machine Learning, or a related technical field, or equivalent military experience, required
- Experience with agentic systems and AI agents for automation
- Experience with DevOps practices and CI/CD pipelines in an AIOps context
- Prior experience in cybersecurity operations or building AIOps solutions for security threat detection and response