Technical Specialist – Cloud Infrastructure

Company: Lucid Motors
Location: Newark, CA, USA
Salary: $138,200 – $202,730
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Expert or higher

Requirements

  • B.S. or M.S. degree in Computer Science, Engineering, or a related technical field; equivalent experience may be considered in lieu of a degree.
  • 8+ years of experience in Cloud Infrastructure, Site Reliability Engineering (SRE), DevOps Engineering, or related fields.
  • 4+ years of hands-on experience deploying, managing, and optimizing containerized applications using Kubernetes in both public and private cloud environments (AWS, GCP, Azure, etc.).
  • 4+ years of experience with Infrastructure-as-Code (IaC) using Terraform, Cluster API, or similar automation frameworks to manage cloud infrastructure.
  • Experience in scripting or programming with Python, Go, Bash/Shell, or similar languages.
  • Strong working knowledge of Prometheus, Grafana, and other monitoring and observability tools.
  • Ability to effectively diagnose and resolve performance bottlenecks within AWS at the infrastructure and application layers.

Responsibilities

  • Own and enhance the reliability of services deployed across various cloud regions. You will proactively monitor, automate, and scale services to ensure seamless uptime and performance.
  • Lead the containerization and deployment of microservices and data pipelines on Kubernetes using Helm charts, following best practices for scalability and fault tolerance.
  • Foster and advocate for a DevOps culture that emphasizes automation, self-service, and engineering excellence. Enable development teams to manage and deploy applications seamlessly with minimal intervention.
  • Implement autoscaling strategies and monitor the performance of applications and infrastructure with tools like Prometheus, Grafana, and other observability platforms.
  • Perform SRE tasks such as availability monitoring, incident response, post-mortem analysis, and preparing reliability reports for leadership and stakeholders.
  • Deploy, configure, and maintain essential cloud services and tools including Kafka, Spark, Presto, Airflow, MQTT, and other microservices platforms in a cloud-native environment.
  • Set up and manage cloud infrastructure using tools like Terraform, Cluster API, and other IaC frameworks, ensuring seamless provisioning, management, and scaling of resources.
  • Continuously enhance and automate alerting, incident detection, and recovery mechanisms for critical applications and services to minimize downtime and improve system reliability.
  • Participate in an on-call rotation to meet business SLAs, quickly troubleshoot and resolve issues, and document runbooks for consistent incident management processes.
  • Work closely with Product Owners, Engineering Managers, and cross-functional teams in Agile Scrum and Kanban workflows to deliver iterative improvements and meet evolving business needs.
  • Perform impact analysis during incidents, collaborate with teams for root cause analysis, and implement preventive measures to avoid recurrence.

Preferred Qualifications

  • Configuration Management: Experience with configuration management and automation tools such as Ansible, Chef, or Puppet.