Technical Specialist – Cloud Infrastructure
| Company | Lucid Motors |
|---|---|
| Location | Newark, CA, USA |
| Salary | $138,200 – $202,730 |
| Type | Full-Time |
| Degrees | Bachelor’s, Master’s |
| Experience Level | Expert or higher |
Requirements
- B.S. or M.S. degree in Computer Science, Engineering, or a related technical field; equivalent experience may be considered in lieu of a degree.
- 8+ years of experience in Cloud Infrastructure, Site Reliability Engineering (SRE), DevOps Engineering, or related fields.
- 4+ years of hands-on experience deploying, managing, and optimizing containerized applications using Kubernetes in both public and private cloud environments (AWS, GCP, Azure, etc.).
- 4+ years of experience with Infrastructure-as-Code (IaC), using Terraform, Cluster API, or similar automation frameworks to manage cloud infrastructure.
- Experience in scripting or programming with Python, Go, Bash/Shell, or similar languages.
- Strong understanding of Prometheus, Grafana, and other monitoring and observability tools.
- Ability to effectively diagnose and resolve performance bottlenecks within AWS at the infrastructure and application layers.
Responsibilities
- Own and enhance the reliability of services deployed across various cloud regions. You will proactively monitor, automate, and scale services to ensure seamless uptime and performance.
- Lead the containerization and deployment of microservices and data pipelines on Kubernetes using Helm charts, following best practices for scalability and fault tolerance.
- Foster and advocate for a DevOps culture that emphasizes automation, self-service, and engineering excellence. Enable development teams to manage and deploy applications seamlessly with minimal intervention.
- Implement autoscaling strategies and monitor the performance of applications and infrastructure with tools like Prometheus, Grafana, and other observability platforms.
- Perform SRE tasks such as availability monitoring, incident response, post-mortem analysis, and preparing reliability reports for leadership and stakeholders.
- Deploy, configure, and maintain essential cloud services and tools including Kafka, Spark, Presto, Airflow, MQTT, and other microservices platforms in a cloud-native environment.
- Set up and manage cloud infrastructure using tools like Terraform, Cluster API, and other IaC frameworks, ensuring seamless provisioning, management, and scaling of resources.
- Continuously enhance and automate alerting, incident detection, and recovery mechanisms for critical applications and services to minimize downtime and improve system reliability.
- Participate in an on-call rotation to meet business SLAs, quickly troubleshoot and resolve issues, and document runbooks for consistent incident management processes.
- Work closely with Product Owners, Engineering Managers, and cross-functional teams in Agile Scrum and Kanban workflows to deliver iterative improvements and meet evolving business needs.
- Perform impact analysis during incidents, collaborate with teams for root cause analysis, and implement preventive measures to avoid recurrence.
Preferred Qualifications
- Configuration Management: Experience with configuration management and automation tools such as Ansible, Chef, or Puppet.