Software Engineer Graduate – AIGC Platform – Monetization GenAI
| Company | ByteDance |
|---|---|
| Location | San Jose, CA, USA |
| Salary | Not Provided |
| Type | Full-Time |
| Degrees | PhD |
| Experience Level | Entry Level/New Grad |
Requirements
- PhD in Computer Science, Computer Engineering, or a related field.
- Proficiency in Python and familiarity with deep learning frameworks like PyTorch.
- Strong skills in Linux, Docker, Kubernetes, and high-performance computing principles, including Infrastructure as Code (IaC).
- Demonstrated expertise in scaling and optimizing generative AI engineering tasks in GPU-intensive environments.
- Expertise in scaling generative AI models using sequence parallel, model parallel, and pipeline parallel techniques on multiple GPUs.
- Proven ability to guide and automate the acceleration of model deployment efficiently, enhancing platform capabilities and reducing time-to-market for new features.
Responsibilities
- Work closely with infrastructure architects and SREs to enhance the Generative AI Platform’s availability, scalability, and cost-efficiency.
- Engineer robust, high-performance data processing and large language model training/inference pipelines; drive engineering excellence and optimization initiatives to ensure the most effective use of resources, including cost optimization and performance tuning of the ML platform.
- Provide a cutting-edge platform to model researchers and data pipeline engineers, accelerating the development and deployment of innovative ML models.
- Stay abreast of the latest advancements in machine learning infrastructure to implement solutions that enhance platform efficiency and performance.
Preferred Qualifications
- Strong preference for candidates with experience in CUDA and FP8/FP4 optimization for training and inference.
- Deep understanding of cloud infrastructure platforms like GCP/Azure, and experience collaborating with DevOps/SRE teams for large-scale ML project deployments.
- Familiarity with scheduling services such as Slurm or Kubernetes Volcano, or third-party tools such as Run.AI.
- Experience in large-scale ML training and deployment, familiarity with distributed computing frameworks such as Ray.io.
- Strong problem-solving skills and proficiency in communication, collaboration, and project management.