Software Engineer Graduate – AIGC Platform – Monetization GenAI

Company: ByteDance
Location: San Jose, CA, USA
Salary: Not Provided
Type: Full-Time
Degrees: PhD
Experience Level: Entry Level/New Grad

Requirements

  • PhD in Computer Science, Computer Engineering, or a related field.
  • Proficiency in Python and familiarity with deep learning frameworks like PyTorch.
  • Strong skills in Linux, Docker, and Kubernetes, plus a solid grasp of high-performance computing principles and Infrastructure as Code (IaC).
  • Demonstrated expertise in scaling and optimizing generative AI engineering tasks in GPU-intensive environments.
  • Expertise in scaling generative AI models using sequence parallel, model parallel, and pipeline parallel techniques on multiple GPUs.
  • Proven ability to guide and automate the acceleration of model deployment efficiently, enhancing platform capabilities and reducing time-to-market for new features.

Responsibilities

  • Work closely with infrastructure architects and SREs to enhance the Generative AI Platform’s availability, scalability, and cost-efficiency.
  • Engineer robust, high-performance data processing and large language model training/inference pipelines; drive engineering excellence and optimization initiatives to ensure the most effective use of resources, including cost optimization and performance tuning of the ML platform.
  • Provide a cutting-edge platform to model researchers and data pipeline engineers, accelerating the development and deployment of innovative ML models.
  • Stay abreast of the latest advancements in machine learning infrastructure to implement solutions that enhance platform efficiency and performance.

Preferred Qualifications

  • Strong preference for candidates with solid experience in CUDA and in FP8/FP4 optimization for training and inference.
  • Deep understanding of cloud infrastructure platforms like GCP/Azure, and experience collaborating with DevOps/SRE teams for large-scale ML project deployments.
  • Familiarity with scheduling services such as Slurm or Kubernetes Volcano, or with third-party tools such as Run.AI.
  • Experience in large-scale ML training and deployment, and familiarity with distributed computing frameworks such as Ray.io.
  • Strong problem-solving skills and proficiency in communication, collaboration, and project management.