Research Scientist, Foundation Model – Vision and Language

Company: ByteDance
Location: Seattle, WA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Mid Level, Senior

Requirements

  • Research and engineering experience in one or more areas of computer vision and natural language processing, including but not limited to multimodal understanding and vision-and-language topics such as multimodal pre-training, visual instruction tuning, and alignment learning.
  • Experience working with and building very large-scale datasets to scale up foundation models.
  • Experience with language models and applying them to various downstream tasks.
  • Highly competent in algorithms and programming; strong coding skills in Python and popular deep learning frameworks.
  • Ability to work and collaborate well with team members.
  • Ability to work independently; strong communication skills.

Responsibilities

  • Conduct cutting-edge research and development in computer vision and natural language processing, especially in multimodality and vision-and-language.
  • Enhance multimodal understanding and reasoning (over images, videos, etc.) throughout the entire development process, encompassing data acquisition, model evaluation, pre-training, SFT, reward modeling, and reinforcement learning, to bolster overall performance.
  • Synthesize large-scale, high-quality multimodal data through methods such as rewriting, augmentation, and generation to improve the abilities of foundation models at various stages (pre-training, SFT, RLHF).
  • Investigate and implement robust evaluation methodologies to assess model performance at various stages (from covering diverse multimodal skills to improving user preference alignment), unravel the underlying mechanisms and sources of model abilities, and use this understanding to drive model improvements.

Preferred Qualifications

  • Publications in top-tier venues such as CVPR, ECCV, ICCV, NeurIPS, ICLR, ICML, EMNLP, ACL, NAACL, etc.
  • Impactful open-source projects on GitHub and a demonstrated engineering ability to quickly solve new challenges.