Hi! I’m a PhD student at MMLab@CUHK, where I am supervised by Prof. Hongsheng Li. Prior to this, I completed my Master’s degree at the AAIS, Peking University, during which I was a member of the Lanco Lab, led by Prof. Xu Sun. I earned my Bachelor’s degree from the School of Data Science, Fudan University.
My research interests encompass (1) multimodal learning, including visual understanding, text-guided visual generation (e.g., image/video generation and editing, talking face generation), etc.; (2) generative models such as diffusion models and VAEs; and (3) LLMs (especially multimodal large language models) and their applications, such as embodied AI.
AI PhD, CUHK, 2029 (Expected)
MMLab; Advisor: Hongsheng Li
MSc in Data Science, Peking University, 2025
Lanco Lab; Advisor: Xu Sun; AAIS
BSc in Data Science, Fudan University, 2022
School of Data Science; GPA Rank: 3/85
We propose RICO, a novel framework that refines captions through visual reconstruction. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness.
We propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements.
We present InstructAvatar, a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and better generalizability in the resulting videos.
We explore the previously untapped advantages of diffusion models over autoregressive (AR) methods in image-to-text generation. Through careful design of a latent-based diffusion model tailored for captioning, we achieve performance comparable to strong AR baselines.
I have served as a reviewer for the following venues: ARR (ACL 2025, EMNLP 2025).
I served as a teaching assistant for the course Large Language Model in Decision Intelligence (PKU, Spring 2024). The course is tailored for undergraduates and aims to provide a foundational understanding of large language models and effective strategies for using them.