Hi! I’m a PhD student at MMLab@CUHK, where I am supervised by Prof. Hongsheng Li. Prior to this, I completed my Master’s degree at the AAIS, Peking University, during which I was a member of the Lanco Lab, led by Prof. Xu Sun. I earned my Bachelor’s degree from the School of Data Science, Fudan University.
My research interests encompass (1) multimodal learning, including visual understanding, text-guided visual generation (e.g., image/video generation and editing, talking face generation), etc.; (2) generative models such as diffusion models and VAEs; and (3) LLMs (especially multimodal large language models) and their applications, such as embodied AI.
AI PhD, CUHK, 2029 (Expected)
MMLab; Advisor: Hongsheng Li
MSc in Data Science, Peking University, 2025
Lanco Lab; Advisor: Xu Sun; AAIS
BSc in Data Science, Fudan University, 2022
School of Data Science; GPA Rank: 3/85
We propose RICO, a novel framework that refines captions through visual reconstruction. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness.
We propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements.
We present InstructAvatar, a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and better generalizability in the resulting videos.
We explore the previously untapped advantages of diffusion models over autoregressive (AR) methods in image-to-text generation. Through careful design of a latent-based diffusion model tailored for captioning, we achieve performance comparable to strong AR baselines.
I have served as a reviewer for the following venues: ARR (ACL 2025, EMNLP 2025).
I served as a teaching assistant for the course Large Language Model in Decision Intelligence (PKU, Spring 2024). The course is tailored for undergraduates and aims to provide a foundational understanding of large language models and effective strategies for using them.