Yuchi Wang (王宇驰)
Publications
[arXiv] RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
We propose RICO, a framework that refines image captions through visual reconstruction. Extensive experiments demonstrate that the approach significantly improves both caption accuracy and completeness (a toy sketch of the refinement loop follows this entry).
Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
PDF · Cite · Code
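Under one plausible reading of "visual reconstruction" (regenerate an image from the current caption, compare it with the original, and revise), the refinement loop might look like the sketch below. Every model call (generate_caption, revise_caption, reconstruct_image, image_similarity) is a hypothetical stub supplied by the caller, not the released RICO code.

from typing import Callable

def refine_caption(
    image,
    generate_caption: Callable,   # stub: image -> caption (e.g., an MLLM)
    revise_caption: Callable,     # stub: (caption, image, reconstruction) -> caption
    reconstruct_image: Callable,  # stub: caption -> image (a text-to-image model)
    image_similarity: Callable,   # stub: (image, image) -> score in [0, 1]
    max_rounds: int = 3,
    threshold: float = 0.9,
):
    """Hypothetical sketch: refine a caption until the image regenerated
    from it is similar enough to the original image."""
    caption = generate_caption(image)
    for _ in range(max_rounds):
        reconstruction = reconstruct_image(caption)
        if image_similarity(image, reconstruction) >= threshold:
            break  # the caption already explains the image well
        # Otherwise, ask the captioner to fix errors and add missing details.
        caption = revise_caption(caption, image, reconstruction)
    return caption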
[CVPR 2025] VidTwin: Video VAE with Decoupled Structure and Dynamics
We propose VidTwin, a compact video autoencoder that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements (a toy two-branch encoder sketch follows this entry).
Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
PDF · Cite · Code · Demo Page
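As a toy illustration of the decoupling, the sketch below encodes a clip into a slow, global Structure code and a per-frame Dynamics code. The layer choices and latent shapes are assumptions made for the example, not the VidTwin architecture.

import torch
import torch.nn as nn

class TwoBranchVideoEncoder(nn.Module):
    """Toy two-branch encoder: one shared backbone, two pooling heads."""

    def __init__(self, channels: int = 3, dim: int = 64):
        super().__init__()
        self.backbone = nn.Conv3d(channels, dim, kernel_size=3, padding=1)
        # Structure branch: pool away the time axis -> one slow, global code.
        self.structure_head = nn.AdaptiveAvgPool3d((1, 8, 8))
        # Dynamics branch: keep the time axis -> a low-res code per frame.
        self.dynamics_head = nn.AdaptiveAvgPool3d((None, 2, 2))

    def forward(self, video: torch.Tensor):
        # video: (batch, channels, time, height, width)
        feats = torch.relu(self.backbone(video))
        structure = self.structure_head(feats)  # (B, dim, 1, 8, 8)
        dynamics = self.dynamics_head(feats)    # (B, dim, T, 2, 2)
        return structure, dynamics

if __name__ == "__main__":
    clip = torch.randn(2, 3, 16, 64, 64)  # two 16-frame toy clips
    s, d = TwoBranchVideoEncoder()(clip)
    print(s.shape, d.shape)  # the two latents have deliberately different shapes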
[ACL 2025] Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints
We propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. Experiments show that SENSE consistently improves LLM performance across a range of tasks, highlighting the value of integrating semantic information into prompts (an illustrative prompt template follows this entry).
Kaikai An, Shuzheng Si, Helan Hu, Haozhe Zhao, Yuchi Wang, Qingyan Guo, Baobao Chang
Cite
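As an illustration of embedding a semantic hint in a prompt, a SENSE-style template could be assembled as below; the hint wording and template layout are assumptions for the example, not the paper's exact prompt.

def build_sense_prompt(sentence: str, task_instruction: str) -> str:
    # Hypothetical semantic hint: nudge the model toward the sentence's
    # predicate-argument structure before it answers.
    semantic_hint = (
        "Before answering, briefly consider the sentence's semantic "
        "structure: who does what to whom, and under what conditions."
    )
    return f"{task_instruction}\n\n{semantic_hint}\n\nInput: {sentence}\nAnswer:"

print(build_sense_prompt(
    "The committee that the senator criticized approved the bill.",
    "Identify the subject of the main clause.",
))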
[arXiv] Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement
We present MyTalk, which edits the lip movements in a talking video to match a given speech signal while preserving the speaker's identity and visual details.
Runyi Yu, Tianyu He, Ailing Zeng, Yuchi Wang, Junliang Guo, Xu Tan, Chang Liu, Jie Chen, Jiang Bian
PDF · Cite · Demo Page
[AAAI 2025] InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
We present InstructAvatar, a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and better generalization in the resulting videos.
Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian
PDF · Cite · Code · Demo Page
[arXiv] UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing
We present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing a pre-trained text-to-video generator within an inversion-then-generation pipeline (a schematic sketch follows this entry).
Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, Jiang Bian
PDF · Cite · Code · Demo Page
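The inversion-then-generation idea can be summarized schematically: invert the source video to a latent noise trajectory under the source prompt, then regenerate from that latent under the edited prompt, with no weights updated. Both model calls below are hypothetical stubs, not the UniEdit implementation.

from typing import Callable

def edit_video(
    source_video,
    source_prompt: str,
    edit_prompt: str,
    invert: Callable,    # stub: (video, prompt) -> latent noise (e.g., DDIM inversion)
    generate: Callable,  # stub: (latent, prompt) -> video, via a pre-trained T2V model
):
    """Tuning-free editing: only the prompt changes between the two passes."""
    latent = invert(source_video, source_prompt)  # recover the noise trajectory
    return generate(latent, edit_prompt)          # re-synthesize under the edit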
[Findings of ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
We present PCA-Bench, a multimodal decision-making benchmark that evaluates the integrated capabilities of Multimodal Large Language Models (MLLMs) along a perception-cognition-action chain (a toy scoring harness follows this entry).
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang
PDF · Cite · Code · Dataset
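One plausible shape for a perception-cognition-action scoring harness is sketched below: each example is scored at all three stages and per-stage accuracy is reported. The field names and exact-match rule are illustrative assumptions, not the PCA-Bench protocol.

from dataclasses import dataclass

@dataclass
class Example:
    perception_gt: str  # what the model should perceive
    cognition_gt: str   # the reasoning it should produce
    action_gt: str      # the decision it should take

def evaluate(examples, predict):
    """predict(example) -> (perception, cognition, action) strings (stub)."""
    stages = ("perception", "cognition", "action")
    correct = {s: 0 for s in stages}
    for ex in examples:
        pred = dict(zip(stages, predict(ex)))
        gts = {"perception": ex.perception_gt,
               "cognition": ex.cognition_gt,
               "action": ex.action_gt}
        for s in stages:
            correct[s] += int(pred[s].strip() == gts[s].strip())
    # Per-stage accuracy, guarded against an empty example list.
    return {s: correct[s] / max(len(examples), 1) for s in stages}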
LLMs as Trustworthy Financial Advisors: Rationalizing Multimodal Stock Movement Prediction with Chain-of-Thought
Yi Liu, Yuchi Wang, Lei Li, Shicheng Li, Ruihan Bao, Keiko Harimoto, Xu Sun
Cite