Yuchi Wang (王宇驰)
Publications
Conference Paper
[NAACL 2024] LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
We explore the previously untapped advantages of diffusion models over autoregressive (AR) methods in image-to-text generation. Through the meticulous design of a latent-based diffusion model tailored for captioning, we achieve performance comparable to strong AR baselines.
Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun
PDF
Cite
Code
[ICLR 2024] GAIA: Zero-shot Talking Avatar Generation
An internal project at Microsoft. We designed a codec that disentangles each frame of a talking-face video into motion and appearance representations, then curated a large-scale, high-quality dataset to train our diffusion-based GAIA model. The results demonstrate remarkable naturalness and scalability.
Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian
PDF
Cite
Demo Page (Anonymous version)
[FMDM@NeurIPS 2023] Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond
We find that powerful multimodal LLMs such as GPT4-Vision make end-to-end embodied decision making more feasible than ever. Moreover, we propose a new benchmark, PCA-EVAL, and a multi-agent cooperation framework, HOLMES, for evaluation.
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang
PDF
Cite
Code
Dataset