VidTwin: Video VAE with Decoupled Structure and Dynamics


Yuchi Wang¹² Junliang Guo² Xinyi Xie²³ Tianyu He² Xu Sun¹ Jiang Bian²

¹Peking University ²Microsoft Research Asia ³CUHK (SZ)

[Paper] [arXiv] [Code]

[Cross-reenactment GIF examples: Video A (offers S. Latent), Video B (offers D. Latent), Video C (generated).]

*Video A provides the Structure Latent, Video B provides the Dynamics Latent, and Video C is the generated video combining them.



[Decoupling GIF examples: Orig., Recon., S. Recon., and D. Recon. panels for each video.]

*We present the original (Orig.) and reconstructed (Recon.) videos, along with videos decoded using only the Structure or Dynamics latents, denoted as S. Recon. and D. Recon., respectively.


Abstract

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.

Method

Details of our model. After obtaining the latent \(z\) from the Encoder, the process branches into two flows. The Structure Latent extraction module, \(\mathcal{F}_{\boldsymbol{S}}\), which consists of a Q-Former and convolutional networks, extracts the Structure Latent component \(z_{\boldsymbol{S}}\). The Dynamics Latent extraction module, \(\mathcal{F}_{\boldsymbol{D}}\), comprising convolutional networks and an averaging operator, extracts the Dynamics Latent component \(z_{\boldsymbol{D}}\). Finally, using the decoding module, we align all latents to the same dimension and combine them before passing them into the Decoder.
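
To make the two branches concrete, below is a minimal PyTorch sketch of the extraction modules described above. The tensor shapes, channel sizes, and module internals (e.g., approximating the Q-Former with learned queries and a single cross-attention layer, and using 1x1 convolutions for downsampling) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn


class StructureExtractor(nn.Module):
    """F_S (sketch): learned queries cross-attend to per-frame tokens
    (Q-Former-style), then a channel-reducing projection discards
    redundant content detail."""

    def __init__(self, dim=256, num_queries=16, out_dim=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.down = nn.Conv1d(dim, out_dim, kernel_size=1)

    def forward(self, z):                                      # z: (B, T, N, C) encoder tokens
        B, T, N, C = z.shape
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)    # (B*T, Q, C)
        kv = z.reshape(B * T, N, C)
        s, _ = self.cross_attn(q, kv, kv)                      # low-frequency structure tokens
        s = self.down(s.transpose(1, 2)).transpose(1, 2)       # (B*T, Q, out_dim)
        return s.reshape(B, T, -1, s.shape[-1])                # z_S: (B, T, Q, out_dim)


class DynamicsExtractor(nn.Module):
    """F_D (sketch): a lightweight convolution over tokens, then averaging
    over the spatial axis, keeping a small per-frame motion vector."""

    def __init__(self, dim=256, out_dim=8):
        super().__init__()
        self.conv = nn.Conv1d(dim, out_dim, kernel_size=1)

    def forward(self, z):                                      # z: (B, T, N, C)
        B, T, N, C = z.shape
        d = self.conv(z.reshape(B * T, N, C).transpose(1, 2))  # (B*T, out_dim, N)
        d = d.mean(dim=-1)                                     # average over spatial positions
        return d.reshape(B, T, -1)                             # z_D: (B, T, out_dim)


# Toy usage: an 8-frame clip encoded into 64 tokens of width 256 per frame.
z = torch.randn(2, 8, 64, 256)
z_S, z_D = StructureExtractor()(z), DynamicsExtractor()(z)

The key design point the sketch tries to convey is the asymmetry of the two branches: the Structure branch keeps a few query tokens per frame but compresses channels, while the Dynamics branch collapses the spatial axis entirely and keeps only a compact per-frame vector.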

Additional Decoupling Examples


[Additional decoupling GIF examples: Orig., Recon., S. Recon., and D. Recon. panels for each video.]

To investigate the roles of the Structure Latent and Dynamics Latent, we decode them individually by passing each through the decoder, i.e., generating results from \(\mathcal{D}(u_{\boldsymbol{S}})\) and \(\mathcal{D}(u_{\boldsymbol{D}})\). The results reveal that the Structure Latent captures the main semantic content of the video, while the Dynamics Latent captures fine-grained details, such as color and rapid local movements. Notably, in many cases, videos generated using the Dynamics Latent exhibit faster movements compared to those generated with the Structure Latent. This observation highlights the distinct contributions of low-frequency and high-frequency motion information in the latent representations.
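
As a rough illustration of how these single-latent decodings could be obtained, the sketch below zeroes out one branch before decoding. The helper names (align_s, align_d, decode) and the additive combination of the aligned latents are assumptions made for illustration, not the actual VidTwin interface.

import torch

def decode_structure_only(model, z_S, z_D):
    # Keep the Structure branch; replace the Dynamics contribution with zeros.
    u_S = model.align_s(z_S)                    # align to the decoder dimension
    u_D = torch.zeros_like(model.align_d(z_D))  # suppress Dynamics information
    return model.decode(u_S + u_D)              # ~ D(u_S)

def decode_dynamics_only(model, z_S, z_D):
    # Keep the Dynamics branch; replace the Structure contribution with zeros.
    u_S = torch.zeros_like(model.align_s(z_S))
    u_D = model.align_d(z_D)
    return model.decode(u_S + u_D)              # ~ D(u_D)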



Additional Cross-Reenactment Examples


[Additional cross-reenactment GIF examples: Video A (offers S. Latent), Video B (offers D. Latent), Video C (generated).]

We conduct a cross-reenactment experiment by combining the Structure Latent from one video, A, with the Dynamics Latent from another video, B, to generate the output from the decoder, represented as \(\mathcal{D}(u^A_{\boldsymbol{S}}, u^B_{\boldsymbol{D}})\). The generated video typically inherits the main object and overall structure from Video A, which provides the Structure Latent, while local details, such as color, come from Video B, which provides the Dynamics Latent. Notably, the movement in the generated video combines the rapid rotation from Video B with the gradual camera motion from Video A.
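
A hedged sketch of this cross-reenactment setup is given below: encode both videos, take the aligned Structure latent from Video A and the aligned Dynamics latent from Video B, and decode them together. As in the previous sketch, the method names on model and the additive combination are hypothetical.

def cross_reenact(model, video_a, video_b):
    z_a = model.encode(video_a)                             # shared encoder backbone
    z_b = model.encode(video_b)
    u_S_a = model.align_s(model.extract_structure(z_a))     # u^A_S from video A
    u_D_b = model.align_d(model.extract_dynamics(z_b))      # u^B_D from video B
    return model.decode(u_S_a + u_D_b)                      # ~ D(u^A_S, u^B_D)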



Reconstruction Examples


[Reconstruction GIF examples: Orig. and Recon. panels for each video.]

We present the original video (Orig.) alongside videos reconstructed by our model (Recon.), demonstrating the model's ability to effectively capture intricate details and preserve rapid motion dynamics, such as the light trails of a fast-moving car.



[Comparison GIF examples: Orig., iVideoGPT, MAGVIT-v2, EMU3, CMD, and VidTwin (Ours) reconstructions for each video.]

We also present the reconstruction performance of our model compared to the baselines. Two examples are showcased: a gradually rotating photo and a fast-motion boxing scene. VidTwin demonstrates superior ability in reconstructing fine details and accurately capturing rapid motion.



Citation

@misc{wang2024vidtwinvideovaedecoupled,
      title={VidTwin: Video VAE with Decoupled Structure and Dynamics}, 
      author={Yuchi Wang and Junliang Guo and Xinyi Xie and Tianyu He and Xu Sun and Jiang Bian},
      year={2024},
      eprint={2412.17726},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.17726}, 
}