VidTwin: Video VAE with Decoupled Structure and Dynamics


Yuchi Wang¹² Junliang Guo² Xinyi Xie²³ Tianyu He² Xu Sun¹ Jiang Bian²

¹Peking University ²Microsoft Research Asia ³CUHK (SZ)

[Paper] [arXiv] [Code]

[Cross-reenactment GIF examples: Video A (offers S. Latent), Video B (offers D. Latent), Video C (generated).]

*Video A provides the Structure Latent, Video B provides the Dynamics Latent, and Video C is the generated video combining them.



[Decoupling GIF examples: Orig., Recon., S. Recon., and D. Recon. panels for each video.]

*We present the original (Orig.) and reconstructed (Recon.) videos, along with videos decoded using only the Structure or Dynamics latents, denoted as S. Recon. and D. Recon., respectively.


Abstract

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.

Method

Details of our model. After obtaining the latent \(z\) from the Encoder, the process branches into two flows. The Structure Latent extraction module, \(\mathcal{F}_{\boldsymbol{S}}\), which consists of a Q-Former and convolutional networks, extracts the Structure Latent component \(z_{\boldsymbol{S}}\). The Dynamics Latent extraction module, \(\mathcal{F}_{\boldsymbol{D}}\), comprising convolutional networks and an averaging operator, extracts the Dynamics Latent component \(z_{\boldsymbol{D}}\). Finally, using the decoding module, we align all latents to the same dimension and combine them before passing them into the Decoder.
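
To make the two branches concrete, below is a minimal PyTorch sketch of the extraction modules described above. The tensor shapes, channel sizes, and module internals (e.g., approximating the Q-Former with learned queries and a single cross-attention layer, and using 1x1 convolutions for downsampling) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn


class StructureExtractor(nn.Module):
    """F_S (sketch): learned queries cross-attend to per-frame tokens
    (Q-Former-style), then a channel-reducing projection discards
    redundant content detail."""

    def __init__(self, dim=256, num_queries=16, out_dim=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.down = nn.Conv1d(dim, out_dim, kernel_size=1)

    def forward(self, z):                                      # z: (B, T, N, C) encoder tokens
        B, T, N, C = z.shape
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)    # (B*T, Q, C)
        kv = z.reshape(B * T, N, C)
        s, _ = self.cross_attn(q, kv, kv)                      # low-frequency structure tokens
        s = self.down(s.transpose(1, 2)).transpose(1, 2)       # (B*T, Q, out_dim)
        return s.reshape(B, T, -1, s.shape[-1])                # z_S: (B, T, Q, out_dim)


class DynamicsExtractor(nn.Module):
    """F_D (sketch): a lightweight convolution over tokens, then averaging
    over the spatial axis, keeping a small per-frame motion vector."""

    def __init__(self, dim=256, out_dim=8):
        super().__init__()
        self.conv = nn.Conv1d(dim, out_dim, kernel_size=1)

    def forward(self, z):                                      # z: (B, T, N, C)
        B, T, N, C = z.shape
        d = self.conv(z.reshape(B * T, N, C).transpose(1, 2))  # (B*T, out_dim, N)
        d = d.mean(dim=-1)                                     # average over spatial positions
        return d.reshape(B, T, -1)                             # z_D: (B, T, out_dim)


# Toy usage: an 8-frame clip encoded into 64 tokens of width 256 per frame.
z = torch.randn(2, 8, 64, 256)
z_S, z_D = StructureExtractor()(z), DynamicsExtractor()(z)

The key design point the sketch tries to convey is the asymmetry of the two branches: the Structure branch keeps a few query tokens per frame but compresses channels, while the Dynamics branch collapses the spatial axis entirely and keeps only a compact per-frame vector.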

Additional Decoupling Examples


[Additional decoupling GIF examples: Orig., Recon., S. Recon., and D. Recon. panels for each video.]

To investigate the roles of the Structure Latent and Dynamics Latent, we decode them individually by passing each through the decoder, i.e., generating results from \(\mathcal{D}(u_{\boldsymbol{S}})\) and \(\mathcal{D}(u_{\boldsymbol{D}})\). The results reveal that the Structure Latent captures the main semantic content of the video, while the Dynamics Latent captures fine-grained details, such as color and rapid local movements. Notably, in many cases, videos generated using the Dynamics Latent exhibit faster movements compared to those generated with the Structure Latent. This observation highlights the distinct contributions of low-frequency and high-frequency motion information in the latent representations.
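
As a rough illustration of how these single-latent decodings could be obtained, the sketch below zeroes out one branch before decoding. The helper names (align_s, align_d, decode) and the additive combination of the aligned latents are assumptions made for illustration, not the actual VidTwin interface.

import torch

def decode_structure_only(model, z_S, z_D):
    # Keep the Structure branch; replace the Dynamics contribution with zeros.
    u_S = model.align_s(z_S)                    # align to the decoder dimension
    u_D = torch.zeros_like(model.align_d(z_D))  # suppress Dynamics information
    return model.decode(u_S + u_D)              # ~ D(u_S)

def decode_dynamics_only(model, z_S, z_D):
    # Keep the Dynamics branch; replace the Structure contribution with zeros.
    u_S = torch.zeros_like(model.align_s(z_S))
    u_D = model.align_d(z_D)
    return model.decode(u_S + u_D)              # ~ D(u_D)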



Additional Cross-Reenactment Examples


[Additional cross-reenactment GIF examples: Video A (offers S. Latent), Video B (offers D. Latent), Video C (generated).]

We conduct a cross-reenactment experiment by combining the Structure Latent from one video, A, with the Dynamics Latent from another video, B, to generate the output from the decoder, represented as \(\mathcal{D}(u^A_{\boldsymbol{S}}, u^B_{\boldsymbol{D}})\). The generated video typically inherits the main object and overall structure from Video A, which provides the Structure Latent, while local details, such as color, come from Video B, which provides the Dynamics Latent. Notably, the movement in the generated video combines the rapid rotation from Video B with the gradual camera motion from Video A.
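
A hedged sketch of this cross-reenactment setup is given below: encode both videos, take the aligned Structure latent from Video A and the aligned Dynamics latent from Video B, and decode them together. As in the previous sketch, the method names on model and the additive combination are hypothetical.

def cross_reenact(model, video_a, video_b):
    z_a = model.encode(video_a)                             # shared encoder backbone
    z_b = model.encode(video_b)
    u_S_a = model.align_s(model.extract_structure(z_a))     # u^A_S from video A
    u_D_b = model.align_d(model.extract_dynamics(z_b))      # u^B_D from video B
    return model.decode(u_S_a + u_D_b)                      # ~ D(u^A_S, u^B_D)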



Reconstruction Examples


[Reconstruction GIF examples: Orig. and Recon. panels for each video.]

We present the original video (Orig.) alongside videos reconstructed by our model (Recon.), demonstrating the model's ability to effectively capture intricate details and preserve rapid motion dynamics, such as the light trails of a fast-moving car.



[Comparison GIF examples: Orig., iVideoGPT, MAGVIT-v2, EMU3, CMD, and VidTwin (Ours) reconstructions for each video.]

We also present the reconstruction performance of our model compared to the baselines. Two examples are showcased: a gradually rotating photo and a fast-motion boxing scene. VidTwin demonstrates superior ability in reconstructing fine details and accurately capturing rapid motion.



Citation

@misc{wang2024vidtwinvideovaedecoupled,
      title={VidTwin: Video VAE with Decoupled Structure and Dynamics}, 
      author={Yuchi Wang and Junliang Guo and Xinyi Xie and Tianyu He and Xu Sun and Jiang Bian},
      year={2024},
      eprint={2412.17726},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.17726}, 
}