Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Abstract
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
Community
Love the project page - always great to have some video examples.
My summary: Show-1 is a hybrid model that combines pixel and latent diffusion for efficient, high-quality text-to-video generation. Each approach on its own has tradeoffs, so the authors chain the two.
Highlights from the paper:
- Pixel diffusion excels at low-res video generation precisely aligned with text
- Latent diffusion acts as efficient upsampling expert from low to high res
- Chaining the two techniques inherits the benefits of both: Show-1 achieves strong alignment and quality with far less inference memory (15 GB vs. 72 GB)
- The key is using pixel diffusion for the initial low-resolution stage. This retains alignment with text descriptions.
- Latent diffusion then serves as a super-resolution expert, upsampling efficiently while preserving fidelity.
By blending complementary techniques, Show-1 moves past the tradeoffs limiting the individual models.
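For intuition, here is a minimal sketch of the two-stage cascade in Python. The function names, shapes, and placeholder bodies are my own illustrative assumptions, not the actual Show-1 API; the real pipeline with the pretrained pixel and latent VDM checkpoints lives in the linked GitHub repo.

```python
# Illustrative sketch of the Show-1 style hybrid cascade (not the official API).
# The two "models" below are stand-in stubs so the control flow runs on its own.
import torch
import torch.nn.functional as F

def pixel_vdm_keyframes(prompt: str, num_frames: int = 8, h: int = 64, w: int = 40) -> torch.Tensor:
    # Stage 1 (assumed shapes): a pixel-space video diffusion model denoises
    # directly in RGB at low resolution, which the paper credits with strong
    # text-video alignment. A real model would run an iterative denoising loop
    # conditioned on `prompt`; here we just return placeholder frames.
    return torch.rand(num_frames, 3, h, w)

def latent_vdm_upsample(low_res: torch.Tensor, prompt: str, scale: int = 4) -> torch.Tensor:
    # Stage 2 (assumed): a latent-space diffusion model acts as the "expert
    # translation" upsampler, refining the low-res video at high resolution
    # cheaply because it operates on compressed latents conditioned on `prompt`.
    # Placeholder: plain interpolation stands in for latent denoising + VAE decode.
    return F.interpolate(low_res, scale_factor=scale, mode="bilinear", align_corners=False)

prompt = "a panda playing guitar on a beach"
low_res = pixel_vdm_keyframes(prompt)            # cheap draft video, well aligned with the prompt
high_res = latent_vdm_upsample(low_res, prompt)  # efficient upsampling to the target resolution
print(low_res.shape, high_res.shape)             # (8, 3, 64, 40) -> (8, 3, 256, 160)
```

The point of the split is that the alignment-critical denoising happens on small pixel-space frames, while the memory-hungry high-resolution work happens on compressed latents, which is where the reported 15 GB vs. 72 GB inference-memory gap comes from.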
More details are in the paper, which includes links to example generations.
Finally, open source T2V! 🤩
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (2023)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation (2023)
- Dual-Stream Diffusion Net for Text-to-Video Generation (2023)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation (2023)
- ModelScope Text-to-Video Technical Report (2023)