ControlNeXt: Powerful and Efficient Control for Image and Video Generation
Abstract
Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers introduce additional architectures, such as ControlNet, Adapters, and ReferenceNet, to integrate conditioning controls. However, current controllable generation methods often require substantial additional computational resources, especially for video generation, and face challenges in training or exhibit weak control. In this paper, we propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a more straightforward and efficient architecture, replacing the heavy additional branches with a lightweight module that adds minimal cost over the base model. This concise structure also allows our method to seamlessly integrate with other LoRA weights, enabling style alteration without additional training. As for training, we reduce the number of learnable parameters by up to 90% compared with the alternatives. Furthermore, we propose Cross Normalization (CN) as a replacement for Zero-Convolution to achieve fast and stable training convergence. We have conducted various experiments with different base models across images and videos, demonstrating the robustness of our method.
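To make the Cross Normalization idea concrete, here is a minimal PyTorch sketch, not the authors' reference implementation: the control branch's features are normalized with the main branch's statistics before injection, so their distributions match from the start of training. The function name, tensor shapes, and reduction dimensions are assumptions for illustration.

```python
import torch

def cross_normalization(x_main: torch.Tensor,
                        x_control: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """Illustrative sketch of Cross Normalization.

    Assumes (batch, channels, height, width) tensors; statistics are
    reduced over all non-batch dimensions. Not the reference code.
    """
    dims = tuple(range(1, x_main.dim()))          # reduce over C, H, W
    mu_m = x_main.mean(dim=dims, keepdim=True)    # main-branch mean
    sigma_m = x_main.std(dim=dims, keepdim=True)  # main-branch std
    # Shift/scale the control features into the main branch's feature
    # distribution, playing the stabilizing role that zero-convolutions
    # play in ControlNet.
    return (x_control - mu_m) / (sigma_m + eps)
```

Under this reading, the normalized control features are then added to the main-branch features at the injection layer, which is why the statistics must come from the main branch rather than the control branch.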
Community
Hi, I noticed that there might be a mistake in Eq. 8 and Eq. 9. \mu_c should be \mu_m and \sigma_c should be \sigma_m.
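For reference, the correction this comment proposes would read as follows (a hedged reconstruction, since Eq. 8 and Eq. 9 are not reproduced here; subscripts m and c denote main-branch and control-branch features, and epsilon is an assumed numerical-stability term):

```latex
% Statistics come from the main branch (subscript m),
% not the control branch (subscript c):
\hat{x}_c = \frac{x_c - \mu_m}{\sqrt{\sigma_m^2 + \epsilon}}
```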
Thanks for sharing our work! :D
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning (2024)
- Image Conductor: Precision Control for Interactive Video Synthesis (2024)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control (2024)
- Video-Infinity: Distributed Long Video Generation (2024)
- VEnhancer: Generative Space-Time Enhancement for Video Generation (2024)