Abstract
We introduce Wuerstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, rendering real-time applications more viable. A key advantage of our method is its modest training requirement of only 9,200 GPU hours, slashing the usual costs significantly without compromising end performance. In a comparison against the state of the art, we found our approach to be strongly competitive. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.
Community
@dome272, in Figure 6 you are showing inference times for different batch sizes. Two questions:
- 1.) Which hardware did you use (GPU / CPU)?
- 2.) How does this compare to SD? How much faster/slower is Wuerstchen compared to SD?
Also, do we really need 60 sampling steps for the prior? If we could get this down to something like 20, this model would be super fast.
> Also, do we really need 60 sampling steps for the prior? If we could get this down to something like 20, this model would be super fast.
We are running some experiments specifically focused on reducing the required number of sampling steps. We have already improved the speed of stage B (the upsampler) quite a bit, and we're trying to see if the same approach could help reduce the number of sampling steps of stage C (the text2img prior) 🤗
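A minimal sketch of what varying the prior's step count looks like from the user side, assuming the public diffusers integration of Wuerstchen; the `warp-ai/wuerstchen` checkpoint and the `prior_num_inference_steps` argument come from the diffusers pipeline, not from the paper, so treat those names as assumptions:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Combined text2img pipeline: stage C prior + stage B upsampler + stage A VQGAN.
pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

prompt = "an astronaut riding a horse, highly detailed"

# 60 prior steps: the default discussed in this thread.
baseline = pipe(prompt, prior_num_inference_steps=60).images[0]

# Hypothetical fast setting: if quality holds at ~20 steps,
# stage C sampling becomes roughly 3x cheaper.
fast = pipe(prompt, prior_num_inference_steps=20).images[0]
```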
> @dome272, in Figure 6 you are showing inference times for different batch sizes. Two questions:
> - 1.) Which hardware did you use (GPU / CPU)?
> - 2.) How does this compare to SD? How much faster/slower is Wuerstchen compared to SD?
Hey Patrick,
- It's an A100.
- The speed is similar to SD (a rough timing sketch follows below), but there is probably a lot left to optimize that could make this model extremely fast. We are working on it!
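For a rough like-for-like measurement, one could time both models through the same diffusers interface. This is a sketch under assumptions: the `warp-ai/wuerstchen` and `runwayml/stable-diffusion-v1-5` checkpoint IDs and `AutoPipelineForText2Image` come from the public diffusers releases, and the numbers it prints are hardware-dependent, not the paper's Figure 6 values.

```python
import time
import torch
from diffusers import AutoPipelineForText2Image

def mean_latency(model_id: str, prompt: str, n_runs: int = 3, **kwargs) -> float:
    """Average seconds per image over n_runs, after one warm-up call."""
    pipe = AutoPipelineForText2Image.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    pipe(prompt, **kwargs)  # warm-up, excluded from timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, **kwargs)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / n_runs
    del pipe  # free GPU memory before loading the next model
    torch.cuda.empty_cache()
    return elapsed

prompt = "a photograph of a lighthouse at sunset"
# Pin the resolution so both models do a comparable amount of work,
# independent of their different defaults.
for model_id in ("warp-ai/wuerstchen", "runwayml/stable-diffusion-v1-5"):
    t = mean_latency(model_id, prompt, height=1024, width=1024)
    print(f"{model_id}: {t:.2f}s / image")
```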
An immediate way to do that could be the use of torch.compile() and token merging. I know the latter might lead to visual quality degradation (but a smaller token ratio doesn't hurt much).
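A sketch of the torch.compile() suggestion, again assuming the diffusers combined pipeline; the `prior_prior` attribute name for the stage C denoiser is an assumption about how that pipeline registers its modules, so adjust it to whatever module your pipeline actually exposes:

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

# Compile the stage C (prior) denoiser, which dominates the sampling loop.
# The first call pays a one-time compilation cost; later calls run faster.
# NOTE: `prior_prior` is an assumed attribute name; see the note above.
pipe.prior_prior = torch.compile(pipe.prior_prior, mode="reduce-overhead")

image = pipe("an astronaut riding a horse").images[0]
```

As for token merging: the tomesd package implements it for Stable Diffusion's UNet attention, so carrying it over to Wuerstchen's prior would need an adapted patch; the quality-vs-ratio trade-off mentioned above comes from that SD experience.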