GLIGEN: Open-Set Grounded Text-to-Image Generation
improved visual quality as the rough concept location and outline are decided in the early stages, followed by fine-grained details in later stages.
Note: As stated in Eq. (8) and Eq. (10), we can schedule inference-time sampling by setting β to 1 (use the extra grounding information) or 0 (reduce to the original pretrained diffusion model). This lets our model exploit different knowledge at different stages. Fig. 7 qualitatively shows the benefits of our scheduled sampling with τ set to 0.2.
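The schedule can be illustrated with a minimal sketch. It assumes that sampling steps are indexed in the order they are run, and that the first τ fraction of them uses the grounding information (β = 1) while the remainder falls back to the pretrained model (β = 0). The function name `beta_schedule` and the step count of 50 are hypothetical choices for illustration, not part of the method's specification.

```python
from typing import List


def beta_schedule(num_steps: int, tau: float) -> List[float]:
    """Return one beta value per sampling step, in execution order.

    Assumption: the first `tau` fraction of steps is grounded (beta = 1),
    and the remaining steps use the original pretrained model (beta = 0).
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    # Number of early steps that receive grounding information.
    grounded_steps = round(tau * num_steps)
    return [1.0 if i < grounded_steps else 0.0 for i in range(num_steps)]


# Illustrative setting: tau = 0.2 as in the text, 50 steps as an assumed budget.
schedule = beta_schedule(num_steps=50, tau=0.2)
```

Under these assumptions, τ = 1 grounds every step and τ = 0 recovers the original pretrained diffusion model throughout.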