CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Paper: arXiv:2408.06072
Note
1. The initial steps of the denoising process are critical for defining the generated video. In the base model, temporal and spatial changes occur simultaneously, so the video evolves along both dimensions together.
2. Spatial information is constructed earlier than temporal information. Specifically, with S-Director, the attention maps show that the structural outlines of the final video appear much earlier in denoising than they do with temporal control.
3. concatenate the noisy latent