This repository contains a pruned and isolated pipeline for Stage 2 of StreamingT2V, dubbed "VidXTend."
This model's primary purpose is extending 16-frame, 256×256 pixel animations by 8 frames at a time (one second at 8 fps).
@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
Usage
Installation
First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to also install a version of torch built with CUDA support; otherwise the pipeline will run on CPU only.
pip install git+https://github.com/painebenjamin/vidxtend.git
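If you need a CUDA-enabled torch build, install it before running the command above. For example, for CUDA 12.1 (adjust the index URL for your CUDA version):
pip install torch --index-url https://download.pytorch.org/whl/cu121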
Command-Line
A command-line utility, vidxtend, is installed with the package.
Usage: vidxtend [OPTIONS] VIDEO PROMPT
Run VidXtend on a video file, concatenating the generated frames to the end
of the video.
Options:
-fps, --frame-rate INTEGER Video FPS. Will default to the input FPS.
-s, --seconds FLOAT The total number of seconds to add to the
video. Multiply this number by frame rate to
determine total number of new frames
generated. [default: 1.0]
-np, --negative-prompt TEXT Negative prompt for the diffusion process.
-cfg, --guidance-scale FLOAT Guidance scale for the diffusion process.
[default: 7.5]
-ns, --num-inference-steps INTEGER
Number of diffusion steps. [default: 50]
-r, --seed INTEGER Random seed.
-m, --model TEXT HuggingFace model name.
-nh, --no-half Do not use half precision.
-no, --no-offload Do not offload to the CPU to preserve GPU
memory.
-ns, --no-slicing Do not use VAE slicing.
-g, --gpu-id INTEGER GPU ID to use.
-sf, --model-single-file Download and use a single file instead of a
directory.
-cf, --config-file TEXT Config file to use when using the model-
single-file option. Accepts a path or a
filename in the same directory as the single
file. Will download from the repository
passed in the model option if not provided.
[default: config.json]
-mf, --model-filename TEXT The model file to download when using the
model-single-file option. [default:
vidxtend.safetensors]
-rs, --remote-subfolder TEXT Remote subfolder to download from when using
the model-single-file option.
-cd, --cache-dir DIRECTORY Cache directory to download to. Default uses
the huggingface cache.
-o, --output FILE Output file. [default: output.mp4]
-f, --fit [actual|cover|contain|stretch]
Image fit mode. [default: cover]
-a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
Image anchor point. [default: top-left]
--help Show this message and exit.
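For example, a hypothetical invocation (the input filename and prompt below are placeholders) that appends two seconds to a clip might look like:
vidxtend input.mp4 "a timelapse of clouds rolling over a mountain ridge" --seconds 2.0 --output extended.mp4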
Python
You can create the pipeline, automatically pulling the weights from this repository, either from the individual model components:
import torch
from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
Or, as a single file:
import torch
from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_single_file(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
Use these methods to reduce memory usage and improve performance:
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
pipeline.set_use_memory_efficient_attention_xformers()
Usage is as follows:
# Assume `images` is a list of PIL Images (the video so far) and `prompt` is a text prompt string
new_frames = pipeline(
    prompt=prompt,
    negative_prompt=None,                 # Optionally use a negative prompt
    image=images[-8:],                    # Use the final 8 frames of the video
    input_frames_conditioning=images[:1], # Use the first frame of the video
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil"
).frames[8:] # Drop the first 8 output frames; they are the guide frames passed as input
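Each call adds 8 frames (one second at 8 fps), so longer extensions are produced by calling the pipeline repeatedly on the growing frame list. A minimal sketch, assuming `images` already holds the starting 16 PIL frames at 256×256 and `prompt` is defined:
# Hypothetical sketch: extend the clip by 2 seconds (2 passes of 8 frames at 8 fps)
seconds_to_add = 2
for _ in range(seconds_to_add):
    result = pipeline(
        prompt=prompt,
        negative_prompt=None,
        image=images[-8:],                    # last 8 frames of the current video
        input_frames_conditioning=images[:1], # first frame anchors overall appearance
        eta=1.0,
        guidance_scale=7.5,
        output_type="pil",
    )
    images += result.frames[8:]  # keep only the 8 newly generated frames
The extended images list can then be written out with any video library; the command-line utility handles this step for you.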