This repo contains the VPLM Dataset and pretrained checkpoints for RACCooN.

RACCooN is a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V).

RACCooN proposes a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions that capture both the broad context and object details without requiring complex human annotations. This makes precise, text-based video editing simpler for users. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. It supports video object addition, inpainting, and attribute modification within a unified framework, and outperforms existing methods on video editing and inpainting benchmarks.

Description of VPLM Dataset

Multi-Objects Description

Train: RACCooN/VPLM/gt_train.json
Test: RACCooN/VPLM/gt_test.json

Single-Object Layout Prediction

Train: RACCooN/VPLM/gt_train_layouts.json
Test: RACCooN/VPLM/gt_test_layouts.json
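
The annotation files listed above are JSON, but this README does not describe their schema. The sketch below is one way to inspect a file before writing code against it: it loads the file and reports only the top-level type, the number of entries, and the keys of the first record if that record is a dictionary. The helper name `summarize_annotations` is hypothetical, and the example assumes the repository has been downloaded so that the `RACCooN/` paths exist relative to the working directory.

```python
import json
from pathlib import Path


def summarize_annotations(path):
    """Load a JSON annotation file and summarize its top-level structure.

    No field names are assumed; the summary only reports what is found.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__, "num_entries": None, "first_keys": None}
    if isinstance(data, (list, dict)):
        summary["num_entries"] = len(data)

    # Peek at the first record, whether the file is a list or a mapping.
    first = None
    if isinstance(data, list) and data:
        first = data[0]
    elif isinstance(data, dict) and data:
        first = next(iter(data.values()))
    if isinstance(first, dict):
        summary["first_keys"] = sorted(first.keys())

    return data, summary


if __name__ == "__main__":
    # Path taken from the list above; the local root is an assumption.
    path = Path("RACCooN/VPLM/gt_train.json")
    if path.exists():
        _, summary = summarize_annotations(path)
        print(summary)
    else:
        print(f"{path} not found; download the dataset files first.")
```

Running the same helper on the train and test files of each task is a quick way to confirm that both splits share the same record layout.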

Description of Model Checkpoints

V2P

Multi-Objects Description

RACCooN/mllm_finetuned/multi_obj_projector.bin

Single-Object Description

RACCooN/mllm_finetuned/single_obj_projector.bin

Single-Object Layout Prediction

RACCooN/mllm_finetuned/layout_pred_projector.bin
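
Before wiring the V2P projector weights into a model, it can help to confirm that all three files listed above are present locally. The following is a minimal sketch using only the standard library; the task labels and the `find_missing` helper are hypothetical names introduced for illustration, and the local root directory is an assumption.

```python
from pathlib import Path

# Relative paths copied from the list above; the dictionary keys are
# illustrative labels, not identifiers defined by the repository.
V2P_PROJECTORS = {
    "multi_object_description": "RACCooN/mllm_finetuned/multi_obj_projector.bin",
    "single_object_description": "RACCooN/mllm_finetuned/single_obj_projector.bin",
    "single_object_layout_prediction": "RACCooN/mllm_finetuned/layout_pred_projector.bin",
}


def find_missing(root, files=V2P_PROJECTORS):
    """Return {task: relative_path} for every expected file absent under root."""
    root = Path(root)
    return {task: rel for task, rel in files.items() if not (root / rel).is_file()}


if __name__ == "__main__":
    missing = find_missing(".")
    if missing:
        for task, rel in missing.items():
            print(f"missing [{task}]: {rel}")
    else:
        print("all V2P projector checkpoints found")
```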

P2V

RACCooN/unet_finetuned/diffusion_pytorch_model.safetensors
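
The P2V checkpoint is stored in the safetensors format, which begins with an 8-byte little-endian header length followed by a JSON header describing every tensor. The sketch below reads only that header with the standard library, so tensor names, dtypes, and shapes can be listed without loading the weights or installing a deep learning framework. The helper name `read_safetensors_header` is hypothetical, and the example assumes the file has been downloaded to the relative path shown above.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} from a safetensors file header."""
    with Path(path).open("rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file is too short to be a safetensors file")
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", prefix)
        header = json.loads(f.read(header_len).decode("utf-8"))

    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return {name: (info["dtype"], info["shape"]) for name, info in header.items()}


if __name__ == "__main__":
    # Path taken from the list above; the local root is an assumption.
    path = Path("RACCooN/unet_finetuned/diffusion_pytorch_model.safetensors")
    if path.exists():
        tensors = read_safetensors_header(path)
        print(f"{len(tensors)} tensors")
        for name, (dtype, shape) in list(tensors.items())[:5]:
            print(name, dtype, shape)
    else:
        print(f"{path} not found; download the checkpoint first.")
```

Comparing the listed tensor names and shapes against the UNet you intend to load them into is a cheap sanity check before running inference.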