# Flash-VStream Model Card

## Model details
We propose Flash-VStream, a video-language model that simulates the memory mechanism of humans. It can process extremely long video streams in real time while responding to user queries.
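As a rough illustration of how one might fetch the checkpoint for local use (not part of the original card), the sketch below uses `huggingface_hub`'s `snapshot_download`; the repository id shown is a placeholder assumption, and the actual inference pipeline lives in the project's own codebase.

```python
# Minimal sketch, assuming the weights are hosted on the Hugging Face Hub.
from huggingface_hub import snapshot_download

# Download all files of the checkpoint to a local directory.
local_dir = snapshot_download(
    repo_id="IVGSZ/Flash-VStream-7b",  # placeholder repo id (assumption)
    local_dir="./flash-vstream-7b",
)
print(f"Checkpoint downloaded to {local_dir}")
```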
## Training data
This model is trained on image data from the LLaVA-1.5 dataset and video data from the WebVid and ActivityNet datasets, following LLaMA-VID. The training data includes:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
- 232K video-caption pairs sampled from the WebVid 2.5M dataset.
- 98K videos from ActivityNet with QA pairs from Video-ChatGPT.
## License
This project is licensed under the LLAMA 2 License.