Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
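To make the three task types concrete, here is a minimal sketch of what individual training records could look like. The field names and contents below are illustrative assumptions in the common LLaVA conversation style ("video", "conversations", "from", "value"), not the released schema.

```python
# Hypothetical records illustrating the dataset's three task types.
# Field names follow the common LLaVA instruction-tuning convention and
# are assumptions for illustration, not the released schema.
detailed_caption = {
    "video": "clips/example_00042.mp4",  # hypothetical path
    "conversations": [
        {"from": "human", "value": "<video>\nPlease describe this video in detail."},
        {"from": "gpt", "value": "A chef dices an onion on a wooden board, then ..."},
    ],
}

open_ended_qa = {
    "video": "clips/example_00042.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhat does the chef do after dicing the onion?"},
        {"from": "gpt", "value": "The chef slides the diced onion into a heated pan."},
    ],
}

multiple_choice_qa = {
    "video": "clips/example_00042.mp4",
    "conversations": [
        {"from": "human", "value": (
            "<video>\nWhat ingredient is cut first?\n"
            "A. Onion\nB. Carrot\nC. Pepper\nD. Garlic\n"
            "Answer with the option's letter."
        )},
        {"from": "gpt", "value": "A"},
    ],
}
```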
Community
Project page: https://llava-vl.github.io/blog/2024-09-30-llava-video/
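For readers who want to browse the data once released, a minimal sketch using the `datasets` library is shown below. The hub ID, config, and split names are assumptions; check the project page for the actual identifiers of the released subsets.

```python
from datasets import load_dataset

# Stream one record from the dataset. The identifiers below are assumptions:
ds = load_dataset(
    "lmms-lab/LLaVA-Video-178K",  # assumed Hugging Face hub ID
    "0_30_s_academic_v0_1",       # assumed config: one video-length/source subset
    split="caption",              # assumed split holding detailed captions
    streaming=True,               # stream records instead of downloading everything
)
print(next(iter(ds)))  # one video-instruction record
```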
Hugging Face Demo
Demo: https://huggingface.co/spaces/Tonic/Llava-Video
Credit: https://x.com/josephpollack/status/1842253368749678921
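Since the demo is a Gradio Space, it can also be queried programmatically with `gradio_client`. The sketch below only inspects the Space's API; the endpoint names and parameters are Space-specific and not documented here, so list them before calling `predict`.

```python
from gradio_client import Client

# Connect to the public demo Space and list its callable endpoints.
# Inspect the printed signatures before calling client.predict(...),
# since the exact inputs of this Space are not documented here.
client = Client("Tonic/Llava-Video")
client.view_api()  # prints each endpoint with its input/output signature
```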
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input (2024)
- Visual Context Window Extension: A New Perspective for Long Video Understanding (2024)
- Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner (2024)
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges (2024)
- Question-Answering Dense Video Events (2024)