CoLLaVO: Crayon Large Language and Vision mOdel
Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards versatile general-purpose models. Yet, it remains unexplored whether current VLMs genuinely possess object-level image understanding, i.e., whether they can answer questions such as 'What objects are in the image?' or 'Which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on Vision Language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with a crayon prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding so it is not forgotten during visual instruction tuning, thereby achieving a significant leap on numerous zero-shot VL benchmarks.
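To make the crayon-prompt idea concrete, here is a minimal sketch of how a visual prompt might be built from a panoptic color map: each segment id is assigned a distinct "crayon" color and alpha-blended over the image. The function name, palette scheme, and blending weight are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def crayon_prompt(image, panoptic_map, alpha=0.5, seed=0):
    """Hypothetical sketch: blend a per-segment color map over an image.

    image: (H, W, 3) float array with values in [0, 1]
    panoptic_map: (H, W) int array of segment ids
    """
    rng = np.random.default_rng(seed)
    segment_ids = np.unique(panoptic_map)
    # Assign each panoptic segment a distinct "crayon" color.
    palette = {i: rng.random(3) for i in segment_ids}
    color_map = np.zeros_like(image)
    for i in segment_ids:
        color_map[panoptic_map == i] = palette[i]
    # Alpha-blend the color map onto the image as a visual prompt.
    return (1 - alpha) * image + alpha * color_map

# Toy example: a uniform gray 4x4 image split into two segments.
img = np.full((4, 4, 3), 0.5)
seg = np.zeros((4, 4), dtype=int)
seg[:, 2:] = 1
prompted = crayon_prompt(img, seg)
```

The prompted image keeps the original shape and value range, so it can be fed to the vision encoder in place of the raw image.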
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding (2024)
- GroundingGPT: Language Enhanced Multi-modal Grounding Model (2024)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey (2023)
- Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study (2024)
- Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis (2024)
Yes! We are preparing the code first and will then upload the hosted model to a Hugging Face Space. We are also preparing a follow-up large language and vision model for stronger performance, which we plan to upload at the same time. Thanks for your interest!
Models citing this paper 1