CoLLaVO: Crayon Large Language and Vision mOdel
Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards versatile general-purpose models. Yet, it remains unexplored whether current VLMs genuinely possess object-level image understanding, i.e., whether they can answer questions such as 'What objects are in the image?' or 'Which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on Vision Language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with a crayon prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding so it is not forgotten during visual instruction tuning, thereby achieving a significant leap on numerous zero-shot VL benchmarks.
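To make the crayon-prompt idea concrete, here is a minimal sketch of how a visual prompt might be built from a panoptic color map: each segment id is assigned a distinct "crayon" color and alpha-blended over the image. The function name, palette scheme, and blending weight are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def crayon_prompt(image, panoptic_map, alpha=0.5, seed=0):
    """Hypothetical sketch: blend a per-segment color map over an image.

    image: (H, W, 3) float array with values in [0, 1]
    panoptic_map: (H, W) int array of segment ids
    """
    rng = np.random.default_rng(seed)
    segment_ids = np.unique(panoptic_map)
    # Assign each panoptic segment a distinct "crayon" color.
    palette = {i: rng.random(3) for i in segment_ids}
    color_map = np.zeros_like(image)
    for i in segment_ids:
        color_map[panoptic_map == i] = palette[i]
    # Alpha-blend the color map onto the image as a visual prompt.
    return (1 - alpha) * image + alpha * color_map

# Toy example: a uniform gray 4x4 image split into two segments.
img = np.full((4, 4, 3), 0.5)
seg = np.zeros((4, 4), dtype=int)
seg[:, 2:] = 1
prompted = crayon_prompt(img, seg)
```

The prompted image keeps the original shape and value range, so it can be fed to the vision encoder in place of the raw image.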
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding (2024)
- GroundingGPT: Language Enhanced Multi-modal Grounding Model (2024)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey (2023)
- Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study (2024)
- Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis (2024)
Yes! We are preparing the code first and will then upload the hosted model to a Hugging Face Space. We are also preparing a follow-up large language and vision model for stronger performance, which we plan to upload at the same time. Thanks for your interest!
Models citing this paper 1