LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This requires a series of systematic optimizations spanning model architecture, data construction, and training strategy, particularly to address challenges such as performance degradation as the number of images grows and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, construct data that captures both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only delivers competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Notably, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
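To make the hybrid idea concrete, the sketch below interleaves linear-time SSM-style blocks with occasional Transformer attention blocks over a long token sequence. This is a minimal, illustrative PyTorch sketch, not the LongLLaVA implementation: `SimpleSSMBlock` is a toy stand-in for a real Mamba layer, and the layer ratio and dimensions are assumptions.

```python
# Minimal, illustrative sketch of a hybrid Mamba/Transformer-style stack.
# NOT the LongLLaVA implementation: SimpleSSMBlock is a toy stand-in for a
# selective state-space (Mamba) layer; the 3:1 SSM-to-attention ratio and
# all dimensions are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSMBlock(nn.Module):
    """Toy gated linear recurrence standing in for a Mamba (SSM) layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)
        state = torch.zeros_like(h[:, 0])
        outs = []
        for t in range(h.size(1)):  # linear-time scan over the sequence
            state = a * state + (1 - a) * h[:, t]
            outs.append(state)
        y = torch.stack(outs, dim=1) * F.silu(gate)
        return x + self.out_proj(y)  # residual connection


class AttentionBlock(nn.Module):
    """Standard pre-norm Transformer block."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer(x)


class HybridStack(nn.Module):
    """Mostly SSM blocks, with an attention block every `attn_every` layers."""

    def __init__(self, d_model: int = 512, n_layers: int = 8, attn_every: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0
            else SimpleSSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(1, 1024, 512)   # e.g. visual tokens from many images
    print(HybridStack()(tokens).shape)   # torch.Size([1, 1024, 512])
```

The appeal of such a layout is that most layers scale linearly with sequence length, which is what makes very long multi-image contexts affordable in memory and compute.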
Community
- We introduce LongLLaVA, a solution optimized across data construction, training strategy, and multi-modal architecture that effectively balances performance and efficiency. To the best of our knowledge, this is the first hybrid architecture for MLLMs.
- LongLLaVA demonstrates exceptional performance in multi-modal long-context understanding, excelling in retrieval, counting, and ordering tasks.
- In our commitment to transparency and community research, we will open-source all models, code, and datasets associated with LongLLaVA.
- Code: https://github.com/FreedomIntelligence/LongLLaVA
- Model: https://huggingface.co/FreedomIntelligence/LongLLaVA
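For the checkpoint linked above, a hypothetical loading sketch is shown below. It assumes the repository ships `transformers`-compatible remote code (`trust_remote_code=True`) and that the model loads via `AutoModelForCausalLM`; check the model card for the exact loading instructions and prompt format.

```python
# Hypothetical loading sketch for the released checkpoint.
# Assumption: the repo provides transformers-compatible remote code;
# verify the actual usage on the model card before relying on this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FreedomIntelligence/LongLLaVA"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # lower memory for long multi-image contexts
    device_map="auto",
    trust_remote_code=True,
)
```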
The following similar papers were recommended by the Semantic Scholar API:
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (2024)
- MMR: Evaluating Reading Ability of Large Multimodal Models (2024)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation (2024)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model (2024)
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges (2024)