TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models • Paper • 2410.23266 • Published 11 days ago • 19 upvotes
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents • Paper • 2410.23218 • Published 11 days ago • 43 upvotes
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution • Paper • 2409.12961 • Published Sep 19 • 23 upvotes
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines • Paper • 2409.12959 • Published Sep 19 • 36 upvotes
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning • Paper • 2409.12568 • Published Sep 19 • 47 upvotes
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person • Paper • 2407.16224 • Published Jul 23 • 24 upvotes
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models • Paper • 2405.15738 • Published May 24 • 43 upvotes
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models • Paper • 2405.01535 • Published May 2 • 116 upvotes
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report • Paper • 2405.00732 • Published Apr 29 • 118 upvotes
Introducing Idefics2: A Powerful 8B Vision-Language Model for the community • Article • Published Apr 15 • 165 upvotes
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • Paper • 2404.05719 • Published Apr 8 • 80 upvotes
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction • Paper • 2404.02905 • Published Apr 3 • 64 upvotes
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • Paper • 2404.02258 • Published Apr 2 • 104 upvotes
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? • Paper • 2404.03411 • Published Apr 4 • 8 upvotes