Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Abstract
Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.
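Since the abstract notes that select model weights and inference code are already available, here is a minimal, non-official inference sketch. It assumes the released checkpoints are hosted on the Hugging Face Hub under a repo id like `allenai/Molmo-7B-D-0924` and ship custom code exposing `processor.process(...)` and `model.generate_from_batch(...)` via `trust_remote_code`; the repo id and these entry points are assumptions, so follow the released model card for the actual API.

```python
# Minimal inference sketch (not the official example). The repo id and the
# processor.process / generate_from_batch entry points are assumptions about
# the released custom code; see https://molmo.allenai.org for the real API.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo_id = "allenai/Molmo-7B-D-0924"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Prepare one image and a prompt; add the batch dimension the model expects.
image = Image.open(requests.get("https://picsum.photos/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate a response; the stop string here is an assumption.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens.
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```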
Community
Really interesting work! However, its strong performance seems biased toward academic datasets: AI2D and the document benchmarks score well, while MathVista and MMMU, which require more reasoning, are much lower than Qwen2-VL. I would like to see results on more challenging benchmarks such as MM-Vet, MM-Vet-v2, MathVerse, LLaVA-Wilder, and so on.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (2024)
- NVLM: Open Frontier-Class Multimodal LLMs (2024)
- LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! (2024)
- SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2024)
- Building and better understanding vision-language models: insights and future directions (2024)