@vladbogo on Hugging Face: "Meta Reality Labs has developed Lumos, a system that merges Multimodal Large…"

vladbogo

posted an update Feb 14

Meta Reality Labs has developed Lumos, a system that merges Multimodal Large Language Models (MM-LLMs) with Scene Text Recognition (STR) to boost the efficiency of various tasks such as multimodal question-answering and text summarization.

Key aspects of Lumos include:

* Hybrid Computing: Utilizes a combination of on-device and cloud computing to process inputs, aiming to reduce latency.
* STR Components:
  * Region of Interest (ROI) Detection: Focuses on text-rich areas within images for optimized text extraction.
  * Text Detection and Recognition: Ensures high-quality text recognition within the ROI.
  * Reading Order Reconstruction: Arranges recognized text to mimic natural reading order, essential for context understanding.
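
To make the reading order step more concrete, here is a minimal sketch of one common heuristic: group words into lines by vertical position, then sort each line left to right. This is not the method from the Lumos paper. The `Word` class, the `reading_order` function, and the `line_tolerance` parameter are all hypothetical names made up for illustration, and the sketch assumes simple left-to-right, top-to-bottom text with axis-aligned boxes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container for one recognized word and its bounding box."""
    text: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    height: float  # box height


def reading_order(words, line_tolerance=0.5):
    """Return the recognized text as a list of lines in reading order.

    Assumption: a word belongs to the current line when its vertical center
    is within `line_tolerance` box-heights of the line's first word.
    """
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        center = word.y + word.height / 2
        if lines:
            ref = lines[-1][0]
            ref_center = ref.y + ref.height / 2
            if abs(center - ref_center) <= line_tolerance * ref.height:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Within each line, order words from left to right.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]


if __name__ == "__main__":
    detected = [
        Word("World", x=60, y=0, height=10),
        Word("Hello", x=0, y=1, height=10),
        Word("Again", x=0, y=20, height=10),
    ]
    print(reading_order(detected))
```

A heuristic like this breaks down on multi-column layouts or rotated text, which is presumably part of why a dedicated component is useful.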

Lumos reports 80% accuracy on question-answering benchmarks and a low word error rate in text recognition.
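
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the recognized text and the reference text, divided by the number of reference words. Below is a small sketch of the standard definition. The function name `word_error_rate` is chosen here for illustration, and splitting on whitespace is an assumption; the paper's exact evaluation setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown dog"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many extra words.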

Paper: Lumos: Empowering Multimodal LLMs with Scene Text Recognition (2402.08017)

Congrats to the authors for their work!

QuantumResearch

Feb 20

Cool. How to access it?

vladbogo

Feb 20

As far as I could find, it’s not yet available. Hopefully the authors will release it soon 🤞
