vladbogo 
posted an update Feb 14
Meta Reality Labs has developed Lumos, a system that merges Multimodal Large Language Models (MM-LLMs) with Scene Text Recognition (STR) to boost the efficiency of various tasks such as multimodal question-answering and text summarization.

Key aspects of Lumos include:

* Hybrid Computing: Utilizes a combination of on-device and cloud computing to process inputs, aiming to reduce latency.
* STR Components:
    * Region of Interest (ROI) Detection: Focuses on text-rich areas within images for optimized text extraction.
    * Text Detection and Recognition: Ensures high-quality text recognition within the ROI.
    * Reading Order Reconstruction: Arranges recognized text to mimic natural reading order, essential for context understanding.
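
To make the reading order step more concrete, here is a minimal sketch of one common heuristic: group recognized words into lines by vertical position, then read each line left to right. This is not the method from the Lumos paper, just an illustration. The `Word` class, the `reconstruct_reading_order` function, and the `line_tolerance` parameter are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with a simple bounding box (hypothetical structure)."""
    text: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    height: float  # box height


def reconstruct_reading_order(words, line_tolerance=0.5):
    """Group words into lines by vertical center, then order each line by x.

    A word joins the current line when its vertical center is within
    line_tolerance * height of the first word on that line.
    """
    lines = []  # each entry is a list of words on the same visual line
    for word in sorted(words, key=lambda w: w.y):
        center = word.y + word.height / 2
        if lines:
            ref = lines[-1][0]
            ref_center = ref.y + ref.height / 2
            if abs(center - ref_center) <= line_tolerance * ref.height:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Within each line, read left to right
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]


# Words arrive in arbitrary order, as a detector might emit them
detected = [
    Word("world", 60, 11, 10),
    Word("Hello", 0, 10, 10),
    Word("reads", 50, 31, 10),
    Word("Lumos", 0, 30, 10),
]
print(reconstruct_reading_order(detected))
```

A heuristic like this assumes a single column of roughly horizontal, left-to-right text. Multi-column layouts, rotated text, or right-to-left scripts would need something more involved.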

According to the paper, Lumos reaches 80% accuracy on question-answering benchmarks, along with a low word error rate in text recognition.

Paper: Lumos : Empowering Multimodal LLMs with Scene Text Recognition (2402.08017)

Congrats to the authors for their work!

Cool. How to access it?

As far as I could find it’s not yet available. Hopefully the authors will release it soon 🤞