TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models • Paper • 2410.23266 • Published 11 days ago • 19 upvotes
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents • Paper • 2410.23218 • Published 11 days ago • 43 upvotes
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution • Paper • 2409.12961 • Published Sep 19 • 23 upvotes
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines • Paper • 2409.12959 • Published Sep 19 • 36 upvotes
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning • Paper • 2409.12568 • Published Sep 19 • 47 upvotes
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person • Paper • 2407.16224 • Published Jul 23 • 24 upvotes
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models • Paper • 2405.15738 • Published May 24 • 43 upvotes
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models • Paper • 2405.01535 • Published May 2 • 116 upvotes
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report • Paper • 2405.00732 • Published Apr 29 • 118 upvotes
Introducing Idefics2: A Powerful 8B Vision-Language Model for the community • Article • Published Apr 15 • 165 upvotes
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • Paper • 2404.05719 • Published Apr 8 • 80 upvotes
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction • Paper • 2404.02905 • Published Apr 3 • 64 upvotes
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • Paper • 2404.02258 • Published Apr 2 • 104 upvotes
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? • Paper • 2404.03411 • Published Apr 4 • 8 upvotes