-
How Far Are We from Intelligent Visual Deductive Reasoning?
Paper • 2403.04732 • Published • 18 -
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75 -
DragAnything: Motion Control for Anything using Entity Representation
Paper • 2403.07420 • Published • 13 -
Learning and Leveraging World Models in Visual Representation Learning
Paper • 2403.00504 • Published • 31
Collections
Discover the best community collections!
Collections including paper arxiv:2403.07420
-
DragAnything: Motion Control for Anything using Entity Representation
Paper • 2403.07420 • Published • 13 -
Capabilities of Gemini Models in Medicine
Paper • 2404.18416 • Published • 22 -
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Paper • 2406.08407 • Published • 24
-
Video as the New Language for Real-World Decision Making
Paper • 2402.17139 • Published • 18 -
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Paper • 2310.19512 • Published • 15 -
VideoMamba: State Space Model for Efficient Video Understanding
Paper • 2403.06977 • Published • 27 -
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Paper • 2401.09047 • Published • 13
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 25 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 12 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 38 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 19
-
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
Paper • 2401.04468 • Published • 47 -
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Paper • 2401.09047 • Published • 13 -
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning
Paper • 2402.00769 • Published • 20 -
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Paper • 2402.03161 • Published • 14