Vision - a henern Collection

Vision

updated Sep 21

Video/Image/Gif/etc.

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Paper • 2402.17177 • Published Feb 27 • 88

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Paper • 2402.17485 • Published Feb 27 • 188

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

Paper • 2403.00522 • Published Mar 1 • 44

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Paper • 2403.04692 • Published Mar 7 • 40

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Paper • 2311.12793 • Published Nov 21, 2023 • 18

FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Paper • 2403.17008 • Published Mar 25 • 19

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 85

Depth Anything V2

Paper • 2406.09414 • Published Jun 13 • 92

Vision language models are blind

Paper • 2407.06581 • Published Jul 9 • 82

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1 • 107

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Paper • 2408.01800 • Published Aug 3 • 77

Imagen 3

Paper • 2408.07009 • Published Aug 13 • 61

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16 • 97

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22 • 117

CogVLM2: Visual Language Models for Image and Video Understanding

Paper • 2408.16500 • Published Aug 29 • 56

StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation

Paper • 2409.12576 • Published Sep 19 • 15