Happy to share our recent work. We noticed that image resolution plays an important role, both in improving multimodal large language model (MLLM) performance and in Sora-style any-resolution encoder-decoders. We hope this work can help lift the 224x224 resolution limit in ViT.
ViTAR: Vision Transformer with Any Resolution (2403.18361)