OpenGVLab/InternVL-Chat-V1-5

@@ -12,38 +12,42 @@ pipeline_tag: visual-question-answering
 # Model Card for InternVL-Chat-V1.5
-\[[Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
 ## Model Details
-- **Model Type:** vision large language model, multimodal chatbot
 - **Model Stats:**
   - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
   - Params: 25.5B
-  - Image size: dynamic resolution, up to 40 tiles of 448 x 448 during inference.
-  - Number of visual tokens: 256 * (number of tiles + 1)
 - **Training Strategy:**
   - Pretraining Stage
     - Learnable Component: ViT + MLP
-    - Data: TODO
   - SFT Stage
     - Learnable Component: ViT + MLP + LLM
-    - Data: TODO
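
The visual-token formula on the removed side of this diff, 256 * (number of tiles + 1), can be checked with a short calculation. The following is a minimal sketch, not code from the InternVL repository: the function and constant names are hypothetical, and the card does not say what the extra "+ 1" unit is (it is likely a downscaled overview of the whole image, but that is an assumption).

```python
# Hypothetical helper illustrating the formula stated in the model card:
#   visual tokens = 256 * (number of tiles + 1)
TOKENS_PER_TILE = 256  # implied by the formula in the card
MAX_TILES = 40         # inference-time maximum stated in the old card text


def visual_token_count(num_tiles: int) -> int:
    """Return the number of visual tokens for an image split into num_tiles tiles."""
    if not 1 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be between 1 and {MAX_TILES}, got {num_tiles}")
    # The "+ 1" accounts for one extra 256-token unit beyond the tiles themselves.
    return TOKENS_PER_TILE * (num_tiles + 1)


if __name__ == "__main__":
    for tiles in (1, 12, 40):
        print(tiles, visual_token_count(tiles))
```

Under these assumptions, a single tile costs 512 visual tokens and the 40-tile maximum costs 10,496, which is useful when estimating how much of the LLM context window an image will consume.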
 ## Model Usage
-We provide a minimal code example to run InternVL-Chat using only the `transformers` library.
 You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
-Note: If you encounter the error `ImportError: This modeling file requires the following packages that were not found in your environment: fastchat`, please run `pip install fschat`.
 ```python
 import json
 import os
-from internvl.model.internvl_chat import InternVLChatModel
 from transformers import AutoTokenizer, AutoModel
 from tqdm import tqdm
 import torch

 # Model Card for InternVL-Chat-V1.5
+<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/AjPIKaxKLZCbzQRrPELPB.webp" alt="Image Description" width="300" height="300">
+\[[Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[Chinese Explainer](https://zhuanlan.zhihu.com/p/675877376)\]
+| Model                   | Date       | Download                                                               | Note                                                                                                                                                          |
+| ----------------------- | ---------- | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| InternVL-Chat-V1.5      | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)      | supports 4K images; strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥new)        |
+| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) | more SFT data; stronger performance                                                                                                                           |
+| InternVL-Chat-V1.2      | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)      | scales the LLM up to 34B                                                                                                                                      |
+| InternVL-Chat-V1.1      | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)      | supports Chinese; stronger OCR                                                                                                                                |
 ## Model Details
+- **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
   - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
+  - Image size: dynamic resolution, up to 32 tiles of 448 x 448 (4K resolution) during inference.
   - Params: 25.5B
 - **Training Strategy:**
   - Pretraining Stage
     - Learnable Component: ViT + MLP
+    - Data: Please see our technical report.
   - SFT Stage
     - Learnable Component: ViT + MLP + LLM
+    - Data: Please see our technical report.
 ## Model Usage
+We provide example code to run InternVL-Chat-V1.5 using `transformers`.
 You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 ```python
 import json
 import os
 from transformers import AutoTokenizer, AutoModel
 from tqdm import tqdm
 import torch