Qwen/Qwen-VL

simonJJJ committed on Aug 22, 2023

Commit eb9e7e6 (1 parent: 99bbed2)

update

Files changed (1)

README.md +0 -6

@@ -30,12 +30,6 @@ inference: false
 - **First generalist model support grounding in Chinese**: Detecting bounding boxes through open-domain language expression in both Chinese and English.
 - **Fine-grained recognization and understanding**: Compared to the 224 resolution currently used by other open-source LVLM, the 448 resolution promotes fine-grained text recognition, document QA, and bounding box annotation.
-<br>
-<p align="center">
-    <img src="assets/demo_vl.gif" width="400"/>
-<p>
-<br>
 We release two models of the Qwen-VL series:
 - Qwen-VL: The pre-trained LVLM model uses Qwen-7B as the initialization of the LLM, and [Openclip ViT-bigG](https://github.com/mlfoundations/open_clip) as the initialization of the visual encoder. And connects them with a randomly initialized cross-attention layer. Qwen-VL was trained on about 1.5B image-text paired data. The final image input resolution is 448.
 - Qwen-VL-Chat: A multimodal LLM-based AI assistant, which is trained with alignment techniques.
