Update README.md
README.md
@@ -17,4 +17,28 @@ metrics:
## Brief Introduction

**Lyrics** is a Large Vision Language Model (LVLM) developed by IDEA CCNL. Across its two training stages, pre-training (vision-language representation alignment) and instruction fine-tuning (vision-to-language generative learning), Lyrics constructs a visual refiner to extract local visual features and concrete spatial representations. The visual refiner consists of image tagging (RAM), object detection (Grounding DINO), and semantic segmentation (SAM) modules.

Lyrics accepts images, text, and visual objects as input, and produces text and spatial representations of visual objects as output. The model has strong fine-grained visual feature extraction and understanding capabilities and handles a range of vision-centric tasks, including multi-turn visual conversation, visual scene understanding and reasoning, commonsense-grounded image description, and referential dialogue.
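
The three-module composition of the visual refiner can be pictured with a small sketch. Everything below is hypothetical and for illustration only: the `refine` function, the `VisualObject` type, the stub stages, and in particular the wiring (tags feed the detector, boxes feed the segmenter) are assumptions, not the Lyrics implementation or the real RAM, Grounding DINO, or SAM APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class VisualObject:
    """One refined object: a tag, where it is, and a placeholder mask."""
    label: str
    box: Box
    mask: object


def refine(
    image: object,
    tagger: Callable[[object], List[str]],
    detector: Callable[[object, List[str]], List[Tuple[str, Box]]],
    segmenter: Callable[[object, Box], object],
) -> List[VisualObject]:
    """Chain tagging, detection, and segmentation (assumed wiring)."""
    tags = tagger(image)                       # image tagging stage
    objects = []
    for label, box in detector(image, tags):   # object detection stage
        mask = segmenter(image, box)           # semantic segmentation stage
        objects.append(VisualObject(label, box, mask))
    return objects


# Stub stages standing in for the real models (assumptions, not real APIs).
def fake_tagger(image):
    return ["dog", "ball"]


def fake_detector(image, tags):
    return [(tag, (0.0, 0.0, 1.0, 1.0)) for tag in tags]


def fake_segmenter(image, box):
    return f"mask for {box}"


objects = refine("image.jpg", fake_tagger, fake_detector, fake_segmenter)
for obj in objects:
    print(obj.label, obj.box)
```

Passing the stages in as callables keeps the sketch independent of any particular model library; real modules would replace the stubs.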

## Requirements

* Python 3.8 or later
* PyTorch 1.12 or later
* CUDA 11.3 or later is recommended (for GPU users)
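
The version floors above can be checked before running the model. The following is a minimal sketch, not part of the Lyrics codebase: the helper names `parse_version` and `check_environment` are made up for illustration, and the CUDA check is skipped on builds where `torch.version.cuda` is `None` (for example CPU-only installs).

```python
import importlib.util
import re
import sys

# Minimum versions taken from the requirements list.
MIN_PYTHON = (3, 8)
MIN_TORCH = (1, 12)
MIN_CUDA = (11, 3)


def parse_version(text):
    """Turn a version string such as '1.12.1+cu113' into (1, 12, 1)."""
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def check_environment():
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.8 or later is required")
    # Avoid an ImportError so the check also runs where PyTorch is absent.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems
    import torch

    if parse_version(torch.__version__) < MIN_TORCH:
        problems.append(f"PyTorch 1.12 or later is required, found {torch.__version__}")
    # torch.version.cuda is None when PyTorch was built without CUDA support.
    cuda = torch.version.cuda
    if cuda is not None and parse_version(cuda) < MIN_CUDA:
        problems.append(f"CUDA 11.3 or later is recommended, found {cuda}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result only means the version floors are met; it does not confirm that a GPU is actually visible to PyTorch.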

### Zero-shot Image Captioning & General VQA

![](assets/image_caption_vqa.jpg)

- In image captioning, Lyrics outperforms LVLMs of comparable size on the COCO, Nocaps (0-shot), and Flickr30K (0-shot) datasets, achieving **SOTA** results.
- In general VQA, Lyrics achieves **SOTA** results on four datasets and performs on par with Qwen-VL on the VizWiz dataset.