IDEA-CCNL/Ziya-Visual-Lyrics-14B

LinhIcey committed on Dec 24, 2023

Commit d54fbe3 · 1 Parent(s): 9958185

Update README.md

Files changed (1)

README.md +25 -1

@@ -17,4 +17,28 @@ metrics:
 ## Brief Introduction
-Ziya-Visual
+**Lyrics** is a Large Vision Language Model (LVLM) developed by IDEA CCNL. Across its two training stages, pre-training (vision-language representation alignment) and instruction fine-tuning (generative learning from vision to language), Lyrics builds a visual refiner that extracts local visual features and concrete spatial representations. The refiner consists of image tagging (RAM), object detection (Grounding DINO), and semantic segmentation (SAM) modules.
+Lyrics takes images, text, and visual objects as input, and produces text and spatial representations of visual objects as output. The model has strong fine-grained visual feature extraction and understanding capabilities, and handles a range of visual-centric tasks, including multi-turn visual conversation, visual scene understanding and reasoning, commonsense-grounded image description, and referential dialogue.
+## Requirements
+* Python 3.8 or later
+* PyTorch 1.12 or later
+* CUDA 11.3 or later is recommended (applies to GPU users)
+### Zero-shot Image Captioning & General VQA
+![](assets/image_caption_vqa.jpg)
+- In Image Captioning, Lyrics outperforms LVLMs of the same size on the COCO, Nocaps (0-shot), and Flickr30K (0-shot) datasets, achieving **SOTA** results.
+- In General VQA, Lyrics achieves **SOTA** results on four datasets and is on par with Qwen-VL on the VizWiz dataset.
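
The version requirements listed in the README can be checked before loading the model. The following is a minimal sketch, assuming only the three minimums stated there (Python 3.8, PyTorch 1.12, CUDA 11.3); the helper names `parse_version`, `meets`, and `check_environment` are illustrative and are not part of the repository.

```python
import re
import sys

# Minimum versions taken from the README's requirements list.
MIN_PYTHON = (3, 8)
MIN_TORCH = (1, 12)
MIN_CUDA = (11, 3)


def parse_version(text):
    """Turn a string such as '1.12.1+cu113' into (1, 12, 1)."""
    parts = []
    # Drop any local build suffix ('+cu113') before splitting on dots.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets(found, minimum):
    """Return True if the version string `found` is at least `minimum`."""
    return parse_version(found) >= minimum


def check_environment():
    """Report which requirements hold; None means 'not applicable'."""
    report = {"python": tuple(sys.version_info[:2]) >= MIN_PYTHON}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so CUDA cannot be inspected either.
        report["pytorch"] = False
        report["cuda"] = None
        return report
    report["pytorch"] = meets(torch.__version__, MIN_TORCH)
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = meets(cuda, MIN_CUDA) if cuda else None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        status = "n/a" if ok is None else ("ok" if ok else "too old or missing")
        print(f"{name}: {status}")
```

A `cuda` entry of `None` is not treated as a failure here, since the README only recommends CUDA for GPU users.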