IDEA-CCNL/Ziya-Visual-Lyrics-14B

@@ -17,14 +17,21 @@ metrics:
 ## 简介 Brief Introduction
-**Lyrics** 是IDEA CCNL研发的大规模视觉语言模型（Large Vision Language Model, LVLM）。Lyrics在预训练（视觉语言的表征对齐）和指令微调（视觉到语言的生成学习）的两阶段训练过程中，构建了视觉细化器来提取局部视觉特征和具化的空间表征，其由图像标记(RAM)、目标检测(Grounding DINO)和语义分割(SAM)模块组成。
 Lyrics 可以以图像、文本、视觉对象作为输入，并以文本和视觉对象的空间表征作为输出。Lyrics模型具有强大的细粒度视觉特征提取和理解能力，能够完成各种以视觉为中心的任务，包括多回合视觉对话、视觉场景理解和推理、基于常识的图像描述、指向性问答。
-**Lyrics** is a Large Vision Language Model (LVLM) developed by IDEA CCNL. Across its two training stages, pre-training (vision-language representation alignment) and instruction fine-tuning (generative learning from vision to language), Lyrics builds a visual refiner that extracts local visual features and embodied spatial representations. The refiner consists of image tagging (RAM), object detection (Grounding DINO), and semantic segmentation (SAM) modules.
 Lyrics takes images, text, and visual objects as input and produces text and spatial representations of visual objects as output. The model has strong fine-grained visual feature extraction and understanding capabilities and can handle a range of vision-centric tasks, including multi-turn visual conversation, visual scene understanding and reasoning, commonsense-grounded image description, and referential dialogue.
 ## 安装要求 (Requirements)

 ## 简介 Brief Introduction
+**Lyrics** 是IDEA CCNL研发的大规模视觉语言模型（Large Vision Language Model, LVLM）。Lyrics在预训练（视觉语言的表征对齐）和指令微调（视觉到语言的生成学习）的两阶段训练过程中，构建了视觉细化器来提取局部视觉特征和具化的空间表征，其由图像标记(RAM)、目标检测(Grounding DINO)和语义分割(SAM)模块组成。该方法可以防止因细粒度视觉对象缺失而导致模型产生不可修复的视觉幻觉和事实错误。
 Lyrics 可以以图像、文本、视觉对象作为输入，并以文本和视觉对象的空间表征作为输出。Lyrics模型具有强大的细粒度视觉特征提取和理解能力，能够完成各种以视觉为中心的任务，包括多回合视觉对话、视觉场景理解和推理、基于常识的图像描述、指向性问答。
+![](assets/two_stage_training.png)
+**Lyrics** is a Large Vision Language Model (LVLM) developed by IDEA CCNL. Across its two training stages, pre-training (vision-language representation alignment) and instruction fine-tuning (generative learning from vision to language), Lyrics builds a visual refiner that extracts local visual features and embodied spatial representations. The refiner consists of image tagging (RAM), object detection (Grounding DINO), and semantic segmentation (SAM) modules. This can prevent the loss of fine-grained visual objects, which would otherwise lead the model to produce irreparable visual hallucinations and factual errors.
 Lyrics takes images, text, and visual objects as input and produces text and spatial representations of visual objects as output. The model has strong fine-grained visual feature extraction and understanding capabilities and can handle a range of vision-centric tasks, including multi-turn visual conversation, visual scene understanding and reasoning, commonsense-grounded image description, and referential dialogue.
+## 模型结构 Model Structure
+**Lyrics**
+![](assets/two_stage_training.png)
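
The introduction says the visual refiner is made up of image tagging (RAM), object detection (Grounding DINO), and semantic segmentation (SAM) modules. As a rough illustration only, the sketch below shows one way three such stages could be chained. The chaining order, the `refine` function, the `VisualObject` type, the box convention, and the dummy stages are all hypothetical names and assumptions made for this example; none of them come from the Lyrics code base or from the RAM, Grounding DINO, or SAM APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class VisualObject:
    """One refined object: a tag, its bounding box, and an optional mask."""
    tag: str
    box: Box
    mask: Optional[object] = None


def refine(
    image: object,
    tagger: Callable[[object], List[str]],
    detector: Callable[[object, List[str]], List[Tuple[str, Box]]],
    segmenter: Callable[[object, Box], object],
) -> List[VisualObject]:
    """Chain tagging -> detection -> segmentation (assumed order)."""
    tags = tagger(image)                # role played by an image tagger such as RAM
    detections = detector(image, tags)  # role played by a detector such as Grounding DINO
    return [
        # role played by a segmenter such as SAM, prompted with each box
        VisualObject(tag=tag, box=box, mask=segmenter(image, box))
        for tag, box in detections
    ]


# Hypothetical stand-in stages so the sketch runs without any model weights.
def dummy_tagger(image):
    return ["dog", "ball"]


def dummy_detector(image, tags):
    return [(tag, (0.0, 0.0, 1.0, 1.0)) for tag in tags]


def dummy_segmenter(image, box):
    return None


objects = refine("placeholder-image", dummy_tagger, dummy_detector, dummy_segmenter)
print([obj.tag for obj in objects])
```

Passing the stages in as callables keeps the sketch independent of any particular model library; real modules would replace the dummy functions.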
 ## 安装要求 (Requirements)