---
license: cc-by-4.0
datasets:
- FreedomIntelligence/ALLaVA-4V
pipeline_tag: image-text-to-text
library_name: prismcaptioner
---

# PrismCaptioner Model Card

**Model details**

PrismCaptioners are open-source captioners built on the LLaVA architecture and fine-tuned on the GPT4V-assisted dataset [ALLaVA](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). We have released [PrismCaptioner-7B](https://huggingface.co/Yuxuan-Qiao/PrismCaptioner-7B) and [PrismCaptioner-2B](https://huggingface.co/Yuxuan-Qiao/PrismCaptioner-2B).

PrismCaptioner-7B details (see the component sketch below):
- **Vision Backbone:** google/siglip-so400m-patch14-384
- **Language Backbone:** internlm/internlm2-7b
- **Dataset:** 1x ALLaVA-Caption-[LAION/VFLAN]
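
For a quick sense of the pieces involved, the two backbones listed above can be loaded on their own with Hugging Face `transformers`. This is only a sketch of the components, not how the released checkpoint is loaded: PrismCaptioner combines them LLaVA-style inside the `prismcaptioner` library and is used through the Prism repo as shown under **Model usage**.

```python
# Sketch only: load the two backbones named above to inspect them.
# The released PrismCaptioner checkpoint combines them LLaVA-style and is
# loaded through the Prism repo, not through this snippet.
from transformers import AutoModel, AutoModelForCausalLM

vision_backbone = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
language_backbone = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-7b", trust_remote_code=True
)
print(type(vision_backbone).__name__, type(language_backbone).__name__)
```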

**Paper and codebase for more information:**
[[Paper](https://arxiv.org/abs/2406.14544)] [[Code](https://github.com/SparksJoe/Prism)]

**Intended uses**
- **Perception Module:** The model can be integrated into [Prism](https://github.com/SparksJoe/Prism) as a perception module to solve vision-language tasks with the help of an external LLM (see the sketch after this list).
- **Effective Captioner:** The model can produce high-quality captions for given images.
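
A minimal sketch of the perception-module use case, assuming the Prism repo is prepared as described below. `run_external_llm` is a hypothetical stand-in for whatever text-only LLM client you use; it is not part of the Prism API.

```python
# Hedged sketch: decoupled perception (captioning) + reasoning (external LLM).
# `run_external_llm` is a hypothetical callable, not part of Prism.
from decouple import supported_VLM  # same import as in the demo below

def answer_with_caption(image_path: str, question: str, run_external_llm) -> str:
    captioner = supported_VLM['prismcaptioner-7b']()
    # Perception: turn the image into a detailed textual description.
    caption = captioner.generate([
        image_path,
        'Given the image below, please provide a detailed description of what you see.',
    ])
    # Reasoning: a text-only LLM answers the question from the caption alone.
    prompt = f"Image description:\n{caption}\n\nQuestion: {question}\nAnswer:"
    return run_external_llm(prompt)
```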

**Model usage**

Clone the [Prism](https://github.com/SparksJoe/Prism) repo and complete the [preparation](https://github.com/SparksJoe/Prism/tree/main?tab=readme-ov-file#preparation) steps. You can then use PrismCaptioners following the [usage](https://github.com/SparksJoe/Prism/blob/main/README.md#usage) instructions or the demo below.

```python
# In the Prism repo folder
from decouple import supported_VLM

# Load PrismCaptioner-7B through Prism's model registry.
model = supported_VLM['prismcaptioner-7b']()
# Generate a detailed caption for the example image.
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
```
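
`res` should hold the generated caption as plain text; in the Prism pipeline, this description is what gets passed to the external LLM for reasoning.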