---
license: cc-by-4.0
datasets:
- FreedomIntelligence/ALLaVA-4V
pipeline_tag: image-text-to-text
library_name: prismcaptioner
---
<br>
# PrismCaptioner Model Card
**Model details**

PrismCaptioners are open-source captioners with the LLaVA architecture, finetuned on the GPT4V-assisted dataset [ALLaVA](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). We have released [PrismCaptioner-7B](https://huggingface.co/Yuxuan-Qiao/PrismCaptioner-7B) and [PrismCaptioner-2B](https://huggingface.co/Yuxuan-Qiao/PrismCaptioner-2B).

PrismCaptioner-7B details:

- **Vision Backbone:** google/siglip-so400m-patch14-384
- **Language Backbone:** internlm/internlm2-7b
- **Dataset:** 1x ALLaVA-Caption-[LAION/VFLAN]

**Paper and codebase for more information:**

[[Paper](https://arxiv.org/abs/2406.14544)] [[Code](https://github.com/SparksJoe/Prism)]

**Intended uses**

- **Perception Module:** The model can be integrated into [Prism](https://github.com/SparksJoe/Prism) as a perception module to solve vision-language tasks by utilizing an external LLM.
- **Effective Captioner:** The model can produce high-quality captions for given images.
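
The checkpoint is hosted on this Hugging Face repository, so the weights can also be fetched ahead of time. A minimal sketch, assuming `huggingface_hub` is installed; the `local_dir` path is just an example:

```python
# Optional: pre-download the PrismCaptioner-7B weights from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; local_dir below is an arbitrary example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Yuxuan-Qiao/PrismCaptioner-7B",
    local_dir="checkpoints/PrismCaptioner-7B",
)
```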
**Model usage**

Clone the [Prism](https://github.com/SparksJoe/Prism) repo and complete the [preparation](https://github.com/SparksJoe/Prism/tree/main?tab=readme-ov-file#preparation) steps. You can then use PrismCaptioners following the [usage](https://github.com/SparksJoe/Prism/blob/main/README.md#usage) instructions or the demo below.
```python
# In the Prism repo folder
from decouple import supported_VLM

# Load PrismCaptioner-7B from Prism's registry of supported VLMs
model = supported_VLM['prismcaptioner-7b']()

# generate() takes [image_path, prompt] and returns the generated caption
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
print(res)
```
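
To illustrate the "Perception Module" use above, the caption can be handed to an external reasoning LLM, which answers the actual question from the text alone. The snippet below is only a sketch continuing from the demo; `external_llm` is a hypothetical placeholder for whatever LLM client you prefer (it is not part of the Prism API), and Prism's own pipeline wires this decoupling up for you.

```python
# Sketch of the perception-module pattern: caption the image, then let an external LLM reason.
# `external_llm` is a hypothetical stand-in for your own LLM client.
def external_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

question = 'How many people are in the image?'
caption = model.generate(['assets/case1.png',
                          'Please provide a detailed description of the image.'])
answer = external_llm(f"Image description: {caption}\nQuestion: {question}\nAnswer:")
print(answer)
```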