	---
	license_name: gemma-terms
	license_link: https://ai.google.dev/gemma/terms
	language:
	- en
	---

	# LLaVA-Gemma Model Card

	_This model card corresponds to the 2B version of the model with the CLIP-based vision encoder._

	## Overview

	`llava-gemma-2b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744), with the 2-billion-parameter `google/gemma-2b-it` model as the language backbone.

	## Uses

	The model has been fine-tuned for multimodal benchmark evaluations, but it can also be used as a multimodal chatbot.


	## Bias, Risks, and Limitations

	This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.


	## How to Get Started with the Model

	Using the LLaVA-Gemma models currently requires a custom fork of the [`LLaVA`](https://github.com/haotian-liu/LLaVA) library. _We will release converted checkpoints compatible with the HuggingFace implementation of LLaVA shortly._




	## Training Details

	The `llava-gemma-2b` model was trained on 8 Gaudi 2 accelerators.


	### Training Data

	The model was trained using the LLaVA-v1.5 data mixture, which consists of:

	- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
	- 158K GPT-generated multimodal instruction-following data.
	- 450K academic-task-oriented VQA data mixture.
	- 40K ShareGPT data.
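To give a rough sense of the mixture's proportions, the listed counts can be summed and compared. This is a minimal sketch: the dictionary keys are shorthand labels chosen here, and treating the four sources as a single pool is an assumption made purely for illustration, since the card does not say how the data is split across training stages.

```python
# Sample counts in thousands, copied from the list above.
# The key names are illustrative labels, not official dataset identifiers.
data_mixture_k = {
    "LAION/CC/SBU image-text pairs (BLIP captions)": 558,
    "GPT-generated multimodal instruction data": 158,
    "Academic-task-oriented VQA": 450,
    "ShareGPT": 40,
}


def mixture_shares(counts):
    """Return each source's fraction of the summed counts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_k = sum(data_mixture_k.values())
shares = mixture_shares(data_mixture_k)

print(f"Total listed samples: {total_k}K")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Summed this way, the listed sources come to 1,206K samples, with the image-text pairs making up a little under half of the total.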


	## Evaluation

	| LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | TextVQA | ScienceQA Image | MMVP |
	| ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | ------- | --------------- | ----- |
	| gemma-2b-it | CLIP | Yes | 0.531 | 236.071 | 1130.492 | 17.706 | 0.850 | 0.839 | 70.65 | 28.06 | 0.564 | 0.287 |
	| gemma-2b-it | CLIP | No | 0.481 | 247.857 | 934.611 | 13.119 | 0.784 | 0.762 | 61.74 | | 0.549 | 0.180 |
	| gemma-7b-it | CLIP | Yes | 0.472 | 253.571 | 894.910 | 18.165 | 0.848 | 0.829 | 68.7 | | 0.625 | 0.327 |
	| gemma-7b-it | CLIP | No | 0.472 | 278.214 | 857.274 | 19.083 | 0.782 | 0.734 | 65.09 | | 0.636 | 0.240 |
	| gemma-2b-it | DinoV2 | Yes | 0.587 | 307.143 | 1132.970 | 19.128 | 0.853 | 0.838 | 71.37 | 12.53 | 0.555 | 0.227 |
	| gemma-2b-it | DinoV2 | No | 0.501 | 308.929 | 959.351 | 14.541 | 0.793 | 0.772 | 61.65 | 11.1 | 0.568 | 0.180 |
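One way to read the table is to compare the rows with and without a pretrained connector for a fixed backbone and vision model. The sketch below does this for the two `gemma-2b-it` + CLIP rows; the variable and function names are illustrative, and TextVQA is left out because one of the two cells is empty in the table.

```python
# Scores copied from the two gemma-2b-it + CLIP rows of the table above.
# TextVQA is omitted: the "No" row has no value for it.
with_pretrained_connector = {
    "GQA": 0.531,
    "MME cognition": 236.071,
    "MME perception": 1130.492,
    "MM-Vet": 17.706,
    "POPE accuracy": 0.850,
    "POPE F1": 0.839,
    "VQAv2": 70.65,
    "ScienceQA Image": 0.564,
    "MMVP": 0.287,
}
without_pretrained_connector = {
    "GQA": 0.481,
    "MME cognition": 247.857,
    "MME perception": 934.611,
    "MM-Vet": 13.119,
    "POPE accuracy": 0.784,
    "POPE F1": 0.762,
    "VQAv2": 61.74,
    "ScienceQA Image": 0.549,
    "MMVP": 0.180,
}


def score_deltas(a, b):
    """Return a[k] - b[k], rounded to 3 decimals, for benchmarks present in both."""
    return {k: round(a[k] - b[k], 3) for k in a if k in b}


deltas = score_deltas(with_pretrained_connector, without_pretrained_connector)
improved = [name for name, d in deltas.items() if d > 0]

for name, d in deltas.items():
    print(f"{name}: {d:+.3f}")
```

For this pair of rows, the pretrained connector scores higher on every listed benchmark except MME cognition. The deltas are raw differences on each benchmark's own scale, so they are not comparable across columns.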