license_name: gemma-terms
license_link: https://ai.google.dev/gemma/terms
language:
- en
LLaVA-Gemma Model Card
This model card corresponds to the 2B version of the model with the CLIP-based vision encoder.
Overview
llava-gemma-2b
is a large multimodal model (LMM) trained using the LLaVA-v1.5 framework with the 2-billion parameter google/gemma-2b-it
model as language backbone.
Uses
The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.
Bias, Risks, and Limitations
This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
How to Get Started with the Model
Using the LLaVA-Gemma models currently requires a custom fork of the LLaVA
library. We will release converted checkpoints compatible with the HuggingFace implementation of LLaVA shortly.
Training Details
The llava-gemma-2b
model was trained on 8 Gaudi 2 accelerators.
Training Data
The model was trained using the LLaVA-v1.5 data mixture.
This is listed as follows:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
Evaluation
LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | TextVQA | ScienceQA Image | MMVP |
---|---|---|---|---|---|---|---|---|---|---|---|---|
gemma-2b-it | CLIP | Yes | 0.531 | 236.071 | 1130.492 | 17.706 | 0.850 | 0.839 | 70.65 | 28.06 | 0.564 | 0.287 |
gemma-2b-it | CLIP | No | 0.481 | 247.857 | 934.611 | 13.119 | 0.784 | 0.762 | 61.74 | | 0.549 | 0.180 |
gemma-7b-it | CLIP | Yes | 0.472 | 253.571 | 894.910 | 18.165 | 0.848 | 0.829 | 68.7 | | 0.625 | 0.327 |
gemma-7b-it | CLIP | No | 0.472 | 278.214 | 857.274 | 19.083 | 0.782 | 0.734 | 65.09 | | 0.636 | 0.240 |
gemma-2b-it | DinoV2 | Yes | 0.587 | 307.143 | 1132.970 | 19.128 | 0.853 | 0.838 | 71.37 | 12.53 | 0.555 | 0.227 |
gemma-2b-it | DinoV2 | No | 0.501 | 308.929 | 959.351 | 14.541 | 0.793 | 0.772 | 61.65 | 11.1 | 0.568 | 0.180 |