|
--- |
|
license: gemma |
|
library_name: transformers |
|
extra_gated_heading: Access PaliGemma on Hugging Face |
|
extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review |
|
and agree to Google’s usage license. To do this, please ensure you’re logged-in |
|
to Hugging Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
pipeline_tag: image-text-to-text |
|
--- |
|
# PaliGemma model card |
|
|
|
**Model page:** [PaliGemma](https://ai.google.dev/gemma/docs/paligemma) |
|
|
|
Transformers PaliGemma 3B weights, pre-trained with 224×224 input images and 128-token input/output text sequences. The models are available in float32, bfloat16 and float16 formats for fine-tuning.
|
|
|
**Resources and technical documentation:** |
|
|
|
* [Responsible Generative AI Toolkit](https://ai.google.dev/responsible) |
|
* [PaliGemma on Kaggle](https://www.kaggle.com/models/google/paligemma) |
|
* [PaliGemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/363) |
|
|
|
**Terms of Use:** [Terms](https://ai.google.dev/gemma/terms) |
|
|
|
**Authors:** Google |
|
|
|
## Model information |
|
|
|
### Model summary |
|
|
|
#### Description |
|
|
|
PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by |
|
[PaLI-3](https://arxiv.org/abs/2310.09199) and based on open components such as |
|
the [SigLIP vision model](https://arxiv.org/abs/2303.15343) and the [Gemma |
|
language model](https://arxiv.org/abs/2403.08295). It takes both image and text |
|
as input and generates text as output, supporting multiple languages. It is designed for class-leading fine-tune performance on a wide range of vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.
|
|
|
#### Model architecture |
|
|
|
PaliGemma is the composition of a [Transformer |
|
decoder](https://arxiv.org/abs/1706.03762) and a [Vision Transformer image |
|
encoder](https://arxiv.org/abs/2010.11929), with a total of 3 billion |
|
parameters. The text decoder is initialized from
|
[Gemma-2B](https://www.kaggle.com/models/google/gemma). The image encoder is |
|
initialized from |
|
[SigLIP-So400m/14](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb). |
|
PaliGemma is trained following the PaLI-3 recipes. |
|
|
|
#### Inputs and outputs |
|
|
|
* **Input:** Image and text string, such as a prompt to caption the image, or |
|
a question. |
|
* **Output:** Generated text in response to the input, such as a caption of |
|
the image, an answer to a question, a list of object bounding box |
|
coordinates, or segmentation codewords. |
|
|
|
### Model data |
|
|
|
#### Pre-train datasets |
|
|
|
PaliGemma is pre-trained on the following mixture of datasets: |
|
|
|
* **WebLI:** [WebLI (Web Language Image)](https://arxiv.org/abs/2209.06794) is |
|
a web-scale multilingual image-text dataset built from the public web. A |
|
wide range of WebLI splits are used to acquire versatile model capabilities, |
|
such as visual semantic understanding, object localization, |
|
visually-situated text understanding, multilinguality, etc. |
|
* **CC3M-35L:** Curated English image-alt_text pairs from webpages ([Sharma et |
|
al., 2018](https://aclanthology.org/P18-1238/)). We used the [Google Cloud |
|
Translation API](https://cloud.google.com/translate) to translate into 34 |
|
additional languages. |
|
* **VQ²A-CC3M-35L/VQG-CC3M-35L:** A subset of VQ2A-CC3M ([Changpinyo et al., |
|
2022a](https://aclanthology.org/2022.naacl-main.142/)), translated into the |
|
same additional 34 languages as CC3M-35L, using the [Google Cloud |
|
Translation API](https://cloud.google.com/translate). |
|
* **OpenImages:** Detection and object-aware questions and answers |
|
([Piergiovanni et al. 2022](https://arxiv.org/abs/2209.04372)) generated by |
|
handcrafted rules on the [OpenImages dataset]. |
|
* **WIT:** Images and texts collected from Wikipedia ([Srinivasan et al., |
|
2021](https://arxiv.org/abs/2103.01913)). |
|
|
|
[OpenImages dataset]: https://storage.googleapis.com/openimages/web/factsfigures_v7.html |
|
|
|
#### Data responsibility filtering |
|
|
|
The following filters are applied to WebLI, with the goal of training PaliGemma |
|
on clean data: |
|
|
|
* **Pornographic image filtering:** This filter removes images deemed to be of |
|
pornographic nature. |
|
* **Text safety filtering:** We identify and filter out images that are paired |
|
with unsafe text. Unsafe text is any text deemed to contain or be about |
|
CSAI, pornography, vulgarities, or otherwise offensive. |
|
* **Text toxicity filtering:** We further use the [Perspective |
|
API](https://perspectiveapi.com/) to identify and filter out images that are |
|
paired with text deemed insulting, obscene, hateful or otherwise toxic. |
|
* **Text personal information filtering:** We use the [Cloud Data Loss Prevention (DLP)
  API](https://cloud.google.com/security/products/dlp) to filter out certain personal information and other sensitive data and protect the privacy
  of individuals. Identifiers such as social security numbers and [other sensitive information types] are removed.
|
* **Additional methods:** Filtering based on content quality and safety in |
|
line with our policies and practices. |
|
|
|
[other sensitive information types]: https://cloud.google.com/sensitive-data-protection/docs/high-sensitivity-infotypes-reference
|
|
|
|
|
|
|
## How to Use |
|
|
|
PaliGemma is a single-turn vision-language model not meant for conversational use,
and it works best when fine-tuned to a specific use case.
|
|
|
You can configure which task the model will solve by conditioning it with task prefixes, |
|
such as “detect” or “segment”. The pretrained models were trained in this fashion to imbue |
|
them with a rich set of capabilities (question answering, captioning, segmentation, etc.). |
|
However, they are not designed to be used directly, but to be transferred (by fine-tuning) |
|
to specific tasks using a similar prompt structure. For interactive testing, you can use |
|
the "mix" family of models, which have been fine-tuned on a mixture of tasks. To see model |
|
[google/paligemma-3b-mix-448](https://huggingface.co/google/paligemma-3b-mix-448) in action, |
|
check [this Space that uses the Transformers codebase](https://huggingface.co/spaces/big-vision/paligemma-hf). |
|
|
|
Please refer to the [usage and limitations section](#usage-and-limitations) for intended
|
use cases, or visit the [blog post](https://huggingface.co/blog/paligemma-google-vlm) for |
|
additional details and examples. |
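To give an idea of how task prefixes translate into prompts and outputs, here is a minimal, hedged sketch. It assumes the conventions documented for PaliGemma checkpoints: prompts such as `"caption en"`, `"answer en <question>"`, `"detect <object>"` or `"segment <object>"`, and detection answers made of four `<locXXXX>` tokens (y_min, x_min, y_max, x_max on a 0–1023 grid) followed by a label. The helper below is hypothetical; verify the exact output format against your checkpoint. Segmentation outputs additionally contain `<seg>` codeword tokens, which require the decoding utilities from `big_vision` to turn into masks.

```python
import re

# Example task-prefix prompts (use a "mix" checkpoint for interactive testing):
#   "caption en"                      -> English caption
#   "answer en where is the car?"     -> visual question answering (assumed phrasing)
#   "detect car"                      -> bounding boxes encoded as <loc> tokens
#   "segment car"                     -> segmentation codewords

def parse_detection(decoded: str, image_width: int, image_height: int):
    """Parse 'detect' output of the form '<loc####><loc####><loc####><loc####> label'.

    Location tokens are assumed to encode y_min, x_min, y_max, x_max on a
    0-1023 grid, which we rescale to pixel coordinates.
    """
    boxes = []
    pattern = r"((?:<loc\d{4}>){4})\s*([^;<]+)"
    for locs, label in re.findall(pattern, decoded):
        y_min, x_min, y_max, x_max = (int(v) / 1024 for v in re.findall(r"<loc(\d{4})>", locs))
        boxes.append({
            "label": label.strip(),
            "box": (x_min * image_width, y_min * image_height,
                    x_max * image_width, y_max * image_height),
        })
    return boxes
```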
|
|
|
## Use in Transformers |
|
|
|
The following snippets use the model `google/paligemma-3b-mix-224` for reference purposes.
The model in the repository you are browsing may have been trained for other tasks, so please
make sure you use inputs appropriate for the task at hand.
|
|
|
### Running the default precision (`float32`) on CPU |
|
|
|
```python |
|
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration |
|
from PIL import Image |
|
import requests |
|
import torch |
|
|
|
model_id = "google/paligemma-3b-mix-224" |
|
|
|
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval() |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
|
|
# Instruct the model to create a caption in Spanish |
|
prompt = "caption es" |
|
model_inputs = processor(text=prompt, images=image, return_tensors="pt") |
|
input_len = model_inputs["input_ids"].shape[-1] |
|
|
|
with torch.inference_mode(): |
|
generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False) |
|
generation = generation[0][input_len:] |
|
decoded = processor.decode(generation, skip_special_tokens=True) |
|
print(decoded) |
|
``` |
|
|
|
Output: `Un auto azul estacionado frente a un edificio.` |
|
|
|
### Running other precisions on CUDA |
|
|
|
For convenience, the repos contain revisions of the weights already converted to `bfloat16` and `float16`, |
|
so you can use them to reduce the download size and avoid casting on your local computer. |
|
|
|
This is how you'd run `bfloat16` on an NVIDIA CUDA card.
|
|
|
```python |
|
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration |
|
from PIL import Image |
|
import requests |
|
import torch |
|
|
|
model_id = "google/paligemma-3b-mix-224" |
|
device = "cuda:0" |
|
dtype = torch.bfloat16 |
|
|
|
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
model = PaliGemmaForConditionalGeneration.from_pretrained( |
|
model_id, |
|
torch_dtype=dtype, |
|
device_map=device, |
|
revision="bfloat16", |
|
).eval() |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
|
|
# Instruct the model to create a caption in Spanish |
|
prompt = "caption es" |
|
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device) |
|
input_len = model_inputs["input_ids"].shape[-1] |
|
|
|
with torch.inference_mode(): |
|
generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False) |
|
generation = generation[0][input_len:] |
|
decoded = processor.decode(generation, skip_special_tokens=True) |
|
print(decoded) |
|
``` |
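If you prefer `float16` (for example, on GPUs without bfloat16 support), the same pattern applies. The sketch below reuses `model_id`, `device` and the imports from the snippet above, and assumes the repository also publishes a `float16` revision, as mentioned earlier in this section.

```python
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=device,
    revision="float16",  # assumes a float16 branch exists in this repo
).eval()
```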
|
|
|
### Loading in 4-bit / 8-bit |
|
|
|
You need to install `bitsandbytes` and `accelerate` to run inference using 8-bit or 4-bit precision:
|
|
|
``` |
|
pip install bitsandbytes accelerate |
|
``` |
|
|
|
```python
|
from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
|
from PIL import Image |
|
import requests |
|
import torch |
|
|
|
model_id = "google/paligemma-3b-mix-224" |
|
device = "cuda:0" |
|
dtype = torch.bfloat16 |
|
|
|
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
quantization_config = BitsAndBytesConfig(load_in_8bit=True) |
|
|
|
model = PaliGemmaForConditionalGeneration.from_pretrained( |
|
model_id, quantization_config=quantization_config |
|
).eval() |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
|
|
# Instruct the model to create a caption in Spanish |
|
prompt = "caption es" |
|
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device) |
|
input_len = model_inputs["input_ids"].shape[-1] |
|
|
|
with torch.inference_mode(): |
|
generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False) |
|
generation = generation[0][input_len:] |
|
decoded = processor.decode(generation, skip_special_tokens=True) |
|
print(decoded) |
|
``` |
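The snippet above loads the weights in 8-bit. For 4-bit loading, a `BitsAndBytesConfig` along these lines can be swapped in; this is a sketch rather than a prescription (NF4 with bfloat16 compute is a common choice, but adjust it to your hardware):

```python
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
)
```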
|
|
|
## Implementation information |
|
|
|
### Hardware |
|
|
|
PaliGemma was trained using the latest generation of Tensor Processing Unit |
|
(TPU) hardware (TPUv5e). |
|
|
|
### Software |
|
|
|
Training was done using [JAX](https://github.com/google/jax), |
|
[Flax](https://github.com/google/flax), |
|
[TFDS](https://github.com/tensorflow/datasets) and |
|
[`big_vision`](https://github.com/google-research/big_vision). |
|
|
|
JAX allows researchers to take advantage of the latest generation of hardware, |
|
including TPUs, for faster and more efficient training of large models. |
|
|
|
TFDS is used to access datasets and Flax is used for model architecture. The |
|
PaliGemma fine-tune code and inference code are released in the `big_vision` |
|
GitHub repository. |
|
|
|
## Evaluation information |
|
|
|
### Benchmark results |
|
|
|
In order to verify the transferability of PaliGemma to a wide variety of |
|
academic tasks, we fine-tune the pretrained models on each task. Additionally, we
|
train the mix model with a mixture of the transfer tasks. We report results on |
|
different resolutions to provide an impression of which tasks benefit from |
|
increased resolution. Importantly, none of these tasks or datasets are part of |
|
the pretraining data mixture, and their images are explicitly removed from the |
|
web-scale pre-training data. |
|
|
|
#### Single task (fine-tune on single task) |
|
|
|
<table> |
|
<tbody><tr> |
|
<th>Benchmark<br>(train split)</th> |
|
<th>Metric<br>(split)</th> |
|
<th>pt-224</th> |
|
<th>pt-448</th> |
|
<th>pt-896</th> |
|
</tr> |
|
<tr> |
|
<th>Captioning</th> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://cocodataset.org/#home">COCO captions</a><br>(train+restval) |
|
</td> |
|
<td>CIDEr (val)</td> |
|
<td>141.92</td> |
|
<td>144.60</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://nocaps.org/">NoCaps</a><br>(Eval of COCO<br>captions transfer) |
|
</td> |
|
<td>CIDEr (val)</td> |
|
<td>121.72</td> |
|
<td>123.58</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/pdf/2205.12522">COCO-35L</a><br>(train) |
|
</td> |
|
<td>CIDEr dev<br>(en/avg-34/avg)</td> |
|
<td> |
|
139.2<br> |
|
115.8<br> |
|
116.4 |
|
</td> |
|
<td> |
|
141.2<br> |
|
118.0<br> |
|
118.6 |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/pdf/2205.12522">XM3600</a><br>(Eval of COCO-35L transfer) |
|
</td> |
|
<td>CIDEr dev<br>(en/avg-34/avg)</td> |
|
<td> |
|
78.1<br> |
|
41.3<br> |
|
42.4 |
|
</td> |
|
<td> |
|
80.0<br> |
|
41.9<br> |
|
42.9 |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://textvqa.org/textcaps/">TextCaps</a><br>(train) |
|
</td> |
|
<td>CIDEr (val)</td> |
|
<td>127.48</td> |
|
<td>153.94</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/2110.11624">SciCap</a><br>(first sentence, no subfigure)<br>(train+val) |
|
</td> |
|
<td>CIDEr/BLEU-4<br>(test)</td> |
|
<td> |
|
162.25<br> |
|
0.192<br> |
|
</td> |
|
<td> |
|
181.49<br> |
|
0.211<br> |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/2108.03353">Screen2words</a><br>(train+dev) |
|
</td> |
|
<td>CIDEr (test)</td> |
|
<td>117.57</td> |
|
<td>119.59</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/2010.04295">Widget Captioning</a><br>(train+dev) |
|
</td> |
|
<td>CIDEr (test)</td> |
|
<td>136.07</td> |
|
<td>148.36</td> |
|
</tr> |
|
<tr> |
|
<th>Question answering</th> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://visualqa.org/index.html">VQAv2</a><br>(train+validation) |
|
</td> |
|
<td>Accuracy<br>(Test server - std)</td> |
|
<td>83.19</td> |
|
<td>85.64</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/2401.06209">MMVP</a><br>(Eval of VQAv2 transfer) |
|
</td> |
|
<td>Paired Accuracy</td> |
|
<td>47.33</td> |
|
<td>45.33</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/2305.10355">POPE</a><br>(Eval of VQAv2 transfer) |
|
</td> |
|
<td>Accuracy<br>(random/popular/<br>adversarial)</td> |
|
<td> |
|
87.80<br> |
|
85.87<br> |
|
84.27 |
|
</td> |
|
<td> |
|
88.23<br> |
|
86.77<br> |
|
85.90 |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://okvqa.allenai.org/">OKVQA</a><br>(train) |
|
</td> |
|
<td>Accuracy (val)</td> |
|
<td>63.54</td> |
|
<td>63.15</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://allenai.org/project/a-okvqa/home">A-OKVQA</a> (MC)<br>(train+val) |
|
</td> |
|
<td>Accuracy<br>(Test server)</td> |
|
<td>76.37</td> |
|
<td>76.90</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://allenai.org/project/a-okvqa/home">A-OKVQA</a> (DA)<br>(train+val) |
|
</td> |
|
<td>Accuracy<br>(Test server)</td> |
|
<td>61.85</td> |
|
<td>63.22</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://cs.stanford.edu/people/dorarad/gqa/about.html">GQA</a><br>(train_balanced+<br>val_balanced) |
|
</td> |
|
<td>Accuracy<br>(testdev balanced)</td> |
|
<td>65.61</td> |
|
<td>67.03</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://aclanthology.org/2022.findings-acl.196/">xGQA</a><br>(Eval of GQA transfer) |
|
</td> |
|
<td>Mean Accuracy<br>(bn, de, en, id,<br>ko, pt, ru, zh)</td> |
|
<td>58.37</td> |
|
<td>59.07</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://lil.nlp.cornell.edu/nlvr/">NLVR2</a><br>(train+dev) |
|
</td> |
|
<td>Accuracy (test)</td> |
|
<td>90.02</td> |
|
<td>88.93</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://marvl-challenge.github.io/">MaRVL</a><br>(Eval of NLVR2 transfer) |
|
</td> |
|
<td>Mean Accuracy<br>(test)<br>(id, sw, ta, tr, zh)</td> |
|
<td>80.57</td> |
|
<td>76.78</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://allenai.org/data/diagrams">AI2D</a><br>(train) |
|
</td> |
|
<td>Accuracy (test)</td> |
|
<td>72.12</td> |
|
<td>73.28</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://scienceqa.github.io/">ScienceQA</a><br>(Img subset, no CoT)<br>(train+val) |
|
</td> |
|
<td>Accuracy (test)</td> |
|
<td>95.39</td> |
|
<td>95.93</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://zenodo.org/records/6344334">RSVQA-LR</a> (Non numeric)<br>(train+val) |
|
</td> |
|
<td>Mean Accuracy<br>(test)</td> |
|
<td>92.65</td> |
|
<td>93.11</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://zenodo.org/records/6344367">RSVQA-HR</a> (Non numeric)<br>(train+val) |
|
</td> |
|
<td>Mean Accuracy<br>(test/test2)</td> |
|
<td> |
|
92.61<br> |
|
90.58 |
|
</td> |
|
<td> |
|
92.79<br> |
|
90.54 |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/2203.10244">ChartQA</a><br>(human+aug)x(train+val) |
|
</td> |
|
<td>Mean Relaxed<br>Accuracy<br>(test_human,<br>test_aug)</td> |
|
<td>57.08</td> |
|
<td>71.36</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://vizwiz.org/tasks-and-datasets/vqa/">VizWiz VQA</a><br>(train+val) |
|
</td> |
|
<td>Accuracy<br>(Test server - std)</td> |
|
<td> |
|
73.7 |
|
</td> |
|
<td> |
|
75.52 |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/1810.12440">TallyQA</a><br>(train) |
|
</td> |
|
<td>Accuracy<br>(test_simple/<br>test_complex)</td> |
|
<td> |
|
81.72<br> |
|
69.56 |
|
</td> |
|
<td> |
|
84.86<br> |
|
72.27 |
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://ocr-vqa.github.io/">OCR-VQA</a><br>(train+val) |
|
</td> |
|
<td>Accuracy (test)</td> |
|
<td>72.32</td> |
|
<td>74.61</td> |
|
<td>74.93</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://textvqa.org/">TextVQA</a><br>(train+val) |
|
</td> |
|
<td>Accuracy<br>(Test server - std)</td> |
|
<td>55.47</td> |
|
<td>73.15</td> |
|
<td>76.48</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://www.docvqa.org/">DocVQA</a><br>(train+val) |
|
</td> |
|
<td>ANLS (Test server)</td> |
|
<td>43.74</td> |
|
<td>78.02</td> |
|
<td>84.77</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://openaccess.thecvf.com/content/WACV2022/papers/Mathew_InfographicVQA_WACV_2022_paper.pdf">Infographic VQA</a><br>(train+val) |
|
</td> |
|
<td>ANLS (Test server)</td> |
|
<td>28.46</td> |
|
<td>40.47</td> |
|
<td>47.75</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/1905.13648">SceneText VQA</a><br>(train+val) |
|
</td> |
|
<td>ANLS (Test server)</td> |
|
<td>63.29</td> |
|
<td>81.82</td> |
|
<td>84.40</td> |
|
</tr> |
|
<tr> |
|
<th>Segmentation</th> |
|
</tr> |
|
<tr> |
|
<td> |
|
<a href="https://arxiv.org/abs/1608.00272">RefCOCO</a><br>(combined refcoco, refcoco+,<br>refcocog excluding val<br>and test images) |
|
</td> |
|
<td>MIoU<br>(validation)<br>refcoco/refcoco+/<br>refcocog</td> |
|
<td> |
|
73.40<br> |
|
68.32<br> |
|
67.65 |
|
</td> |
|
<td> |
|
75.57<br> |
|
69.76<br> |
|
70.17 |
|
</td> |
|
<td> |
|
76.94<br> |
|
72.18<br> |
|
72.22 |
|
</td> |
|
</tr> |
|
<tr> |
|
<th>Video tasks (Caption/QA)</th> |
|
</tr> |
|
<tr> |
|
<td>MSR-VTT (Captioning)</td> |
|
<td>CIDEr (test)</td> |
|
<td>70.54</td> |
|
</tr> |
|
<tr> |
|
<td>MSR-VTT (QA)</td> |
|
<td>Accuracy (test)</td> |
|
<td>50.09</td> |
|
</tr> |
|
<tr> |
|
<td>ActivityNet (Captioning)</td> |
|
<td>CIDEr (test)</td> |
|
<td>34.62</td> |
|
</tr> |
|
<tr> |
|
<td>ActivityNet (QA)</td> |
|
<td>Accuracy (test)</td> |
|
<td>50.78</td> |
|
</tr> |
|
<tr> |
|
<td>VATEX (Captioning)</td> |
|
<td>CIDEr (test)</td> |
|
<td>79.73</td> |
|
</tr> |
|
<tr> |
|
<td>MSVD (QA)</td> |
|
<td>Accuracy (test)</td> |
|
<td>60.22</td> |
|
</tr> |
|
</tbody></table> |
|
|
|
#### Mix model (fine-tune on mixture of transfer tasks) |
|
|
|
<table> |
|
<tbody><tr> |
|
<th>Benchmark</th> |
|
<th>Metric (split)</th> |
|
<th>mix-224</th> |
|
<th>mix-448</th> |
|
</tr> |
|
<tr> |
|
<td><a href="https://arxiv.org/abs/2401.06209">MMVP</a></td> |
|
<td>Paired Accuracy</td> |
|
<td>46.00</td> |
|
<td>45.33</td> |
|
</tr> |
|
<tr> |
|
<td><a href="https://arxiv.org/abs/2305.10355">POPE</a></td> |
|
<td>Accuracy<br>(random/popular/adversarial)</td> |
|
<td> |
|
88.00<br> |
|
86.63<br> |
|
85.67 |
|
</td> |
|
<td> |
|
89.37<br> |
|
88.40<br> |
|
87.47 |
|
</td> |
|
</tr> |
|
</tbody></table> |
|
|
|
## Ethics and safety |
|
|
|
### Evaluation approach |
|
|
|
Our evaluation methods include structured evaluations and internal red-teaming |
|
testing of relevant content policies. Red-teaming was conducted by a number of |
|
different teams, each with different goals and human evaluation metrics. These |
|
models were evaluated against a number of different categories relevant to |
|
ethics and safety, including: |
|
|
|
* Human evaluation on prompts covering child safety, content safety and
  representational harms. See the [Gemma model
  card](https://ai.google.dev/gemma/docs/model_card#evaluation_approach) for
  more details on the evaluation approach, here applied to image captioning and
  visual question answering setups.
|
* Image-to-Text benchmark evaluation: Benchmarking against relevant academic
  datasets such as the FairFace dataset ([Karkkainen et al.,
  2021](https://arxiv.org/abs/1908.04913)).
|
|
|
### Evaluation results |
|
|
|
* The human evaluation results of ethics and safety evaluations are within |
|
acceptable thresholds for meeting [internal |
|
policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11) |
|
for categories such as child safety, content safety and representational |
|
harms. |
|
* On top of robust internal evaluations, we also use the Perspective API |
|
(threshold of 0.8) to measure toxicity, profanity, and other potential |
|
issues in the generated captions for images sourced from the FairFace |
|
dataset. We report the maximum and median values observed across subgroups |
|
for each of the perceived gender, ethnicity, and age attributes. |
|
|
|
|
|
<table> |
|
<tbody><tr> |
|
</tr></tbody><tbody><tr><th>Metric</th> |
|
<th>Perceived<br>gender</th> |
|
<th></th> |
|
<th>Ethnicity</th> |
|
<th></th> |
|
<th>Age group</th> |
|
<th></th> |
|
</tr> |
|
<tr> |
|
<th></th> |
|
<th>Maximum</th> |
|
<th>Median</th> |
|
<th>Maximum</th> |
|
<th>Median</th> |
|
<th>Maximum</th> |
|
<th>Median</th> |
|
</tr> |
|
<tr> |
|
<td>Toxicity</td> |
|
<td>0.04%</td> |
|
<td>0.03%</td> |
|
<td>0.08%</td> |
|
<td>0.00%</td> |
|
<td>0.09%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<td>Identity Attack</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<td>Insult</td> |
|
<td>0.06%</td> |
|
<td>0.04%</td> |
|
<td>0.09%</td> |
|
<td>0.07%</td> |
|
<td>0.16%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<td>Threat</td> |
|
<td>0.06%</td> |
|
<td>0.05%</td> |
|
<td>0.14%</td> |
|
<td>0.05%</td> |
|
<td>0.17%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
<tr> |
|
<td>Profanity</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
<td>0.00%</td> |
|
</tr> |
|
</tbody></table> |
|
|
|
## Usage and limitations |
|
|
|
### Intended usage |
|
|
|
Open Vision Language Models (VLMs) have a wide range of applications across |
|
various industries and domains. The following list of potential uses is not |
|
comprehensive. The purpose of this list is to provide contextual information |
|
about the possible use-cases that the model creators considered as part of model |
|
training and development. |
|
|
|
Fine-tune on specific vision-language task: |
|
|
|
* The pre-trained models can be fine-tuned on a wide range of vision-language
  tasks such as image captioning, short video captioning, visual question
  answering, text reading, object detection and object segmentation (a minimal
  input-preparation sketch follows this list).
|
* The pre-trained models can be fine-tuned for specific domains such as remote
  sensing question answering, visual questions from people who are blind,
  science question answering, or describing UI element functionalities.
|
* The pre-trained models can be fine-tuned for tasks with non-textual outputs |
|
such as bounding boxes or segmentation masks. |
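As a rough sketch of how fine-tuning inputs can be prepared with the Transformers processor, the example below builds training labels from image/prompt/target triplets via the processor's `suffix` argument (check the current `PaliGemmaProcessor` documentation for the exact behavior); the dataset fields, prompt phrasing, and training step shown here are hypothetical placeholders, not the official training recipe.

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# A pre-trained (pt) checkpoint is the intended starting point for transfer.
model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

def collate_fn(examples):
    # `examples` is a hypothetical list of dicts with PIL "image", "question", "answer" fields.
    prompts = ["answer en " + ex["question"] for ex in examples]  # task-prefix prompt (assumed phrasing)
    targets = [ex["answer"] for ex in examples]
    images = [ex["image"] for ex in examples]
    # Passing `suffix` appends the target text and lets the processor build `labels`
    # with the prompt and image positions masked out.
    return processor(text=prompts, images=images, suffix=targets,
                     return_tensors="pt", padding="longest")

# One (schematic) training step:
# batch = collate_fn(train_examples)
# loss = model(**batch).loss
# loss.backward()
```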
|
|
|
Vision-language research: |
|
|
|
* The pre-trained models and fine-tuned models can serve as a foundation for researchers to experiment with VLM |
|
techniques, develop algorithms, and contribute to the advancement of the |
|
field. |
|
|
|
### Ethical considerations and risks |
|
|
|
The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: |
|
|
|
* Bias and Fairness |
|
  * VLMs trained on large-scale, real-world image-text data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny; the input data pre-processing is described and posterior evaluations are reported in this card.
|
* Misinformation and Misuse |
|
* VLMs can be misused to generate text that is false, misleading, or harmful. |
|
* Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible). |
|
* Transparency and Accountability |
|
* This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. |
|
* A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. |
|
|
|
|
|
Risks identified and mitigations: |
|
|
|
* **Perpetuation of biases:** We encourage continuous monitoring
  (using evaluation metrics and human review) and the exploration of de-biasing
  techniques during model training, fine-tuning, and other use cases.
|
* **Generation of harmful content:** Mechanisms and guidelines for content |
|
safety are essential. Developers are encouraged to exercise caution and |
|
implement appropriate content safety safeguards based on their specific |
|
product policies and application use cases. |
|
* **Misuse for malicious purposes:** Technical limitations and developer and |
|
end-user education can help mitigate against malicious applications of VLMs.
|
Educational resources and reporting mechanisms for users to flag misuse are |
|
provided. Prohibited uses of Gemma models are outlined in the [Gemma |
|
Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). |
|
* **Privacy violations:** Models were trained on data filtered to remove certain personal information and sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques. |
|
|
|
### Limitations |
|
|
|
* Most limitations inherited from the underlying Gemma model still apply: |
|
* VLMs are better at tasks that can be framed with clear prompts and |
|
instructions. Open-ended or highly complex tasks might be challenging. |
|
* Natural language is inherently complex. VLMs might struggle to grasp |
|
subtle nuances, sarcasm, or figurative language. |
|
* VLMs generate responses based on information they learned from their |
|
training datasets, but they are not knowledge bases. They may generate |
|
incorrect or outdated factual statements. |
|
* VLMs rely on statistical patterns in language and images. They might |
|
lack the ability to apply common sense reasoning in certain situations. |
|
* PaliGemma was designed first and foremost to serve as a general pre-trained |
|
model for transfer to specialized tasks. Hence, its "out of the box" or |
|
"zero-shot" performance might lag behind models designed specifically for |
|
those uses.
|
* PaliGemma is not a multi-turn chatbot. It is designed for a single round of |
|
image and text input. |