pcuenq (HF staff) committed
Commit 99f3b66
1 Parent(s): 9beb8bd

Metadata, transformers usage examples (#1)


- Metadata, transformers usage examples (5141ea04ae6c3ef5614b39e191e554e28174472b)

Files changed (1):
  1. README.md +800 -156
README.md CHANGED
@@ -1,199 +1,843 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
-
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]

  ---
+ license: gemma
  library_name: transformers
+ extra_gated_heading: Access PaliGemma on Hugging Face
+ extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review
+   and agree to Google’s usage license. To do this, please ensure you’re logged-in
+   to Hugging Face and click below. Requests are processed immediately.
+ extra_gated_button_content: Acknowledge license
+ pipeline_tag: image-text-to-text
  ---
+ # PaliGemma model card
+
+ **Model page:** [PaliGemma](https://ai.google.dev/gemma/docs/paligemma)
+
+ Transformers PaliGemma 3B weights, pre-trained with 224x224 input images and 128-token input/output text sequences. The models are available in float32, bfloat16 and float16 formats for fine-tuning.
+
+ **Resources and technical documentation:**
+
+ * [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
+ * [PaliGemma on Kaggle](https://www.kaggle.com/models/google/paligemma)
+ * [PaliGemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/363)
+
+ **Terms of Use:** [Terms](https://ai.google.dev/gemma/terms)
+
+ **Authors:** Google
+
+ ## Model information
+
+ ### Model summary
+
+ #### Description
+
+ PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by
+ [PaLI-3](https://arxiv.org/abs/2310.09199) and based on open components such as
+ the [SigLIP vision model](https://arxiv.org/abs/2303.15343) and the [Gemma
+ language model](https://arxiv.org/abs/2403.08295). It takes both image and text
+ as input and generates text as output, supporting multiple languages. It is designed for class-leading fine-tune performance on a wide range of vision-language tasks such as image and short video captioning, visual question answering, text reading, object detection and object segmentation.
+
+ #### Model architecture
+
+ PaliGemma is the composition of a [Transformer
+ decoder](https://arxiv.org/abs/1706.03762) and a [Vision Transformer image
+ encoder](https://arxiv.org/abs/2010.11929), with a total of 3 billion
+ params. The text decoder is initialized from
+ [Gemma-2B](https://www.kaggle.com/models/google/gemma). The image encoder is
+ initialized from
+ [SigLIP-So400m/14](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb).
+ PaliGemma is trained following the PaLI-3 recipes.
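+
+ As a minimal sketch (not part of the original card, and assuming you have accepted the
+ license so the files can be downloaded), the two components are visible in the
+ `transformers` configuration object:
+
+ ```python
+ from transformers import AutoConfig
+
+ # The checkpoint id below is just an example; any PaliGemma repository works.
+ config = AutoConfig.from_pretrained("google/paligemma-3b-mix-224")
+
+ # Composite config: a SigLIP vision encoder plus a Gemma text decoder.
+ print(config.vision_config.model_type)  # e.g. "siglip_vision_model"
+ print(config.text_config.model_type)    # e.g. "gemma"
+ print(config.vision_config.image_size)  # input resolution, e.g. 224 for this checkpoint
+ ```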
+
+ #### Inputs and outputs
+
+ * **Input:** Image and text string, such as a prompt to caption the image, or
+   a question.
+ * **Output:** Generated text in response to the input, such as a caption of
+   the image, an answer to a question, a list of object bounding box
+   coordinates, or segmentation codewords.
+
+ ### Model data
+
+ #### Pre-train datasets
+
+ PaliGemma is pre-trained on the following mixture of datasets:
+
+ * **WebLI:** [WebLI (Web Language Image)](https://arxiv.org/abs/2209.06794) is
+   a web-scale multilingual image-text dataset built from the public web. A
+   wide range of WebLI splits are used to acquire versatile model capabilities,
+   such as visual semantic understanding, object localization,
+   visually-situated text understanding, multilinguality, etc.
+ * **CC3M-35L:** Curated English image-alt_text pairs from webpages ([Sharma et
+   al., 2018](https://aclanthology.org/P18-1238/)). We used the [Google Cloud
+   Translation API](https://cloud.google.com/translate) to translate into 34
+   additional languages.
+ * **VQ²A-CC3M-35L/VQG-CC3M-35L:** A subset of VQ2A-CC3M ([Changpinyo et al.,
+   2022a](https://aclanthology.org/2022.naacl-main.142/)), translated into the
+   same additional 34 languages as CC3M-35L, using the [Google Cloud
+   Translation API](https://cloud.google.com/translate).
+ * **OpenImages:** Detection and object-aware questions and answers
+   ([Piergiovanni et al. 2022](https://arxiv.org/abs/2209.04372)) generated by
+   handcrafted rules on the [OpenImages dataset].
+ * **WIT:** Images and texts collected from Wikipedia ([Srinivasan et al.,
+   2021](https://arxiv.org/abs/2103.01913)).
+
+ [OpenImages dataset]: https://storage.googleapis.com/openimages/web/factsfigures_v7.html
+
+ #### Data responsibility filtering
+
+ The following filters are applied to WebLI, with the goal of training PaliGemma
+ on clean data:
+
+ * **Pornographic image filtering:** This filter removes images deemed to be of
+   a pornographic nature.
+ * **Text safety filtering:** We identify and filter out images that are paired
+   with unsafe text. Unsafe text is any text deemed to contain or be about
+   CSAI, pornography, vulgarities, or otherwise offensive.
+ * **Text toxicity filtering:** We further use the [Perspective
+   API](https://perspectiveapi.com/) to identify and filter out images that are
+   paired with text deemed insulting, obscene, hateful or otherwise toxic.
+ * **Text personal information filtering:** We filtered certain personal information and other sensitive data using the [Cloud Data Loss Prevention (DLP)
+   API](https://cloud.google.com/security/products/dlp) to protect the privacy
+   of individuals. Identifiers such as social security numbers and [other sensitive information types] were removed.
+ * **Additional methods:** Filtering based on content quality and safety in
+   line with our policies and practices.
+
+ [other sensitive information types]: https://cloud.google.com/sensitive-data-protection/docs/high-sensitivity-infotypes-reference
+
+ ## How to Use
+
+ PaliGemma is a single-turn vision-language model not meant for conversational use,
+ and it works best when fine-tuned to a specific use case.
+
+ You can configure which task the model will solve by conditioning it with task prefixes,
+ such as “detect” or “segment”. The pretrained models were trained in this fashion to imbue
+ them with a rich set of capabilities (question answering, captioning, segmentation, etc.).
+ However, they are not designed to be used directly, but to be transferred (by fine-tuning)
+ to specific tasks using a similar prompt structure. For interactive testing, you can use
+ the "mix" family of models, which have been fine-tuned on a mixture of tasks.
+
+ Please refer to the [usage and limitations section](#usage-and-limitations) for intended
+ use cases, or visit the [blog post](https://huggingface.co/blog/paligemma-google-vlm) for
+ additional details and examples.
+
+ ## Use in Transformers
+
+ The following snippets use model `google/paligemma-3b-mix-224` for reference purposes.
+ The model in this repo you are now browsing may have been trained for other tasks; please
+ make sure you use appropriate inputs for the task at hand.
+
+ ### Running the default precision (`float32`) on CPU
+
+ ```python
+ from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+ from PIL import Image
+ import requests
+ import torch
+
+ model_id = "google/paligemma-3b-mix-224"
+
+ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Instruct the model to create a caption in Spanish
+ prompt = "caption es"
+ model_inputs = processor(text=prompt, images=image, return_tensors="pt")
+ input_len = model_inputs["input_ids"].shape[-1]
+
+ with torch.inference_mode():
+     generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
+     generation = generation[0][input_len:]
+     decoded = processor.decode(generation, skip_special_tokens=True)
+     print(decoded)
+ ```
+
+ Output: `Un auto azul estacionado frente a un edificio.`
+
+ ### Running other precisions on CUDA
+
+ For convenience, the repos contain revisions of the weights already converted to `bfloat16` and `float16`,
+ so you can use them to reduce the download size and avoid casting on your local computer.
+
+ This is how you'd run `bfloat16` on an NVIDIA CUDA card.
+
+ ```python
+ from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+ from PIL import Image
+ import requests
+ import torch
+
+ model_id = "google/paligemma-3b-mix-224"
+ device = "cuda:0"
+ dtype = torch.bfloat16
+
+ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ model = PaliGemmaForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=dtype,
+     device_map=device,
+     revision="bfloat16",
+ ).eval()
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Instruct the model to create a caption in Spanish
+ prompt = "caption es"
+ model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+ input_len = model_inputs["input_ids"].shape[-1]
+
+ with torch.inference_mode():
+     generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
+     generation = generation[0][input_len:]
+     decoded = processor.decode(generation, skip_special_tokens=True)
+     print(decoded)
+ ```
+
+ ### Loading in 4-bit / 8-bit
+
+ You need to install `bitsandbytes` to automatically run inference using 8-bit or 4-bit precision:
+
+ ```
+ pip install bitsandbytes accelerate
+ ```
+
+ ```python
+ from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
+ from PIL import Image
+ import requests
+ import torch
+
+ model_id = "google/paligemma-3b-mix-224"
+ device = "cuda:0"
+ dtype = torch.bfloat16
+
+ url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+
+ model = PaliGemmaForConditionalGeneration.from_pretrained(
+     model_id, quantization_config=quantization_config
+ ).eval()
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Instruct the model to create a caption in Spanish
+ prompt = "caption es"
+ model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+ input_len = model_inputs["input_ids"].shape[-1]
+
+ with torch.inference_mode():
+     generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
+     generation = generation[0][input_len:]
+     decoded = processor.decode(generation, skip_special_tokens=True)
+     print(decoded)
+ ```
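+
+ The snippet above loads the model in 8-bit. As a minimal sketch (not part of the original
+ card), loading in 4-bit only requires changing the quantization configuration:
+
+ ```python
+ # Hypothetical 4-bit variant of the configuration used above.
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bfloat16
+ )
+ ```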
+
+ ## Implementation information
+
+ ### Hardware
+
+ PaliGemma was trained using the latest generation of Tensor Processing Unit
+ (TPU) hardware (TPUv5e).
+
+ ### Software
+
+ Training was done using [JAX](https://github.com/google/jax),
+ [Flax](https://github.com/google/flax),
+ [TFDS](https://github.com/tensorflow/datasets) and
+ [`big_vision`](https://github.com/google-research/big_vision).
+
+ JAX allows researchers to take advantage of the latest generation of hardware,
+ including TPUs, for faster and more efficient training of large models.
+
+ TFDS is used to access datasets and Flax is used for model architecture. The
+ PaliGemma fine-tune code and inference code are released in the `big_vision`
+ GitHub repository.
+
+ ## Evaluation information
+
+ ### Benchmark results
+
+ In order to verify the transferability of PaliGemma to a wide variety of
+ academic tasks, we fine-tune the pretrained models on each task. Additionally we
+ train the mix model with a mixture of the transfer tasks. We report results on
+ different resolutions to provide an impression of which tasks benefit from
+ increased resolution. Importantly, none of these tasks or datasets are part of
+ the pretraining data mixture, and their images are explicitly removed from the
+ web-scale pre-training data.
+
+ #### Single task (fine-tune on single task)
+
+ <table>
+ <tbody><tr>
+ <th>Benchmark<br>(train split)</th>
+ <th>Metric<br>(split)</th>
+ <th>pt-224</th>
+ <th>pt-448</th>
+ <th>pt-896</th>
+ </tr>
+ <tr>
+ <th>Captioning</th>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://cocodataset.org/#home">COCO captions</a><br>(train+restval)
+ </td>
+ <td>CIDEr (val)</td>
+ <td>141.92</td>
+ <td>144.60</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://nocaps.org/">NoCaps</a><br>(Eval of COCO<br>captions transfer)
+ </td>
+ <td>CIDEr (val)</td>
+ <td>121.72</td>
+ <td>123.58</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/pdf/2205.12522">COCO-35L</a><br>(train)
+ </td>
+ <td>CIDEr dev<br>(en/avg-34/avg)</td>
+ <td>
+ 139.2<br>
+ 115.8<br>
+ 116.4
+ </td>
+ <td>
+ 141.2<br>
+ 118.0<br>
+ 118.6
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/pdf/2205.12522">XM3600</a><br>(Eval of COCO-35L transfer)
+ </td>
+ <td>CIDEr dev<br>(en/avg-34/avg)</td>
+ <td>
+ 78.1<br>
+ 41.3<br>
+ 42.4
+ </td>
+ <td>
+ 80.0<br>
+ 41.9<br>
+ 42.9
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://textvqa.org/textcaps/">TextCaps</a><br>(train)
+ </td>
+ <td>CIDEr (val)</td>
+ <td>127.48</td>
+ <td>153.94</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2110.11624">SciCap</a><br>(first sentence, no subfigure)<br>(train+val)
+ </td>
+ <td>CIDEr/BLEU-4<br>(test)</td>
+ <td>
+ 162.25<br>
+ 0.192<br>
+ </td>
+ <td>
+ 181.49<br>
+ 0.211<br>
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2108.03353">Screen2words</a><br>(train+dev)
+ </td>
+ <td>CIDEr (test)</td>
+ <td>117.57</td>
+ <td>119.59</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2010.04295">Widget Captioning</a><br>(train+dev)
+ </td>
+ <td>CIDEr (test)</td>
+ <td>136.07</td>
+ <td>148.36</td>
+ </tr>
+ <tr>
+ <th>Question answering</th>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://visualqa.org/index.html">VQAv2</a><br>(train+validation)
+ </td>
+ <td>Accuracy<br>(Test server - std)</td>
+ <td>83.19</td>
+ <td>85.64</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2401.06209">MMVP</a><br>(Eval of VQAv2 transfer)
+ </td>
+ <td>Paired Accuracy</td>
+ <td>47.33</td>
+ <td>45.33</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2305.10355">POPE</a><br>(Eval of VQAv2 transfer)
+ </td>
+ <td>Accuracy<br>(random/popular/<br>adversarial)</td>
+ <td>
+ 87.80<br>
+ 85.87<br>
+ 84.27
+ </td>
+ <td>
+ 88.23<br>
+ 86.77<br>
+ 85.90
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://okvqa.allenai.org/">OKVQA</a><br>(train)
+ </td>
+ <td>Accuracy (val)</td>
+ <td>63.54</td>
+ <td>63.15</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://allenai.org/project/a-okvqa/home">A-OKVQA</a> (MC)<br>(train+val)
+ </td>
+ <td>Accuracy<br>(Test server)</td>
+ <td>76.37</td>
+ <td>76.90</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://allenai.org/project/a-okvqa/home">A-OKVQA</a> (DA)<br>(train+val)
+ </td>
+ <td>Accuracy<br>(Test server)</td>
+ <td>61.85</td>
+ <td>63.22</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://cs.stanford.edu/people/dorarad/gqa/about.html">GQA</a><br>(train_balanced+<br>val_balanced)
+ </td>
+ <td>Accuracy<br>(testdev balanced)</td>
+ <td>65.61</td>
+ <td>67.03</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://aclanthology.org/2022.findings-acl.196/">xGQA</a><br>(Eval of GQA transfer)
+ </td>
+ <td>Mean Accuracy<br>(bn, de, en, id,<br>ko, pt, ru, zh)</td>
+ <td>58.37</td>
+ <td>59.07</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://lil.nlp.cornell.edu/nlvr/">NLVR2</a><br>(train+dev)
+ </td>
+ <td>Accuracy (test)</td>
+ <td>90.02</td>
+ <td>88.93</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://marvl-challenge.github.io/">MaRVL</a><br>(Eval of NLVR2 transfer)
+ </td>
+ <td>Mean Accuracy<br>(test)<br>(id, sw, ta, tr, zh)</td>
+ <td>80.57</td>
+ <td>76.78</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://allenai.org/data/diagrams">AI2D</a><br>(train)
+ </td>
+ <td>Accuracy (test)</td>
+ <td>72.12</td>
+ <td>73.28</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://scienceqa.github.io/">ScienceQA</a><br>(Img subset, no CoT)<br>(train+val)
+ </td>
+ <td>Accuracy (test)</td>
+ <td>95.39</td>
+ <td>95.93</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://zenodo.org/records/6344334">RSVQA-LR</a> (Non numeric)<br>(train+val)
+ </td>
+ <td>Mean Accuracy<br>(test)</td>
+ <td>92.65</td>
+ <td>93.11</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://zenodo.org/records/6344367">RSVQA-HR</a> (Non numeric)<br>(train+val)
+ </td>
+ <td>Mean Accuracy<br>(test/test2)</td>
+ <td>
+ 92.61<br>
+ 90.58
+ </td>
+ <td>
+ 92.79<br>
+ 90.54
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2203.10244">ChartQA</a><br>(human+aug)x(train+val)
+ </td>
+ <td>Mean Relaxed<br>Accuracy<br>(test_human,<br>test_aug)</td>
+ <td>57.08</td>
+ <td>71.36</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://vizwiz.org/tasks-and-datasets/vqa/">VizWiz VQA</a><br>(train+val)
+ </td>
+ <td>Accuracy<br>(Test server - std)</td>
+ <td>
+ 73.7
+ </td>
+ <td>
+ 75.52
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/1810.12440">TallyQA</a><br>(train)
+ </td>
+ <td>Accuracy<br>(test_simple/<br>test_complex)</td>
+ <td>
+ 81.72<br>
+ 69.56
+ </td>
+ <td>
+ 84.86<br>
+ 72.27
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://ocr-vqa.github.io/">OCR-VQA</a><br>(train+val)
+ </td>
+ <td>Accuracy (test)</td>
+ <td>72.32</td>
+ <td>74.61</td>
+ <td>74.93</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://textvqa.org/">TextVQA</a><br>(train+val)
+ </td>
+ <td>Accuracy<br>(Test server - std)</td>
+ <td>55.47</td>
+ <td>73.15</td>
+ <td>76.48</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://www.docvqa.org/">DocVQA</a><br>(train+val)
+ </td>
+ <td>ANLS (Test server)</td>
+ <td>43.74</td>
+ <td>78.02</td>
+ <td>84.77</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://openaccess.thecvf.com/content/WACV2022/papers/Mathew_InfographicVQA_WACV_2022_paper.pdf">Infographic VQA</a><br>(train+val)
+ </td>
+ <td>ANLS (Test server)</td>
+ <td>28.46</td>
+ <td>40.47</td>
+ <td>47.75</td>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/1905.13648">SceneText VQA</a><br>(train+val)
+ </td>
+ <td>ANLS (Test server)</td>
+ <td>63.29</td>
+ <td>81.82</td>
+ <td>84.40</td>
+ </tr>
+ <tr>
+ <th>Segmentation</th>
+ </tr>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/1608.00272">RefCOCO</a><br>(combined refcoco, refcoco+,<br>refcocog excluding val<br>and test images)
+ </td>
+ <td>MIoU<br>(validation)<br>refcoco/refcoco+/<br>refcocog</td>
+ <td>
+ 73.40<br>
+ 68.32<br>
+ 67.65
+ </td>
+ <td>
+ 75.57<br>
+ 69.76<br>
+ 70.17
+ </td>
+ <td>
+ 76.94<br>
+ 72.18<br>
+ 72.22
+ </td>
+ </tr>
+ <tr>
+ <th>Video tasks (Caption/QA)</th>
+ </tr>
+ <tr>
+ <td>MSR-VTT (Captioning)</td>
+ <td>CIDEr (test)</td>
+ <td>70.54</td>
+ </tr>
+ <tr>
+ <td>MSR-VTT (QA)</td>
+ <td>Accuracy (test)</td>
+ <td>50.09</td>
+ </tr>
+ <tr>
+ <td>ActivityNet (Captioning)</td>
+ <td>CIDEr (test)</td>
+ <td>34.62</td>
+ </tr>
+ <tr>
+ <td>ActivityNet (QA)</td>
+ <td>Accuracy (test)</td>
+ <td>50.78</td>
+ </tr>
+ <tr>
+ <td>VATEX (Captioning)</td>
+ <td>CIDEr (test)</td>
+ <td>79.73</td>
+ </tr>
+ <tr>
+ <td>MSVD (QA)</td>
+ <td>Accuracy (test)</td>
+ <td>60.22</td>
+ </tr>
+ </tbody></table>
+
+ #### Mix model (fine-tune on mixture of transfer tasks)
+
+ <table>
+ <tbody><tr>
+ <th>Benchmark</th>
+ <th>Metric (split)</th>
+ <th>mix-224</th>
+ <th>mix-448</th>
+ </tr>
+ <tr>
+ <td><a href="https://arxiv.org/abs/2401.06209">MMVP</a></td>
+ <td>Paired Accuracy</td>
+ <td>46.00</td>
+ <td>45.33</td>
+ </tr>
+ <tr>
+ <td><a href="https://arxiv.org/abs/2305.10355">POPE</a></td>
+ <td>Accuracy<br>(random/popular/adversarial)</td>
+ <td>
+ 88.00<br>
+ 86.63<br>
+ 85.67
+ </td>
+ <td>
+ 89.37<br>
+ 88.40<br>
+ 87.47
+ </td>
+ </tr>
+ </tbody></table>
+
+ ## Ethics and safety
+
+ ### Evaluation approach
+
+ Our evaluation methods include structured evaluations and internal red-teaming
+ testing of relevant content policies. Red-teaming was conducted by a number of
+ different teams, each with different goals and human evaluation metrics. These
+ models were evaluated against a number of different categories relevant to
+ ethics and safety, including:
+
+ * Human evaluation on prompts covering child safety, content safety and
+   representational harms. See the [Gemma model
+   card](https://ai.google.dev/gemma/docs/model_card#evaluation_approach) for
+   more details on the evaluation approach, but with image captioning and visual
+   question answering setups.
+ * Image-to-Text benchmark evaluation: Benchmark against relevant academic
+   datasets such as the FairFace Dataset ([Karkkainen et al.,
+   2021](https://arxiv.org/abs/1908.04913)).
+
+ ### Evaluation results
+
+ * The human evaluation results of ethics and safety evaluations are within
+   acceptable thresholds for meeting [internal
+   policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11)
+   for categories such as child safety, content safety and representational
+   harms.
+ * On top of robust internal evaluations, we also use the Perspective API
+   (threshold of 0.8) to measure toxicity, profanity, and other potential
+   issues in the generated captions for images sourced from the FairFace
+   dataset. We report the maximum and median values observed across subgroups
+   for each of the perceived gender, ethnicity, and age attributes.
+
+ <table>
+ <tbody><tr>
+ <th>Metric</th>
+ <th>Perceived<br>gender</th>
+ <th></th>
+ <th>Ethnicity</th>
+ <th></th>
+ <th>Age group</th>
+ <th></th>
+ </tr>
+ <tr>
+ <th></th>
+ <th>Maximum</th>
+ <th>Median</th>
+ <th>Maximum</th>
+ <th>Median</th>
+ <th>Maximum</th>
+ <th>Median</th>
+ </tr>
+ <tr>
+ <td>Toxicity</td>
+ <td>0.04%</td>
+ <td>0.03%</td>
+ <td>0.08%</td>
+ <td>0.00%</td>
+ <td>0.09%</td>
+ <td>0.00%</td>
+ </tr>
+ <tr>
+ <td>Identity Attack</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ </tr>
+ <tr>
+ <td>Insult</td>
+ <td>0.06%</td>
+ <td>0.04%</td>
+ <td>0.09%</td>
+ <td>0.07%</td>
+ <td>0.16%</td>
+ <td>0.00%</td>
+ </tr>
+ <tr>
+ <td>Threat</td>
+ <td>0.06%</td>
+ <td>0.05%</td>
+ <td>0.14%</td>
+ <td>0.05%</td>
+ <td>0.17%</td>
+ <td>0.00%</td>
+ </tr>
+ <tr>
+ <td>Profanity</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ <td>0.00%</td>
+ </tr>
+ </tbody></table>
+
+ ## Usage and limitations
+
+ ### Intended usage
+
+ Open Vision Language Models (VLMs) have a wide range of applications across
+ various industries and domains. The following list of potential uses is not
+ comprehensive. The purpose of this list is to provide contextual information
+ about the possible use-cases that the model creators considered as part of model
+ training and development.
+
+ Fine-tune on a specific vision-language task:
+
+ * The pre-trained models can be fine-tuned on a wide range of vision-language
+   tasks such as: image captioning, short video captioning, visual question
+   answering, text reading, object detection and object segmentation.
+ * The pre-trained models can be fine-tuned for specific domains such as remote
+   sensing question answering, visual questions from people who are blind,
+   science question answering, and describing UI element functionalities.
+ * The pre-trained models can be fine-tuned for tasks with non-textual outputs
+   such as bounding boxes or segmentation masks.
+
+ Vision-language research:
+
+ * The pre-trained models and fine-tuned models can serve as a foundation for researchers to experiment with VLM
+   techniques, develop algorithms, and contribute to the advancement of the
+   field.
+
+ ### Ethical considerations and risks
+
+ The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
+
+ * Bias and Fairness
+   * VLMs trained on large-scale, real-world image-text data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny; input data pre-processing is described and posterior evaluations are reported in this card.
+ * Misinformation and Misuse
+   * VLMs can be misused to generate text that is false, misleading, or harmful.
+   * Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
+ * Transparency and Accountability
+   * This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
+   * A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
+
+ Risks identified and mitigations:
+
+ * **Perpetuation of biases:** It's encouraged to perform continuous monitoring
+   (using evaluation metrics, human review) and the exploration of de-biasing
+   techniques during model training, fine-tuning, and other use cases.
+ * **Generation of harmful content:** Mechanisms and guidelines for content
+   safety are essential. Developers are encouraged to exercise caution and
+   implement appropriate content safety safeguards based on their specific
+   product policies and application use cases.
+ * **Misuse for malicious purposes:** Technical limitations and developer and
+   end-user education can help mitigate against malicious applications of LLMs.
+   Educational resources and reporting mechanisms for users to flag misuse are
+   provided. Prohibited uses of Gemma models are outlined in the [Gemma
+   Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
+ * **Privacy violations:** Models were trained on data filtered to remove certain personal information and sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
+
+ ### Limitations
+
+ * Most limitations inherited from the underlying Gemma model still apply:
+   * VLMs are better at tasks that can be framed with clear prompts and
+     instructions. Open-ended or highly complex tasks might be challenging.
+   * Natural language is inherently complex. VLMs might struggle to grasp
+     subtle nuances, sarcasm, or figurative language.
+   * VLMs generate responses based on information they learned from their
+     training datasets, but they are not knowledge bases. They may generate
+     incorrect or outdated factual statements.
+   * VLMs rely on statistical patterns in language and images. They might
+     lack the ability to apply common sense reasoning in certain situations.
+ * PaliGemma was designed first and foremost to serve as a general pre-trained
+   model for transfer to specialized tasks. Hence, its "out of the box" or
+   "zero-shot" performance might lag behind models designed specifically for
+   that purpose.
+ * PaliGemma is not a multi-turn chatbot. It is designed for a single round of
+   image and text input.