musashihinck committed
Commit 22045a9 · 1 Parent(s): f8bf4da
Updating preprocessor config to LlavaProcessor.py

Changed files:
- README.md +12 -18
- preprocessor_config.json +1 -1
README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 language:
-- en
+- en
 license_name: gemma-terms
 license_link: https://ai.google.dev/gemma/terms
 ---
@@ -19,18 +19,17 @@ Preprint: [arxiv.org/abs/2404.01331](https://arxiv.org/abs/2404.01331)
 
 The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.
 
-
 ## Bias, Risks, and Limitations
 
 This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
 
-
 ## How to Get Started with the Model
 
 Currently using `llava-gemma` requires a [modified preprocessor](https://huggingface.co/Intel/llava-gemma-2b/blob/main/processing_llavagemma.py).
 
-
+_We are currently working on modifying the `LlavaProcessor` class to streamline usage (see [PR #30030](https://github.com/huggingface/transformers/pull/30030)), expect updates soon._
 
+For current usage, see [`usage.py`](/usage.py) or the following code block:
 
 ```python
 import requests
@@ -62,7 +61,7 @@ url = "https://www.ilankelman.org/stopsigns/australia.jpg"
 image = Image.open(requests.get(url, stream=True).raw)
 inputs = processor(text=prompt, images=image, return_tensors="pt")
 inputs = {k: v.to('cuda') for k, v in inputs.items()}
-
+
 # Generate
 generate_ids = model.generate(**inputs, max_length=30)
 output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
@@ -70,14 +69,10 @@ print(output)
 
 ```
 
-
-
-
 ## Training Details
 
 The `llava-gemma-2b` model was trained on 8 Gaudi 2 accelerators.
 
-
 ### Training Data
 
 The model was trained using the LLaVA-v1.5 data mixture.
@@ -89,14 +84,13 @@ This is listed as follows:
 - 450K academic-task-oriented VQA data mixture.
 - 40K ShareGPT data.
 
-
 ## Evaluation
 
-| LM Backbone
-
-| gemma-2b-it
-| gemma-2b-it
-| gemma-7b-it
-| gemma-7b-it
-| gemma-2b-it
-| gemma-2b-it
+| LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | TextVQA | ScienceQA Image | MMVP |
+| ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | ------- | --------------- | ----- |
+| gemma-2b-it | CLIP | Yes | 0.531 | 236.071 | 1130.492 | 17.706 | 0.850 | 0.839 | 70.65 | 28.06 | 0.564 | 0.287 |
+| gemma-2b-it | CLIP | No | 0.481 | 247.857 | 934.611 | 13.119 | 0.784 | 0.762 | 61.74 | | 0.549 | 0.180 |
+| gemma-7b-it | CLIP | Yes | 0.472 | 253.571 | 894.910 | 18.165 | 0.848 | 0.829 | 68.7 | | 0.625 | 0.327 |
+| gemma-7b-it | CLIP | No | 0.472 | 278.214 | 857.274 | 19.083 | 0.782 | 0.734 | 65.09 | | 0.636 | 0.240 |
+| gemma-2b-it | DinoV2 | Yes | 0.587 | 307.143 | 1132.970 | 19.128 | 0.853 | 0.838 | 71.37 | 12.53 | 0.555 | 0.227 |
+| gemma-2b-it | DinoV2 | No | 0.501 | 308.929 | 959.351 | 14.541 | 0.793 | 0.772 | 61.65 | 11.1 | 0.568 | 0.180 |
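The hunks above show only fragments of the README's usage snippet. A self-contained version of that example might look like the sketch below; the `LlavaGemmaProcessor` import from the repository's `processing_llavagemma.py`, the `LlavaForConditionalGeneration` and `CLIPImageProcessor` loading calls, and the chat-template prompt are assumptions filled in from context rather than lines shown in this diff.

```python
import requests
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, LlavaForConditionalGeneration

# Assumption: the modified processor class ships with the repository as
# processing_llavagemma.py; download it alongside this script first.
from processing_llavagemma import LlavaGemmaProcessor

checkpoint = "Intel/llava-gemma-2b"

# Load the model and assemble the processor from the checkpoint's tokenizer
# and CLIP image processor.
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint),
)
model.to("cuda")

# Build a Gemma-style chat prompt that contains the <image> placeholder.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Fetch the example image used in the README snippet.
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess, move tensors to the GPU, and generate.
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output)
```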
preprocessor_config.json CHANGED

@@ -36,7 +36,7 @@
 0.26130258,
 0.27577711
 ],
-"processor_class": "
+"processor_class": "LlavaProcessor",
 "resample": 3,
 "rescale_factor": 0.00392156862745098,
 "size": {
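With `processor_class` set to `LlavaProcessor`, `AutoProcessor` can resolve the checkpoint to the stock `transformers` class instead of the repository's custom `processing_llavagemma.py`. A minimal sketch of the intended usage after this change, assuming the `LlavaProcessor` updates referenced in PR #30030 are present in the installed `transformers` version:

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

checkpoint = "Intel/llava-gemma-2b"

# AutoProcessor reads "processor_class" from preprocessor_config.json and
# instantiates the named class, so this should now return a LlavaProcessor.
processor = AutoProcessor.from_pretrained(checkpoint)
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
print(type(processor).__name__)  # expected: LlavaProcessor
```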