musashihinck committed
Commit 22045a9 · 1 Parent(s): f8bf4da
Updating preprocessor config to LlavaProcessor.py

Changed files:
- README.md +12 -18
- preprocessor_config.json +1 -1
README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 language:
-- en
+- en
 license_name: gemma-terms
 license_link: https://ai.google.dev/gemma/terms
 ---
@@ -19,18 +19,17 @@ Preprint: [arxiv.org/abs/2404.01331](https://arxiv.org/abs/2404.01331)
 
 The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.
 
-
 ## Bias, Risks, and Limitations
 
 This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
 
-
 ## How to Get Started with the Model
 
 Currently using `llava-gemma` requires a [modified preprocessor](https://huggingface.co/Intel/llava-gemma-2b/blob/main/processing_llavagemma.py).
 
-
+_We are currently working on modifying the `LlavaProcessor` class to streamline usage (see [PR #30030](https://github.com/huggingface/transformers/pull/30030)), expect updates soon._
 
+For current usage, see [`usage.py`](/usage.py) or the following code block:
 
 ```python
 import requests
@@ -62,7 +61,7 @@ url = "https://www.ilankelman.org/stopsigns/australia.jpg"
 image = Image.open(requests.get(url, stream=True).raw)
 inputs = processor(text=prompt, images=image, return_tensors="pt")
 inputs = {k: v.to('cuda') for k, v in inputs.items()}
-
+
 # Generate
 generate_ids = model.generate(**inputs, max_length=30)
 output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
@@ -70,14 +69,10 @@ print(output)
 
 ```
 
-
-
-
 ## Training Details
 
 The `llava-gemma-2b` model was trained on 8 Gaudi 2 accelerators.
 
-
 ### Training Data
 
 The model was trained using the LLaVA-v1.5 data mixture.
@@ -89,14 +84,13 @@ This is listed as follows:
 - 450K academic-task-oriented VQA data mixture.
 - 40K ShareGPT data.
 
-
 ## Evaluation
 
-| LM Backbone
-
-| gemma-2b-it
-| gemma-2b-it
-| gemma-7b-it
-| gemma-7b-it
-| gemma-2b-it
-| gemma-2b-it
+| LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | TextVQA | ScienceQA Image | MMVP |
+| ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | ------- | --------------- | ----- |
+| gemma-2b-it | CLIP | Yes | 0.531 | 236.071 | 1130.492 | 17.706 | 0.850 | 0.839 | 70.65 | 28.06 | 0.564 | 0.287 |
+| gemma-2b-it | CLIP | No | 0.481 | 247.857 | 934.611 | 13.119 | 0.784 | 0.762 | 61.74 | | 0.549 | 0.180 |
+| gemma-7b-it | CLIP | Yes | 0.472 | 253.571 | 894.910 | 18.165 | 0.848 | 0.829 | 68.7 | | 0.625 | 0.327 |
+| gemma-7b-it | CLIP | No | 0.472 | 278.214 | 857.274 | 19.083 | 0.782 | 0.734 | 65.09 | | 0.636 | 0.240 |
+| gemma-2b-it | DinoV2 | Yes | 0.587 | 307.143 | 1132.970 | 19.128 | 0.853 | 0.838 | 71.37 | 12.53 | 0.555 | 0.227 |
+| gemma-2b-it | DinoV2 | No | 0.501 | 308.929 | 959.351 | 14.541 | 0.793 | 0.772 | 61.65 | 11.1 | 0.568 | 0.180 |
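The hunks above show only fragments of the README's usage snippet. A self-contained version of that example might look like the sketch below; the `LlavaGemmaProcessor` import from the repository's `processing_llavagemma.py`, the `LlavaForConditionalGeneration` and `CLIPImageProcessor` loading calls, and the chat-template prompt are assumptions filled in from context rather than lines shown in this diff.

```python
import requests
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, LlavaForConditionalGeneration

# Assumption: the modified processor class ships with the repository as
# processing_llavagemma.py; download it alongside this script first.
from processing_llavagemma import LlavaGemmaProcessor

checkpoint = "Intel/llava-gemma-2b"

# Load the model and assemble the processor from the checkpoint's tokenizer
# and CLIP image processor.
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint),
)
model.to("cuda")

# Build a Gemma-style chat prompt that contains the <image> placeholder.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Fetch the example image used in the README snippet.
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess, move tensors to the GPU, and generate.
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output)
```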
preprocessor_config.json CHANGED

@@ -36,7 +36,7 @@
 0.26130258,
 0.27577711
 ],
-"processor_class": "
+"processor_class": "LlavaProcessor",
 "resample": 3,
 "rescale_factor": 0.00392156862745098,
 "size": {
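With `processor_class` set to `LlavaProcessor`, `AutoProcessor` can resolve the checkpoint to the stock `transformers` class instead of the repository's custom `processing_llavagemma.py`. A minimal sketch of the intended usage after this change, assuming the `LlavaProcessor` updates referenced in PR #30030 are present in the installed `transformers` version:

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

checkpoint = "Intel/llava-gemma-2b"

# AutoProcessor reads "processor_class" from preprocessor_config.json and
# instantiates the named class, so this should now return a LlavaProcessor.
processor = AutoProcessor.from_pretrained(checkpoint)
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
print(type(processor).__name__)  # expected: LlavaProcessor
```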