---
library_name: transformers
license: gemma
datasets:
- damerajee/hindi_VQA
language:
- en
- hi
pipeline_tag: image-text-to-text
---

# Model Details

This model was fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) to handle an input image together with bilingual (Hindi and English) text sequences for both input and output.

# Training Details

* Model Configuration: fine-tuned for two epochs on a single V100 GPU.
* Training Duration: roughly half a day.
* Evaluation Loss: 1.32 at the end of training.

# Dataset

**To find out more about the dataset, please head to** [damerajee/hindi_VQA](https://huggingface.co/datasets/damerajee/hindi_VQA)

# Example output

| Image | Example |
|-------|---------|
| | **Describe this image in great details answer in hindi**<br>चित्र में एक व्यक्ति को हरे-भरा मैदान के बीचोबीच, पानी का सिर पर छाता पकड़े हुए देखा गया है। उनका शरीर ऊपर उठा हुआ दिखाई देता था जो उन्हें नीचे की ओर खींचती हुई प्रकृति से बचाया जा रहा होता होगा ।जमीन नदी जैसा तालाब या झील जैसे आकार और रंगों द्वारा बनाई गई थी. ये पेड़ अपनी गहरी शाखाओं वाला संतुलित ढेर हैं क्योंकि वे इमारatsकी मरम्मत करते समय भी अपना स्थान संरक्षित कर रहे थे - कुछ बाहर निकले गए तो अन्य भागे रहते ही वापस फेंके जाते देख सकते होंगे! पृष्ठष्ठीय प्रकाश व्यवस्था ने दृश्य समग्र रूप प्रदान किया है: नीली चमत्कारी आसमान जिससे बादलों तक पहुंचने लगकर भूखंजन बन जाता सूर्य वर्गी |
| | **What is the scenery of this image?**<br>Here we can see an animation of a forest. In the foreground there are rocks, trees and plants on ground with green leaves. Behind it's looking like buildings or houses which is made up of bricks at top right side corner of picture where light colored something seems happening over here such as lighting effect might be present under these objects where tree trunk appears yellow color from bottom towards middle its looks redish orange colour while other branches appear yellowish brown all around them after that. |

# How to Use

```python
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
```

### To Run the model on a single T4 GPU in Float16

```python
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
import torch
import requests
from PIL import Image
from io import BytesIO

# URL of the image
url = "https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/examples/astronaut.png"

# Download the image and open it with PIL
response = requests.get(url)
image_file = BytesIO(response.content)
image = Image.open(image_file)

text = "What is this image about answer in hindi "

device_index = torch.cuda.current_device()
print("device_index:", device_index)

base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/Paligemma-hindi-chat-v1.0",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/Paligemma-hindi-chat-v1.0")

inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate.
# We use sampling here; for other decoding strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# Next we turn the predicted token IDs back into a string with batch_decode.
# We chop off the prompt, which consists of the image tokens and our text prompt.
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
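If you would rather not count image and text tokens by hand, a shorter way to drop the prompt is to slice the generated sequence by the length of `input_ids`, since `generate` returns the prompt followed by the newly generated tokens. The snippet below is a minimal sketch, not part of the original example; it reuses `inputs`, `generated_ids`, and `processor` from the code above.

```python
# Slice off the prompt (image tokens + text tokens) by its total length
# instead of counting the pieces separately, then decode only the new tokens.
prompt_len = inputs["input_ids"].shape[-1]
new_tokens = generated_ids[0][prompt_len:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```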
### To Run the model on a single T4 GPU in 4-bit

```python
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch
import requests
from PIL import Image
from io import BytesIO

# URL of the image
url = "https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/examples/mario.png"

# Download the image and open it with PIL
response = requests.get(url)
image_file = BytesIO(response.content)
image = Image.open(image_file)

text = "Describe this image and tell me about these character"

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Load the model in 4-bit with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/Paligemma-hindi-chat-v1.0",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/Paligemma-hindi-chat-v1.0")

inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate.
# We use sampling here; for other decoding strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# Next we turn the predicted token IDs back into a string with batch_decode.
# We chop off the prompt, which consists of the image tokens and our text prompt.
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
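For repeated queries, it can be convenient to wrap the preprocess/generate/decode steps in a small helper. The function below is a sketch rather than part of the original card: it assumes `base_model`, `processor`, and `image` from either snippet above, keeps the same sampling settings, and trims the prompt by its token length before decoding.

```python
def ask(image, question, max_new_tokens=200):
    """Run one image/question pair through the fine-tuned model and return the answer text."""
    inputs = processor(text=question, images=image, return_tensors="pt").to("cuda")
    prompt_len = inputs["input_ids"].shape[-1]
    output_ids = base_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=2.0,
    )
    # generate returns prompt + completion; keep only the newly generated tokens
    return processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)

# The model is bilingual, so the same image can be queried in Hindi or English:
print(ask(image, "इस चित्र में क्या हो रहा है?"))
print(ask(image, "What is happening in this image?"))
```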