Model Details

This model was fine-tuned from the base model google/paligemma-3b-pt-224 to handle image inputs and bilingual (Hindi and English) text for both input and output.

Training Details

  • Model Configuration: Fine-tuned for two epochs on a single V100 GPU (a hedged sketch of a comparable setup follows below).
  • Training Duration: Roughly half a day.
  • Evaluation Loss: Reached an evaluation loss of 1.32 at the end of training.
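
For reference, here is a minimal sketch of what a comparable fine-tuning setup could look like with peft. The exact recipe for this checkpoint is not published, so the LoRA settings and hyperparameters below are illustrative assumptions only.

import torch
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration, TrainingArguments

# Load the base model this checkpoint was fine-tuned from.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224", torch_dtype=torch.float16
)

# LoRA keeps a 16 GB V100 within reach; rank/alpha/target modules are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Two epochs matches the training details above; the other values are guesses.
training_args = TrainingArguments(
    output_dir="paligemma-hindi-chat",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    fp16=True,
)
# (Dataset preprocessing and the Trainer/SFTTrainer wiring are omitted here.)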

Dataset

To find out more about the dataset, please head to damerajee/hindi_VQA.
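
To inspect the data programmatically, a minimal sketch with the datasets library; the split and column names are whatever the dataset actually ships with, so print one row to see the real schema rather than relying on assumptions:

from datasets import load_dataset

# Stream rows so nothing large is downloaded up front.
ds = load_dataset("damerajee/hindi_VQA", split="train", streaming=True)
print(next(iter(ds)))  # shows the column names and one example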

Example output

Image Example
Describe this image in great details answer in hindi
चित्र में एक व्यक्ति को हरे-भरा मैदान के बीचोबीच, पानी का सिर पर छाता पकड़े हुए देखा गया है। उनका शरीर ऊपर उठा हुआ दिखाई देता था जो उन्हें नीचे की ओर खींचती हुई प्रकृति से बचाया जा रहा होता होगा ।जमीन नदी जैसा तालाब या झील जैसे आकार और रंगों द्वारा बनाई गई थी. ये पेड़ अपनी गहरी शाखाओं वाला संतुलित ढेर हैं क्योंकि वे इमारatsकी मरम्मत करते समय भी अपना स्थान संरक्षित कर रहे थे - कुछ बाहर निकले गए तो अन्य भागे रहते ही वापस फेंके जाते देख सकते होंगे! पृष्ठष्ठीय प्रकाश व्यवस्था ने दृश्य समग्र रूप प्रदान किया है: नीली चमत्कारी आसमान जिससे बादलों तक पहुंचने लगकर भूखंजन बन जाता सूर्य वर्गी
What is the scenery of this image?
Here we can see an animation of a forest. In the foreground there are rocks, trees and plants on ground with green leaves. Behind it's looking like buildings or houses which is made up of bricks at top right side corner of picture where light colored something seems happening over here such as lighting effect might be present under these objects where tree trunk appears yellow color from bottom towards middle its looks redish orange colour while other branches appear yellowish brown all around them after that.

How to Use

!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade  # PaliGemma support requires a recent transformers release (4.41+)

To run the model on a single T4 GPU in float16:

from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
import torch
import requests
from PIL import Image
from io import BytesIO

# URL of the image
url = "https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/examples/astronaut.png"

# Use requests to get the image
response = requests.get(url)

# Use BytesIO to convert the response content to a file-like object
image_file = BytesIO(response.content)

# Use PIL to open the image file
image = Image.open(image_file)
text = "What is this image about answer in hindi "

device_index = torch.cuda.current_device()
print("device_index:",device_index)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/Paligemma-hindi-chat-v1.0",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/Paligemma-hindi-chat-v1.0")

inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_NEW_TOKENS = 200
# Autoregressively generate with sampling (temperature plus a repetition penalty);
# for other decoding strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=2.0,
)

# Turn the predicted token IDs back into a string with batch_decode.
# We chop off the prompt, which consists of the image tokens and our text prompt.
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
# The extra 2 covers the special tokens around the prompt (e.g. BOS and the newline separator).
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
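
Alternatively, since generate returns the prompt tokens followed by the newly generated ones, you can skip the manual token counting and slice at the input length. A minimal sketch reusing inputs, base_model, and processor from above:

# Everything past the prompt length is newly generated text.
input_len = inputs["input_ids"].shape[-1]
generated_text = processor.batch_decode(
    generated_ids[:, input_len:], skip_special_tokens=True
)[0]
print(generated_text)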

To run the model on a single T4 GPU in 4-bit:

from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch
import requests
from PIL import Image
from io import BytesIO

# URL of the image
url = "https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/examples/mario.png"
# Use requests to get the image
response = requests.get(url)

# Use BytesIO to convert the response content to a file-like object
image_file = BytesIO(response.content)

# Use PIL to open the image file
image = Image.open(image_file)
text = "Describe this image and tell me about these characters"

device_index = torch.cuda.current_device()
print("device_index:",device_index)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/Paligemma-hindi-chat-v1.0",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/Paligemma-hindi-chat-v1.0")

inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_NEW_TOKENS = 200
# Autoregressively generate with sampling (temperature plus a repetition penalty);
# for other decoding strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=2.0,
)

# Turn the predicted token IDs back into a string with batch_decode.
# We chop off the prompt, which consists of the image tokens and our text prompt.
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
# The extra 2 covers the special tokens around the prompt (e.g. BOS and the newline separator).
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
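
If plain 4-bit quality is not good enough, bitsandbytes also supports NF4 quantization with an explicit compute dtype. A sketch of that configuration, as a drop-in replacement for the quantization_config above:

import torch
from transformers import BitsAndBytesConfig

# NF4 typically preserves more quality than the default FP4 scheme.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)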