---
library_name: transformers
license: gemma
datasets:
- damerajee/hindi_VQA
language:
- en
- hi
pipeline_tag: image-text-to-text
---

# Model Details

This model was fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) to handle an input image together with bilingual (Hindi and English) text sequences for both input and output.

# Training Details

* Model Configuration: fine-tuned for two epochs on a single V100 GPU.
* Training Duration: roughly half a day.
* Evaluation Loss: 1.32 at the end of training.

# Dataset

**To find out more about the dataset, please head to** [damerajee/hindi_VQA](https://huggingface.co/datasets/damerajee/hindi_VQA)

# Example output

| Image | Example |
|-------|---------|
| | **Describe this image in great details answer in hindi**<br>चित्र में एक व्यक्ति को हरे-भरा मैदान के बीचोबीच, पानी का सिर पर छाता पकड़े हुए देखा गया है। उनका शरीर ऊपर उठा हुआ दिखाई देता था जो उन्हें नीचे की ओर खींचती हुई प्रकृति से बचाया जा रहा होता होगा ।जमीन नदी जैसा तालाब या झील जैसे आकार और रंगों द्वारा बनाई गई थी. ये पेड़ अपनी गहरी शाखाओं वाला संतुलित ढेर हैं क्योंकि वे इमारatsकी मरम्मत करते समय भी अपना स्थान संरक्षित कर रहे थे - कुछ बाहर निकले गए तो अन्य भागे रहते ही वापस फेंके जाते देख सकते होंगे! पृष्ठष्ठीय प्रकाश व्यवस्था ने दृश्य समग्र रूप प्रदान किया है: नीली चमत्कारी आसमान जिससे बादलों तक पहुंचने लगकर भूखंजन बन जाता सूर्य वर्गी |
| | **What is the scenery of this image?**<br>Here we can see an animation of a forest. In the foreground there are rocks, trees and plants on ground with green leaves. Behind it's looking like buildings or houses which is made up of bricks at top right side corner of picture where light colored something seems happening over here such as lighting effect might be present under these objects where tree trunk appears yellow color from bottom towards middle its looks redish orange colour while other branches appear yellowish brown all around them after that. |

# How to Use

```python
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
```

### To Run the model on a single T4 GPU in Float16

```python
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
import torch
import requests
from PIL import Image
from io import BytesIO

# URL of the image
url = "https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/examples/astronaut.png"

# Download the image and open it with PIL
response = requests.get(url)
image_file = BytesIO(response.content)
image = Image.open(image_file)

text = "What is this image about answer in hindi "

device_index = torch.cuda.current_device()
print("device_index:", device_index)

base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/Paligemma-hindi-chat-v1.0",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/Paligemma-hindi-chat-v1.0")

inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate.
# We use sampling here; for other decoding strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# Next we turn the predicted token IDs back into a string with batch_decode.
# We chop off the prompt, which consists of the image tokens and our text prompt.
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
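If you would rather not count image and text tokens by hand, a shorter way to drop the prompt is to slice the generated sequence by the length of `input_ids`, since `generate` returns the prompt followed by the newly generated tokens. The snippet below is a minimal sketch, not part of the original example; it reuses `inputs`, `generated_ids`, and `processor` from the code above.

```python
# Slice off the prompt (image tokens + text tokens) by its total length
# instead of counting the pieces separately, then decode only the new tokens.
prompt_len = inputs["input_ids"].shape[-1]
new_tokens = generated_ids[0][prompt_len:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```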
### To Run the model on a single T4 GPU in 4-bit

```python
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch
import requests
from PIL import Image
from io import BytesIO

# URL of the image
url = "https://huggingface.co/Tensoic/Cerule-v0.1/resolve/main/examples/mario.png"

# Download the image and open it with PIL
response = requests.get(url)
image_file = BytesIO(response.content)
image = Image.open(image_file)

text = "Describe this image and tell me about these character"

device_index = torch.cuda.current_device()
print("device_index:", device_index)

# Load the model in 4-bit with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/Paligemma-hindi-chat-v1.0",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/Paligemma-hindi-chat-v1.0")

inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)

MAX_LENGTH = 200

# Autoregressively generate.
# We use sampling here; for other decoding strategies see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(
    **inputs,
    max_new_tokens=MAX_LENGTH,
    temperature=0.7,
    repetition_penalty=2.0,
    do_sample=True,
)

# Next we turn the predicted token IDs back into a string with batch_decode.
# We chop off the prompt, which consists of the image tokens and our text prompt.
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
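For repeated queries, it can be convenient to wrap the preprocess/generate/decode steps in a small helper. The function below is a sketch rather than part of the original card: it assumes `base_model`, `processor`, and `image` from either snippet above, keeps the same sampling settings, and trims the prompt by its token length before decoding.

```python
def ask(image, question, max_new_tokens=200):
    """Run one image/question pair through the fine-tuned model and return the answer text."""
    inputs = processor(text=question, images=image, return_tensors="pt").to("cuda")
    prompt_len = inputs["input_ids"].shape[-1]
    output_ids = base_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=2.0,
    )
    # generate returns prompt + completion; keep only the newly generated tokens
    return processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)

# The model is bilingual, so the same image can be queried in Hindi or English:
print(ask(image, "इस चित्र में क्या हो रहा है?"))
print(ask(image, "What is happening in this image?"))
```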