---
pipeline_tag: image-to-text
tags:
- image-captioning
languages:
- en
license: bsd-3-clause
widget:
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
datasets:
- unography/laion-14k-GPT4V-LIVIS-Captions
inference:
  parameters:
    max_length: 300
---

# LongCap: Finetuned [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base) for generating long captions of images, suitable for prompts for text-to-image generation and for captioning text-to-image datasets

## Usage

You can use this model for conditional and unconditional image captioning.

### Using the PyTorch model

#### Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

# load an example image
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# preprocess the image and generate a long caption
inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a beach setting with a woman kneeling down and interacting with a dog. the woman is wearing a collar and is standing near the dog. the dog is positioned on the sand, and the atmosphere is calm and relaxing. there are no other people or animals in the image.
```
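The snippet above generates an unconditional caption. Since the base BLIP model also supports conditional captioning, you can in principle pass a text prefix to the processor to steer the caption. The sketch below continues from the CPU snippet above and follows the base BLIP API; the prefix `"a photograph of"` is only an illustrative choice, and caption quality with prefixes has not been verified for this finetune.

```python
# conditional captioning: prepend a text prefix (illustrative choice) to steer the caption.
# this follows the base BLIP API; results for this finetune may differ from the example above.
text = "a photograph of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
```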
#### Running the model on GPU

##### In full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a beach setting with a woman kneeling down and interacting with a dog. the woman is wearing a collar and is standing near the dog. the dog is positioned on the sand, and the atmosphere is calm and relaxing. there are no other people or animals in the image.
```
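The hosted inference widget uses `max_length=300` (see the metadata above), while the snippets here use 250. If you want longer or less repetitive captions, you can pass standard `generate()` arguments. The sketch below continues from the full-precision snippet above; the specific values are illustrative, not tuned recommendations.

```python
# illustrative generation settings, continuing from the snippet above;
# num_beams and repetition_penalty are standard generate() arguments, values untuned
out = model.generate(
    pixel_values=pixel_values,
    max_length=300,          # matches the hosted inference widget
    num_beams=3,             # beam search instead of greedy decoding
    repetition_penalty=1.1,  # may reduce repeated phrases in long captions
)
print(processor.decode(out[0], skip_special_tokens=True))
```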
##### In half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values

out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a beach setting with a woman kneeling down and interacting with a dog. the woman is wearing a collar and is standing near the dog. the dog is positioned on the sand, and the atmosphere is calm and relaxing. there are no other people or animals in the image.
```
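Since the model is intended for captioning text-to-image datasets, you may want to caption images in batches. Below is a minimal sketch, assuming the half-precision GPU setup above; the sample URLs come from the widget metadata, and the batch size and dtype are illustrative.

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

# sample images from the widget metadata; replace with your own dataset
urls = [
    "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg",
    "https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw).convert("RGB") for u in urls]

# the processor accepts a list of images and stacks them into one batch
inputs = processor(images=images, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(pixel_values=inputs.pixel_values, max_length=250)
for caption in processor.batch_decode(out, skip_special_tokens=True):
    print(caption)
```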