|
--- |
|
language: |
|
- en |
|
base_model: |
|
- Salesforce/blip-image-captioning-base |
|
pipeline_tag: image-to-text |
|
tags: |
|
- art |
|
license: apache-2.0 |
|
metrics: |
|
- bleu |
|
library_name: transformers |
|
datasets: |
|
- phiyodr/coco2017 |
|
--- |
|
### Fine-Tuned Image Captioning Model |
|
|
|
This is a fine-tuned version of BLIP for image captioning on retail product images. The model was fine-tuned on a custom dataset of images from an online retail platform, annotated with product descriptions.
|
|
|
This experimental model can be used to generate product descriptions from images in the retail domain. Example use cases include product metadata enrichment and validation of human-written product descriptions.
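As an illustrative sketch of the description-validation use case, the snippet below compares a generated caption against a human-written description using a simple string-similarity ratio. The `flag_mismatch` helper, the `product.jpg` path, and the 0.4 threshold are assumptions for demonstration, not part of the released model; a production setup would likely use a stronger semantic-similarity measure.

```python
# Illustrative sketch of the description-validation use case.
# flag_mismatch, product.jpg, and the 0.4 threshold are demonstration assumptions.
from difflib import SequenceMatcher

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")
model = BlipForConditionalGeneration.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")

def caption(image: Image.Image) -> str:
    # Unconditional captioning: no text prompt, the model describes the image
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def flag_mismatch(image: Image.Image, human_description: str, threshold: float = 0.4) -> bool:
    # Crude proxy: character-level similarity between generated and human text
    generated = caption(image)
    ratio = SequenceMatcher(None, generated.lower(), human_description.lower()).ratio()
    return ratio < threshold

# Example: flag a product whose listed description diverges from what the model sees
image = Image.open("product.jpg").convert("RGB")
print(flag_mismatch(image, "kitchenaid artisan stand mixer"))
```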
|
|
|
|
|
|
|
### Sample model predictions
|
|
|
| Input Image | Prediction |
|-------------------------------------------|--------------------------------|
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/KTnUTaTjrIG7dUyR1aMho.png" alt="image/png" width="100" height="100" /> | kitchenaid artisann stand mixer |
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/Skt_sjYxbfQu056v2C1Ym.png" width="100" height="100" /> | a bottle of milk sitting on a counter |
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/Zp1OMzO4BEs7s9k3O5ij7.jpeg" alt="image/jpeg" width="100" height="100" /> | dove sensitive skin lotion |
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/dYNo38En0M0WpKONS8StX.jpeg" alt="bread bag" width="100" height="100" /> | bread bag with blue plastic handl |
|<img src="https://cdn-uploads.huggingface.co/production/uploads/672d17c98e098bf429c83670/oypT9482ysQjC0usEHGbT.png" alt="image/png" width="100" height="100" /> | bush ' s best white beans |
|
|
|
|
|
|
|
|
|
### How to use the model
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
```python |
|
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the fine-tuned processor and model from the Hugging Face Hub
processor = BlipProcessor.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")
model = BlipForConditionalGeneration.from_pretrained("quadranttechnologies/qhub-blip-image-captioning-finetuned")

# Fetch an example image
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the text prompt steers the caption
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
|
|
|
``` |
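Generation can be tuned with the standard `generate` arguments from `transformers`; the values below are illustrative rather than the settings used during fine-tuning:

```python
# Beam search with a length cap; illustrative values, not tuned for this model
out = model.generate(**inputs, num_beams=3, max_new_tokens=30, repetition_penalty=1.2)
print(processor.decode(out[0], skip_special_tokens=True))
```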
|
|
|
</details> |
|
|
|
## BibTeX and citation info
|
|
|
``` |
|
@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
|
``` |