---
pipeline_tag: image-to-text
tags:
- image-captioning
language:
- en
license: bsd-3-clause
widget:
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: >-
    https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
datasets:
- unography/laion-14k-GPT4V-LIVIS-Captions
inference:
  parameters:
    max_length: 300
---

# LongCap: Fine-tuned [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-large) for generating long image captions, suitable as prompts for text-to-image generation and for captioning text-to-image datasets


## Usage

You can use this model for conditional and unconditional image captioning; a conditional sketch follows the CPU example below.

### Using the PyTorch model

#### Running the model on CPU

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and the model fine-tuned for long captions
processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap")

# Fetch a demo image
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image and generate a caption of up to 250 tokens
inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
```
</details>
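
The example above generates an unconditional caption. For conditional captioning, BLIP also accepts a text prefix that the generated caption continues. A minimal CPU sketch using the same demo image (the prefix string here is only an illustration, not a required value):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional captioning: the model continues this text prefix.
# The prefix below is just an example; any short phrase works.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
```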

#### Running the model on GPU

##### In full precision 

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
# Move the model to the GPU
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Inputs must live on the same device as the model
inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
```
</details>

##### In half precision (`float16`)

<details>
<summary> Click to expand </summary>

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
# Load the weights in float16 to halve memory use, then move to the GPU
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Cast the pixel values to float16 to match the model weights
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
```
</details>
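
### Using the pipeline API

Since the model is tagged `image-to-text`, it should also load through the `transformers` pipeline interface, which wraps the processor and model calls shown above. A minimal sketch, assuming the standard `image-to-text` pipeline behavior (not taken from this card's upstream documentation):

```python
from transformers import pipeline

# The pipeline bundles the BlipProcessor and BlipForConditionalGeneration internally
captioner = pipeline("image-to-text", model="unography/blip-large-long-cap")

# The pipeline accepts a URL, a local path, or a PIL.Image directly
result = captioner(
    "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg",
    max_new_tokens=250,  # leave headroom for long captions, as with max_length above
)
print(result[0]["generated_text"])
```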