---
library_name: transformers
license: cc-by-nc-4.0
tags:
- xlm-roberta
- eva02
- clip
- feature-extraction
- sentence-similarity
- retrieval
- multimodal
- multi-modal
- crossmodal
- cross-modal
- mteb
- clip-benchmark
- vidore
- transformers
- sentence-transformers
- onnx
- safetensors
- transformers.js
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - no
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh
inference: false
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

<p align="center">
<b>Jina CLIP v2: Multilingual Multimodal Embeddings for Texts and Images</b>
</p>


## Quick Start

[Blog]() | [Azure](https://azuremarketplace.microsoft.com/en-gb/marketplace/apps/jinaai.jina-clip-v2-vm?tab=Overview) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-bfbctuqmky676) | [Google Cloud Platform](https://console.cloud.google.com/marketplace/browse?hl=en&inv=1&invt=AbiD-g&q=jina) | [API](https://jina.ai/embeddings)


## Intended Usage & Model Info

`jina-clip-v2` is a state-of-the-art **multilingual and multimodal (text-image) embedding model**. It is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:

* *support for multiple languages* - the text tower is trained on 89 languages with tuning focus on *Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,* and *Vietnamese.*
* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which allows the output vectors to be truncated and consequently reduces computation and storage costs.
* *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This feature, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. As a result, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
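
As a rough illustration of what embedding truncation means in practice, the sketch below keeps the leading components of each vector and re-normalizes them. The helper name `truncate_embeddings` and the random stand-in vectors are assumptions made for this example; they are not part of the model's API, and the model's own `truncate_dim` argument may handle normalization differently.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize the result.

    Hypothetical helper for illustration only, not part of jina-clip-v2.
    """
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return truncated / np.clip(norms, 1e-12, None)


# Random stand-ins for full 1024-dim embeddings (assumption: real vectors
# would come from the model's text or image tower)
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_embeddings(full, 512)
print(small.shape)
```

Halving the dimension halves the storage per vector and the cost of each dot product, which is the trade-off Matryoshka-style training is meant to make tolerable.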

Like its predecessor, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Using a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.


## Data & Parameters

[Check out our paper](https://arxiv.org/abs/2405.20204). Updated technical report for v2 coming soon!


## Usage

1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
2. Alternatively, you can use the model directly via the `transformers` or `sentence-transformers` packages.

```python
# !pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

# Corpus
sentences = [
    "طاهٍ يطبخ المعكرونة في المطبخ", # Arabic
    "厨师在厨房煮意大利面", # Chinese
    "Un chef qui cuisine des pâtes dans la cuisine", # French
    "Ein Koch, der in der Küche Pasta kocht", # German
    "Ένας σεφ μαγειρεύει ζυμαρικά στην κουζίνα", # Greek
    "एक शेफ रसोई में पास्ता पका रहा है", # Hindi
    "Uno chef che cucina la pasta in cucina", # Italian
    "シェフがキッチンでパスタを作っている", # Japanese
    "셰프가 주방에서 파스타를 요리하고 있다", # Korean
]


# Public image URLs or PIL images
image_urls = ["https://i.ibb.co/bRGGJxD/DALL-E-2024-11-20-13-44-46-A-highly-realistic-8-K-photographic-image-of-a-chef-cooking-pasta-in-a-mo.webp"]
# Choose a matryoshka dimension, set to None to get the full 1024-dim vectors
truncate_dim = 512

# Encode text and images
text_embeddings = model.encode_text(sentences, truncate_dim=truncate_dim)
image_embeddings = model.encode_image(
    image_urls, truncate_dim=truncate_dim
)  # also accepts PIL images, local filenames, and data URIs

# Encode query text
query = "A chef cooking pasta in the kitchen" # English
query_embeddings = model.encode_text(
    query, task="retrieval.query", truncate_dim=truncate_dim
)

# text to image
print("En -> Img: " + str(query_embeddings @ image_embeddings[0].T))
# text to text
print("En -> Ar: " + str(query_embeddings @ text_embeddings[0].T))
print("En -> Zh: " + str(query_embeddings @ text_embeddings[1].T))
print("En -> Fr: " + str(query_embeddings @ text_embeddings[2].T))
print("En -> De: " + str(query_embeddings @ text_embeddings[3].T))
print("En -> Gr: " + str(query_embeddings @ text_embeddings[4].T))
print("En -> Hi: " + str(query_embeddings @ text_embeddings[5].T))
print("En -> It: " + str(query_embeddings @ text_embeddings[6].T))
print("En -> Jp: " + str(query_embeddings @ text_embeddings[7].T))
print("En -> Ko: " + str(query_embeddings @ text_embeddings[8].T))
```

or via sentence-transformers:

```python
# !pip install sentence-transformers einops timm pillow
from sentence_transformers import SentenceTransformer

# Initialize the model
truncate_dim = 512
model = SentenceTransformer(
    "jinaai/jina-clip-v2", trust_remote_code=True, truncate_dim=truncate_dim
)

# Corpus
sentences = [
    "طاهٍ يطبخ المعكرونة في المطبخ", # Arabic
    "厨师在厨房煮意大利面", # Chinese
    "Un chef qui cuisine des pâtes dans la cuisine", # French
    "Ein Koch, der in der Küche Pasta kocht", # German
    "Ένας σεφ μαγειρεύει ζυμαρικά στην κουζίνα", # Greek
    "एक शेफ रसोई में पास्ता पका रहा है", # Hindi
    "Uno chef che cucina la pasta in cucina", # Italian
    "シェフがキッチンでパスタを作っている", # Japanese
    "셰프가 주방에서 파스타를 요리하고 있다", # Korean
]

# Public image URLs or PIL images
image_urls = ["https://i.ibb.co/bRGGJxD/DALL-E-2024-11-20-13-44-46-A-highly-realistic-8-K-photographic-image-of-a-chef-cooking-pasta-in-a-mo.webp"]

text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)
query = "A chef cooking pasta in the kitchen" # English
query_embeddings = model.encode(query)
```
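
Once query and corpus embeddings are available, retrieval reduces to ranking the corpus by similarity to the query. The following is a minimal sketch of that step using cosine similarity; the function `rank_by_cosine` and the tiny 2-dimensional vectors are hypothetical stand-ins chosen for readability, not output of the model.

```python
import numpy as np


def rank_by_cosine(query: np.ndarray, corpus: np.ndarray) -> list[tuple[int, float]]:
    """Return (corpus index, cosine similarity) pairs, best match first.

    Hypothetical helper for illustration only.
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)  # descending similarity
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for real embeddings (assumption)
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.1])

ranking = rank_by_cosine(query, corpus)
for index, score in ranking:
    print(f"corpus[{index}]: {score:.4f}")
```

With real embeddings, `corpus` could hold text and image vectors side by side, since both towers share one vector space.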


## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.


## Citation

If you find `jina-clip-v2` useful in your research, please cite the following paper:

```bibtex
@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}
```