gmastrapas committed
Commit 44077eb
Parent: 25ca911
docs: minor README fixes
README.md CHANGED
@@ -148,17 +148,17 @@ Multimodal embeddings enable searching and understanding data across different m
 Built upon [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) and our recently released [`jina-embeddings-v3`](https://huggingface.co/jinaai/jina-embeddings-v3), `jina-clip-v2` features several significant improvements:
 
 * **Improved Performance**: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model `jina-embeddings-v3` (currently the best multilingual embeddings under 1B parameters on MTEB).
-* **Multilingual Support**:
+* **Multilingual Support**: Using the same backbone as `jina-embeddings-v3` for the text tower, `jina-clip-v2` supports 89 languages for multilingual-image retrieval, showing up to 4% improvement compared to `nllb-clip-large-siglip` on multilingual image retrieval tasks.
 * **Higher Image Resolution**: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
 * **Matryoshka Representations**: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.
 
 Measuring 0.9B parameters, `jina-clip-v2` combines two powerful encoders:
-* the text encoder `
+* the text encoder `Jina-XLM-RoBERTa` (the backbone of `jina-embeddings-v3`) and
 * the vision encoder `EVA02-L14` (an efficient vision Transformer developed by BAAI).
 
 | FEATURE | TEXT ENCODER | IMAGE ENCODER |
 |-----------------------|-------------------------|------------------|
-| Base Model | Jina
+| Base Model | Jina-XLM-RoBERTa | EVA02-L |
 | Parameters | 561M | 304M |
 | Input Specification | 8,192 tokens (max) | 512×512 pixels |
 | Min Output Dimensions | 64 | 64 |
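The **Matryoshka Representations** bullet in the hunk above is easy to make concrete. A minimal sketch, assuming the model is loaded via `sentence-transformers` as in the README's usage examples; the `truncate_dim` argument and the manual re-normalization step are illustrative and not part of this commit:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Ask sentence-transformers to keep only the first 64 of the 1024
# output dimensions (Matryoshka truncation at load time).
model = SentenceTransformer(
    'jinaai/jina-clip-v2', trust_remote_code=True, truncate_dim=64
)
embeddings = model.encode(['beautiful sunset over the beach'])
print(embeddings.shape)  # (1, 64)

# Equivalent manual route: encode at full size, slice off the first 64
# dimensions, then re-normalize so cosine similarities stay comparable.
full_model = SentenceTransformer('jinaai/jina-clip-v2', trust_remote_code=True)
full = full_model.encode(['beautiful sunset over the beach'])
truncated = full[:, :64]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
```

Any output size between 64 and 1024 works the same way; smaller vectors trade a little retrieval accuracy for proportionally less storage and faster similarity search.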
@@ -330,12 +330,16 @@ sentences = [
 image_urls = ['https://i.ibb.co/nQNGqL0/beach1.jpg', 'https://i.ibb.co/r5w8hG8/beach2.jpg']
 
 # Encode text and images
-text_embeddings = model.encode(sentences)
-image_embeddings = model.encode(
+text_embeddings = model.encode(sentences, normalize_embeddings=True)
+image_embeddings = model.encode(
+    image_urls, normalize_embeddings=True
+) # also accepts PIL.Image.Image, local filenames, dataURI
 
 # Encode query text
 query = 'beautiful sunset over the beach' # English
-query_embeddings = model.encode(
+query_embeddings = model.encode(
+    query, prompt_name='retrieval.query', normalize_embeddings=True
+)
 ```
 </details>
 
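Because both `encode` calls above pass `normalize_embeddings=True`, the returned vectors have unit length and ranking the two images against the query reduces to a dot product. A short continuation of the snippet above, reusing its `query_embeddings`, `image_embeddings`, and `image_urls`; this ranking code is an illustration, not part of the commit:

```python
import numpy as np

# encode() returns a 1-D vector for a single string and a 2-D matrix for
# a list of inputs, so the product below yields one score per image.
# With unit-length embeddings, dot product equals cosine similarity.
scores = image_embeddings @ query_embeddings  # shape: (2,)
best = int(np.argmax(scores))
print(f'best match: {image_urls[best]} (score={scores[best]:.4f})')
```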
@@ -388,7 +392,7 @@ _, _, text_embeddings, image_embeddings = output
 
 ## License
 
-
+This model is licensed to download and run under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). It is available for commercial use via the [Jina Embeddings API](https://jina.ai/embeddings/), [AWS](https://aws.amazon.com/marketplace/pp/prodview-bfbctuqmky676), [Azure](https://azuremarketplace.microsoft.com/en-gb/marketplace/apps/jinaai.jina-clip-v2-vm?tab=Overview), and [GCP](https://console.cloud.google.com/marketplace/browse?hl=en&inv=1&invt=AbiFWQ&q=jina). To download for commercial use, please [contact us](https://jina.ai/contact-sales).
 
 
 ## Contact