gmastrapas committed
Commit 25ca911 · Parent(s): 4c3db9b
feat: update intro in README

README.md CHANGED

@@ -141,17 +141,41 @@ inference: false

## Intended Usage & Model Info

- `jina-clip-v2` is a
- * *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and consequently computation and storage costs.
- * *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`) the image tower can now capture finer visual details. This feature along with a more diverse training set enable the model to perform much better on visual document retrieval tasks. Due to this `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
- This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

An updated version of our [technical report](https://arxiv.org/abs/2405.20204) with details on `jina-clip-v2` is coming soon. Stay tuned!

## Intended Usage & Model Info

+ `jina-clip-v2` is a **general-purpose multilingual and multimodal (text & image) embedding model**.

+ Multimodal embeddings enable searching and understanding data across different modalities through a coherent representation. They serve as the backbone of neural information retrieval and multimodal GenAI applications.
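
To make that concrete: because texts and images are mapped into one shared vector space, cross-modal search reduces to nearest-neighbour lookup over embedding vectors. The sketch below is purely illustrative, using made-up vectors and plain NumPy rather than any call into this model:

```python
import numpy as np

# Illustrative only: stand-in vectors for one text query and three images,
# all assumed to live in the same (here 4-dimensional) embedding space.
query_embedding = np.array([0.1, 0.8, 0.2, 0.5])
image_embeddings = np.array([
    [0.1, 0.7, 0.3, 0.4],   # image 0
    [0.9, 0.0, 0.1, 0.2],   # image 1
    [0.2, 0.8, 0.1, 0.6],   # image 2
])

def cosine_similarity(query, candidates):
    # Cosine similarity between a vector and each row of a matrix.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ query

scores = cosine_similarity(query_embedding, image_embeddings)
best = int(np.argmax(scores))
print(f"best match: image {best}, scores: {scores.round(3)}")
```
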
+ Built upon [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) and our recently released [`jina-embeddings-v3`](https://huggingface.co/jinaai/jina-embeddings-v3), `jina-clip-v2` features several significant improvements:
+ * **Improved Performance**: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model `jina-embeddings-v3` (currently the best multilingual embeddings under 1B parameters on MTEB).
+ * **Multilingual Support**: Powered by `jina-embeddings-v3` as the text tower, `jina-clip-v2` supports 89 languages and shows up to 4% improvement over `nllb-clip-large-siglip` on multilingual image retrieval tasks.
+ * **Higher Image Resolution**: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
+ * **Matryoshka Representations**: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.
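
Mechanically, the Matryoshka truncation in the last bullet is just slicing the leading components of an output vector and re-normalizing. A minimal sketch with placeholder 1024-dimensional vectors (no model call involved; with real Matryoshka-trained embeddings, similarity rankings are largely preserved at the smaller dimensions):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding, then re-normalize."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Placeholder 1024-d vectors standing in for two jina-clip-v2 output embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 1024))

for dim in (1024, 512, 256, 128, 64):
    a_d, b_d = truncate_embedding(a, dim), truncate_embedding(b, dim)
    # Cosine similarity computed on the truncated, re-normalized vectors.
    print(f"dim={dim:4d}  cosine={float(a_d @ b_d):+.3f}")
```
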
+ Measuring 0.9B parameters, `jina-clip-v2` combines two powerful encoders:
+ * the text encoder `jina-XLM-RoBERTa` (the backbone of `jina-embeddings-v3`) and
+ * the vision encoder `EVA02-L14` (an efficient vision Transformer developed by BAAI).

+ | FEATURE               | TEXT ENCODER           | IMAGE ENCODER    |
+ |-----------------------|------------------------|------------------|
+ | Base Model            | Jina XLM-RoBERTa       | EVA02-L          |
+ | Parameters            | 561M                   | 304M             |
+ | Input Specification   | 8,192 tokens (max)     | 512×512 pixels   |
+ | Min Output Dimensions | 64                     | 64               |
+ | Max Output Dimensions | 1,024                  | 1,024            |
+ | Layers                | 24                     | 24               |
+ | Attention Mechanism   | FlashAttention2        | xFormers         |
+ | Pooling Strategy      | Mean pooling           | CLS pooling      |
+ | Additional Features   | 89 languages supported | Patch size 14×14 |

+ These encoders are jointly trained to create aligned representations of images and text.
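
A minimal usage sketch follows. It assumes the custom modeling code loaded via `trust_remote_code=True` exposes `encode_text` / `encode_image` helpers with a Matryoshka `truncate_dim` argument and returns L2-normalized NumPy arrays; the image URL is a placeholder, and the exact signatures should be checked against the usage examples elsewhere in this README:

```python
from transformers import AutoModel

# Assumed API: the repository's custom modeling code, loaded with
# trust_remote_code=True, provides encode_text / encode_image helpers.
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

sentences = ["A photo of a corgi on the beach", "Ein Foto von einem Corgi am Strand"]
image_urls = ["https://example.com/corgi-beach.jpg"]  # placeholder URL

# truncate_dim is assumed to select a Matryoshka dimension between 64 and 1024.
text_embeddings = model.encode_text(sentences, truncate_dim=512)
image_embeddings = model.encode_image(image_urls, truncate_dim=512)

# If the embeddings are L2-normalized, the dot product equals cosine similarity.
print(text_embeddings @ image_embeddings.T)
```
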
+ CLIP-like models have established themselves as the backbone for general-purpose multimodal applications. With `jina-clip-v2`, we're taking these capabilities to the next level, breaking down language barriers to deliver more accurate cross-modal understanding and retrieval. We're confident this release delivers on our promise to make multimodal search and retrieval both more powerful and more accessible to developers worldwide.

+ ## Training, Data, Parameters

An updated version of our [technical report](https://arxiv.org/abs/2405.20204) with details on `jina-clip-v2` is coming soon. Stay tuned!