gmastrapas committed on
Commit 25ca911
Parent: 4c3db9b

feat: update intro in README

Files changed (1): README.md (+31 -7)
README.md CHANGED
@@ -141,17 +141,41 @@ inference: false

## Intended Usage & Model Info

- `jina-clip-v2` is a state-of-the-art **multilingual and multimodal (text-image) embedding model**. It is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:

- * *support for multiple languages* - the text tower is trained on 89 languages with tuning focus on *Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,* and *Vietnamese.*
- * *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which enables slicing the output vectors and consequently reducing computation and storage costs.
- * *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This feature, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. Because of this, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.

- Similar to our predecessor model, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
- This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

- ## Data, Parameters, Training
+ `jina-clip-v2` is a **general-purpose multilingual and multimodal (text & image) embedding model**.

+ Multimodal embeddings enable searching and understanding data across different modalities through a coherent representation. They serve as the backbone of neural information retrieval and multimodal GenAI applications.

+ Built upon [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) and our recently released [`jina-embeddings-v3`](https://huggingface.co/jinaai/jina-embeddings-v3), `jina-clip-v2` features several significant improvements:

+ * **Improved Performance**: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model `jina-embeddings-v3` (currently the best multilingual embeddings under 1B parameters on MTEB).
+ * **Multilingual Support**: Powered by `jina-embeddings-v3` as the text tower, `jina-clip-v2` supports 89 languages for multilingual image retrieval, showing up to 4% improvement compared to `nllb-clip-large-siglip` on multilingual image retrieval tasks.
+ * **Higher Image Resolution**: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
+ * **Matryoshka Representations**: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.

+ Measuring 0.9B parameters, `jina-clip-v2` combines two powerful encoders:
+ * the text encoder `jina-XLM-RoBERTa` (the backbone of `jina-embeddings-v3`) and
+ * the vision encoder `EVA02-L14` (an efficient vision Transformer developed by BAAI).
+
+ | FEATURE | TEXT ENCODER | IMAGE ENCODER |
+ |-----------------------|-------------------------|------------------|
+ | Base Model | Jina XLM-RoBERTa | EVA02-L |
+ | Parameters | 561M | 304M |
+ | Input Specification | 8,192 tokens (max) | 512×512 pixels |
+ | Min Output Dimensions | 64 | 64 |
+ | Max Output Dimensions | 1,024 | 1,024 |
+ | Layers | 24 | 24 |
+ | Attention Mechanism | FlashAttention2 | xFormers |
+ | Pooling Strategy | Mean pooling | CLS pooling |
+ | Additional Features | 89 languages supported | Patch size 14x14 |
+
+ These encoders are jointly trained to create aligned representations of images and text.
+
+ CLIP-like models have established themselves as the backbone for general-purpose multimodal applications. With `jina-clip-v2`, we're taking these capabilities to the next level, breaking down language barriers to deliver more accurate cross-modal understanding and retrieval. We're confident this release delivers on our promise of making multimodal search and retrieval both more powerful and more accessible to developers worldwide.
+
+ ## Training, Data, Parameters

  An updated version of our [technical report](https://arxiv.org/abs/2405.20204) with details on `jina-clip-v2` is coming soon. Stay tuned!
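
To make the new intro concrete, here is a minimal sketch of cross-modal retrieval in the single shared embedding space it describes. It is an illustration rather than the official usage snippet: it assumes the repository's custom code (loaded with `trust_remote_code=True`) exposes `encode_text` and `encode_image` helpers returning 1,024-dimensional vectors, as `jina-clip-v1` does, and the example texts and image path are placeholders.

```python
# Minimal sketch, not the canonical usage snippet from this model card.
# Assumption: the remote code exposes encode_text/encode_image helpers
# returning 1,024-dimensional embeddings, as jina-clip-v1 does.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

queries = ["a beach at sunset", "ein Strand bei Sonnenuntergang"]  # English + German query
images = ["path/to/beach_photo.jpg"]                               # hypothetical local image path

text_emb = np.asarray(model.encode_text(queries))    # expected shape: (2, 1024)
image_emb = np.asarray(model.encode_image(images))   # expected shape: (1, 1024)

# L2-normalize so that a dot product equals cosine similarity,
# then score the image against both queries in the shared space.
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
print(text_emb @ image_emb.T)  # cross-modal similarity scores
```

Because both towers embed into the same 1,024-dimensional space, the same dot product serves text-to-text and text-to-image comparisons.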
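
The Matryoshka Representations bullet can be illustrated without the model at all: truncating an embedding to its leading dimensions and re-normalizing is all that is required to trade a little accuracy for much smaller vectors. The random vectors below are stand-ins for real 1,024-dimensional model outputs.

```python
# Minimal sketch of Matryoshka-style truncation (1024 -> 64 dimensions).
import numpy as np

def truncate(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each vector and L2-normalize the result."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

full = np.random.randn(4, 1024)    # stand-in for text or image embeddings
small = truncate(full, 64)         # 16x fewer values to store and compare
print(small.shape)                 # (4, 64)
print(float(small[0] @ small[1]))  # cosine similarity in the truncated space
```

Since both towers are trained with Matryoshka Representation Learning, similarities computed on the truncated vectors stay close to those computed on the full 1,024 dimensions.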