gmastrapas committed
Commit 25ca911 · Parent(s): 4c3db9b
feat: update intro in README

README.md CHANGED

@@ -141,17 +141,41 @@ inference: false

## Intended Usage & Model Info

- `jina-clip-v2` is a
- * *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and consequently computation and storage costs.
- * *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`) the image tower can now capture finer visual details. This feature along with a more diverse training set enable the model to perform much better on visual document retrieval tasks. Due to this `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
- This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

An updated version of our [technical report](https://arxiv.org/abs/2405.20204) with details on `jina-clip-v2` is coming soon. Stay tuned!

## Intended Usage & Model Info

+ `jina-clip-v2` is a **general-purpose multilingual and multimodal (text & image) embedding model**.

+ Multimodal embeddings enable searching and understanding data across different modalities through a coherent representation. They serve as the backbone of neural information retrieval and multimodal GenAI applications.
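
To make that concrete: because texts and images are mapped into one shared vector space, cross-modal search reduces to nearest-neighbour lookup over embedding vectors. The sketch below is purely illustrative, using made-up vectors and plain NumPy rather than any call into this model:

```python
import numpy as np

# Illustrative only: stand-in vectors for one text query and three images,
# all assumed to live in the same (here 4-dimensional) embedding space.
query_embedding = np.array([0.1, 0.8, 0.2, 0.5])
image_embeddings = np.array([
    [0.1, 0.7, 0.3, 0.4],   # image 0
    [0.9, 0.0, 0.1, 0.2],   # image 1
    [0.2, 0.8, 0.1, 0.6],   # image 2
])

def cosine_similarity(query, candidates):
    # Cosine similarity between a vector and each row of a matrix.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    return candidates @ query

scores = cosine_similarity(query_embedding, image_embeddings)
best = int(np.argmax(scores))
print(f"best match: image {best}, scores: {scores.round(3)}")
```
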
+ Built upon [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) and our recently released [`jina-embeddings-v3`](https://huggingface.co/jinaai/jina-embeddings-v3), `jina-clip-v2` features several significant improvements:
+ * **Improved Performance**: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model `jina-embeddings-v3` (currently the best multilingual embeddings under 1B parameters on MTEB).
+ * **Multilingual Support**: Powered by `jina-embeddings-v3` as the text tower, `jina-clip-v2` supports 89 languages and shows up to 4% improvement over `nllb-clip-large-siglip` on multilingual image retrieval tasks.
+ * **Higher Image Resolution**: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
+ * **Matryoshka Representations**: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.
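
Mechanically, the Matryoshka truncation in the last bullet is just slicing the leading components of an output vector and re-normalizing. A minimal sketch with placeholder 1024-dimensional vectors (no model call involved; with real Matryoshka-trained embeddings, similarity rankings are largely preserved at the smaller dimensions):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding, then re-normalize."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Placeholder 1024-d vectors standing in for two jina-clip-v2 output embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 1024))

for dim in (1024, 512, 256, 128, 64):
    a_d, b_d = truncate_embedding(a, dim), truncate_embedding(b, dim)
    # Cosine similarity computed on the truncated, re-normalized vectors.
    print(f"dim={dim:4d}  cosine={float(a_d @ b_d):+.3f}")
```
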
+ Measuring 0.9B parameters, `jina-clip-v2` combines two powerful encoders:
+ * the text encoder `jina-XLM-RoBERTa` (the backbone of `jina-embeddings-v3`) and
+ * the vision encoder `EVA02-L14` (an efficient vision Transformer developed by BAAI).

+ | FEATURE               | TEXT ENCODER           | IMAGE ENCODER    |
+ |-----------------------|------------------------|------------------|
+ | Base Model            | Jina XLM-RoBERTa       | EVA02-L          |
+ | Parameters            | 561M                   | 304M             |
+ | Input Specification   | 8,192 tokens (max)     | 512×512 pixels   |
+ | Min Output Dimensions | 64                     | 64               |
+ | Max Output Dimensions | 1,024                  | 1,024            |
+ | Layers                | 24                     | 24               |
+ | Attention Mechanism   | FlashAttention2        | xFormers         |
+ | Pooling Strategy      | Mean pooling           | CLS pooling      |
+ | Additional Features   | 89 languages supported | Patch size 14×14 |

+ These encoders are jointly trained to create aligned representations of images and text.
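
A minimal usage sketch follows. It assumes the custom modeling code loaded via `trust_remote_code=True` exposes `encode_text` / `encode_image` helpers with a Matryoshka `truncate_dim` argument and returns L2-normalized NumPy arrays; the image URL is a placeholder, and the exact signatures should be checked against the usage examples elsewhere in this README:

```python
from transformers import AutoModel

# Assumed API: the repository's custom modeling code, loaded with
# trust_remote_code=True, provides encode_text / encode_image helpers.
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

sentences = ["A photo of a corgi on the beach", "Ein Foto von einem Corgi am Strand"]
image_urls = ["https://example.com/corgi-beach.jpg"]  # placeholder URL

# truncate_dim is assumed to select a Matryoshka dimension between 64 and 1024.
text_embeddings = model.encode_text(sentences, truncate_dim=512)
image_embeddings = model.encode_image(image_urls, truncate_dim=512)

# If the embeddings are L2-normalized, the dot product equals cosine similarity.
print(text_embeddings @ image_embeddings.T)
```
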
+ CLIP-like models have established themselves as the backbone for general-purpose multimodal applications. With `jina-clip-v2`, we're taking these capabilities to the next level, breaking down language barriers to deliver more accurate cross-modal understanding and retrieval. We're confident this release delivers on our promise to make multimodal search and retrieval both more powerful and more accessible to developers worldwide.

+ ## Training, Data, Parameters

An updated version of our [technical report](https://arxiv.org/abs/2405.20204) with details on `jina-clip-v2` is coming soon. Stay tuned!