jupyterjazz committed on
Commit 1ecacfa
1 Parent(s): 3252a4e

Update README.md

Files changed (1)
  1. README.md +20 -15
README.md CHANGED
@@ -123,31 +123,36 @@ library_name: transformers
  The easiest way to start using `jina-embeddings-v3` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).

- ## Intended Usage & Model Info

- `jina-embeddings-v3` is a multilingual **text embedding model** supporting an **8192-token sequence length**.
- It is based on an XLMRoBERTa architecture (JinaXLMRoBERTa) that supports Rotary Position Embeddings to allow longer sequence lengths.
- The backbone `JinaXLMRoBERTa` is pretrained on variable-length textual data with a Masked Language Modeling objective for 160k steps on 89 languages.
- The model is further trained on Jina AI's collection of more than 500 million multilingual sentence pairs and hard negatives.
- These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.

- `jina-embeddings-v3` has 5 task-specific LoRA adapters tuned on top of our backbone; add `task_type` as an additional parameter when using the model:

- TODO UPDATE THIS

- 1. **query**: Handles incoming user queries at search time.
- 2. **index**: Manages user documents submitted for indexing.
- 3. **text-matching**: Processes symmetric text similarity tasks, whether short or long, such as STS (Semantic Textual Similarity).
- 4. **classification**: Classifies user inputs into predefined categories.
- 5. **clustering**: Facilitates the clustering of embeddings for further analysis.

- `jina-embeddings-v3` supports Matryoshka representation learning. We recommend using an embedding size of 128 or higher (1024 provides optimal performance) for storing your embeddings.
+ ## Intended Usage & Model Info

+ `jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
+ Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation), this model supports [Rotary Position Embeddings (RoPE)](https://arxiv.org/abs/2104.09864) to handle long sequences of up to **8192 tokens**.
+ Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.

+ ### Key Features:
+ - **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
+ - **Task-Specific Embedding:** Customize embeddings through the `task_type` argument (see the sketch after this list) with the following options:
+   - `retrieval.query`: Query encoding for asymmetric retrieval tasks
+   - `retrieval.passage`: Passage encoding for asymmetric retrieval tasks
+   - `separation`: For clustering and re-ranking applications
+   - `classification`: For classification tasks
+   - `text-matching`: For measuring textual similarity
+ - **Matryoshka Embeddings:** Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), allowing you to truncate embeddings to fit your application.
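
+ As a quick illustration, here is a minimal sketch of selecting an adapter and truncating embeddings. It assumes the repository's custom code exposes an `encode` helper that accepts `task_type` as described above; see the Usage section below for the exact API.

+ ```python
+ from transformers import AutoModel
+ import numpy as np

+ # Loading the custom architecture requires trust_remote_code=True.
+ model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

+ # Assumption: the repo's custom code exposes encode(texts, task_type=...).
+ # Use the adapter matching each side of an asymmetric retrieval task.
+ query_embs = model.encode(["What are rotary position embeddings?"], task_type="retrieval.query")
+ doc_embs = model.encode(["RoPE rotates query and key vectors by position-dependent angles."], task_type="retrieval.passage")

+ # Matryoshka truncation: keep the first 128 dimensions, then re-normalize
+ # so that cosine / dot-product scoring still works on the shorter vectors.
+ q = np.asarray(query_embs)[:, :128]
+ q = q / np.linalg.norm(q, axis=1, keepdims=True)
+ ```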
 
+ ### Model Lineage:

+ `jina-embeddings-v3` builds upon the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, which was originally trained on 100 languages.
+ We extended its capabilities with an extra pretraining phase on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset, then contrastively fine-tuned it on 30 languages for enhanced performance in both monolingual and cross-lingual setups.

+ ### Supported Languages:
+ While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages to maximize performance:
+ **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
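
+ Since the fine-tuning was contrastive across these 30 languages, a quick cross-lingual check can be done with the symmetric `text-matching` adapter. This is a minimal sketch reusing the `model` and the assumed `encode` helper from the sketch above:

+ ```python
+ import numpy as np

+ # Encode an English/German translation pair with the symmetric adapter.
+ en, de = model.encode(
+     ["A beautiful sunset over the beach", "Ein wunderschöner Sonnenuntergang am Strand"],
+     task_type="text-matching",
+ )

+ # Cosine similarity; semantically equivalent sentences should score high.
+ print(float(np.dot(en, de) / (np.linalg.norm(en) * np.linalg.norm(de))))
+ ```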

  ## Data & Parameters

- coming soon.
+ The data and training details are described in this technical report (coming soon).

  ## Usage