bwang0911 committed on
Commit 7f19d36
1 Parent(s): afc9c17

Update README.md

Files changed (1)
  1. README.md +106 -23
README.md CHANGED
@@ -1063,53 +1063,136 @@ model-index:
  - type: f1
  value: 82.61420654584116
  ---
-
  <!-- TODO: add evaluation results here -->
  <br><br>

  <p align="center">
- <img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
  </p>


  <p align="center">
- <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>, <a href="https://github.com/jina-ai/finetuner"><b>Finetuner</b></a> team.</b>
  </p>


  ## Intended Usage & Model Info

- `jina-embeddings-v2-base-zh` is a Chinese/English bilingual text **embedding model** supporting **8192 sequence length**. Our model has the same architecture as `jina-embeddings-v2-base-en` and has 161 million parameters.
- We have designed it for high performance in cross-language applications and trained it specifically to support mixed Chinese-English input without bias.

- | Model | Language | Max Sequence Length | Dimension | Model Size |
- |:-----:|:-----:|:-----:|:-----:|:-----:|
- | [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) | English | 8192 | 768 | 0.27GB |
- | [jina-embeddings-v2-base-zh](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh) | Chinese and English | 8192 | 768 | 0.32GB |

- You can use the embedding model either via Jina AI's [Embedding platform](https://jina.ai/embeddings/), AWS SageMaker, or in your private deployments.

- ## Usage via the Jina Embedding API

- The following code snippet shows the usage of the Jina Embedding API:

  ```
- curl https://api.jina.ai/v1/embeddings \
-   -H "Content-Type: application/json" \
-   -H "Authorization: Bearer jina_xxxxxxx" \
-   -d '{
-     "input": ["你的输入可以是纯中文", "or purely in English", "or like mixture of 中文 and 英文"],
-     "model": "jina-embeddings-v2-base-zh"
-   }'
  ```

- Get your free API key at: https://jina.ai/embeddings/

- ## Open Source

- We will add more information about this model and open-source the full model in a few days!

  ## Contact

- Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

  - type: f1
  value: 82.61420654584116
  ---

  <!-- TODO: add evaluation results here -->
  <br><br>

  <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
  </p>


  <p align="center">
+ <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
  </p>


  ## Intended Usage & Model Info

+ `jina-embeddings-v2-base-zh` is a Chinese/English bilingual text **embedding model** supporting **8192 sequence length**.
+ It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence length.
+ We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed Chinese-English input without bias.
+
+ The embedding model was trained with a 512 sequence length, but it extrapolates to an 8k sequence length (or even longer) thanks to ALiBi.
+ This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.
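+
+ For intuition, here is a minimal sketch (not the model's actual implementation) of a symmetric, bidirectional ALiBi bias: each attention head adds a linear penalty proportional to the token distance |i - j|, so the bias is defined for any sequence length and no learned position embeddings are required. The slope schedule follows the geometric sequence from the ALiBi paper; the exact variant used in JinaBERT may differ.
+
+ ```python
+ # Illustrative sketch only: symmetric (bidirectional) ALiBi attention bias.
+ import torch
+
+ def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
+     # Per-head slopes follow the geometric schedule 2^(-8/num_heads), 2^(-16/num_heads), ...
+     slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
+     positions = torch.arange(seq_len)
+     distance = (positions[None, :] - positions[:, None]).abs()  # |i - j|
+     return -slopes[:, None, None] * distance[None, :, :]        # shape: (num_heads, seq_len, seq_len)
+
+ # The same formula applies at any length, which is why a model trained on 512-token
+ # sequences can still attend over much longer inputs at inference time.
+ print(alibi_bias(512, 12).shape)   # torch.Size([12, 512, 512])
+ print(alibi_bias(2048, 12).shape)  # torch.Size([12, 2048, 2048])
+ ```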
+
+ With a standard size of 161 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
+ Additionally, we provide the following embedding models:
+
+ - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
+ - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
+ - [`jina-embeddings-v2-base-zh`](): Chinese-English bilingual embeddings (soon) **(you are here)**.
+ - [`jina-embeddings-v2-base-de`](): German-English bilingual embeddings (soon).
+ - [`jina-embeddings-v2-base-es`](): Spanish-English bilingual embeddings (soon).
+
+ ## Data & Parameters
+
+ Please refer to the Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923) for details on the training data and model parameters.
+
+ ## Usage
+
+ **<details><summary>Please apply mean pooling when integrating the model.</summary>**
+ <p>

+ ### Why mean pooling?

+ `mean pooling` takes all token embeddings from the model output and averages them at the sentence/paragraph level.
+ It has been shown to be the most effective way to produce high-quality sentence embeddings.
+ We offer an `encode` function that handles this for you.

+ However, if you would like to do it without using the default `encode` function:

+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ def mean_pooling(model_output, attention_mask):
+     # Average the token embeddings, masking out padding positions.
+     token_embeddings = model_output[0]
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ sentences = ['How is the weather today?', 'What is the current weather like today?']
+
+ tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-small-en')
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)
+
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
  ```


+ </p>
+ </details>
+
+ You can use Jina Embedding models directly from the `transformers` package:
+
+ ```python
+ !pip install transformers
+ from transformers import AutoModel
+ from numpy.linalg import norm
+
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
+ embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
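+ The card above describes mixed Chinese-English input; the sketch below illustrates that use case. It assumes the `jina-embeddings-v2-base-zh` weights are published on the Hub (marked "soon" in the model list above) and expose the same `encode` interface as the English models:
+
+ ```python
+ # Sketch: comparing a Chinese and an English sentence with the bilingual model.
+ # Assumes jinaai/jina-embeddings-v2-base-zh is available and provides `encode`.
+ from transformers import AutoModel
+ from numpy.linalg import norm
+
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
+ embeddings = model.encode(['今天天气怎么样?', 'What is the weather like today?'])
+ print(cos_sim(embeddings[0], embeddings[1]))  # semantically equivalent sentences should score high
+ ```
+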
+ If you only want to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:
+
+ ```python
+ embeddings = model.encode(
+     ['Very long ... document'],
+     max_length=2048
+ )
  ```

+ ## Fully-managed Embeddings Service
+
+ Alternatively, you can use Jina AI's [Embedding platform](https://jina.ai/embeddings/) for fully-managed access to Jina Embeddings models.
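+
+ As a sketch, the hosted API can also be called over plain HTTPS, mirroring the `curl` call shown in the earlier revision of this card (the endpoint and payload below are taken from that snippet):
+
+ ```python
+ # Sketch of calling the hosted Embedding API; endpoint and payload follow the
+ # earlier curl example. Replace jina_xxxxxxx with a key from https://jina.ai/embeddings/.
+ import requests
+
+ resp = requests.post(
+     "https://api.jina.ai/v1/embeddings",
+     headers={"Authorization": "Bearer jina_xxxxxxx"},
+     json={
+         "input": ["你的输入可以是纯中文", "or purely in English", "or like mixture of 中文 and 英文"],
+         "model": "jina-embeddings-v2-base-zh",
+     },
+ )
+ print(resp.json())  # the service returns one embedding per input string
+ ```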
+
+ ## Use Jina Embeddings for RAG
+
+ According to the latest blog post from [LlamaIndex](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83):

+ > In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.

+ <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZP2RVejCZovF3FDCg-Bx3A.png" width="780px">
+
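+ As a rough illustration of the retrieval step in such a RAG setup (a sketch only, reusing the `transformers` usage shown above; the example snippets and the prompt/LLM step are placeholders):
+
+ ```python
+ # Sketch of the retrieval step for RAG, using the `encode` method shown above.
+ # The retrieved passages would then be placed into the LLM prompt (not shown).
+ import numpy as np
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
+
+ docs = [
+     "Jina Embeddings v2 supports sequences of up to 8192 tokens.",
+     "ALiBi replaces learned position embeddings with a linear attention bias.",
+     "Mean pooling averages token embeddings into one sentence embedding.",
+ ]
+ doc_vecs = np.array(model.encode(docs))
+ doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
+
+ query_vec = np.array(model.encode(["How long can the input sequence be?"]))[0]
+ query_vec /= np.linalg.norm(query_vec)
+
+ top_k = np.argsort(doc_vecs @ query_vec)[::-1][:2]  # rank passages by cosine similarity
+ context = "\n".join(docs[i] for i in top_k)
+ print(context)
+ ```
+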
+ ## Plans
+
+ 1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+ 2. Multimodal embedding models to enable multimodal RAG applications.
+ 3. High-performance rerankers.

  ## Contact

+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+
+ ## Citation
+
+ If you find Jina Embeddings useful in your research, please cite the following paper:
+
+ ```
+ @misc{günther2023jina,
+       title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
+       author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
+       year={2023},
+       eprint={2310.19923},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```