intfloat committed
Commit 1292863
1 Parent(s): 1881441

Update README.md
Files changed (1): README.md (+53 −3)
README.md CHANGED
@@ -6813,7 +6813,7 @@ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=Tru
 outputs = model(**batch_dict)
 embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

- # (Optionally) normalize embeddings
 embeddings = F.normalize(embeddings, p=2, dim=1)
 scores = (embeddings[:2] @ embeddings[2:].T) * 100
 print(scores.tolist())
@@ -6865,11 +6865,61 @@ For all labeled datasets, we only use their training sets for fine-tuning.

 For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).

- ## Benchmark Evaluation

 Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results
 on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).

 ## Citation

 If you find our paper or models helpful, please consider citing as follows:
@@ -6885,4 +6935,4 @@ If you find our paper or models helpful, please consider citing as follows:

 ## Limitations

- Long texts will be truncated to at most 512 tokens.
 
 outputs = model(**batch_dict)
 embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

+ # normalize embeddings
 embeddings = F.normalize(embeddings, p=2, dim=1)
 scores = (embeddings[:2] @ embeddings[2:].T) * 100
 print(scores.tolist())
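The pooling, normalization, and scoring steps in the snippet above can be sketched in plain NumPy (a minimal illustration with toy data, not the model card's actual code; `average_pool` here mirrors the helper the snippet assumes):

```python
import numpy as np

def average_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the sequence axis.
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

# Toy batch: 2 sequences, 3 tokens each, 4-dim hidden states.
hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])  # second token row is padding for seq 0

emb = average_pool(hidden, mask)
# L2-normalize so that dot products become cosine similarities.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
scores = (emb @ emb.T) * 100  # scaled cosine-similarity matrix
```

After normalization every row is a unit vector, so each entry of `scores` is a cosine similarity scaled to roughly the 0–100 range, which is why the diagonal is exactly 100.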
 

 For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).

+ ## Benchmark Results on [Mr. TyDi](https://arxiv.org/abs/2108.08787)
+
+ | Model | Avg MRR@10 | | ar | bn | en | fi | id | ja | ko | ru | sw | te | th |
+ |-----------------------|------------|---|------|------|------|------|------|------|------|------|------|------|------|
+ | BM25 | 33.3 | | 36.7 | 41.3 | 15.1 | 28.8 | 38.2 | 21.7 | 28.1 | 32.9 | 39.6 | 42.4 | 41.7 |
+ | mDPR | 16.7 | | 26.0 | 25.8 | 16.2 | 11.3 | 14.6 | 18.1 | 21.9 | 18.5 | 7.3 | 10.6 | 13.5 |
+ | BM25 + mDPR | 41.7 | | 49.1 | 53.5 | 28.4 | 36.5 | 45.5 | 35.5 | 36.2 | 42.7 | 40.5 | 42.0 | 49.2 |
+ | | | | | | | | | | | | | | |
+ | multilingual-e5-small | 64.4 | | 71.5 | 66.3 | 54.5 | 57.7 | 63.2 | 55.4 | 54.3 | 60.8 | 65.4 | 89.1 | 70.1 |
+ | multilingual-e5-base | 65.9 | | 72.3 | 65.0 | 58.5 | 60.8 | 64.9 | 56.6 | 55.8 | 62.7 | 69.0 | 86.6 | 72.7 |
+ | multilingual-e5-large | **70.5** | | 77.5 | 73.2 | 60.8 | 66.8 | 68.5 | 62.5 | 61.6 | 65.8 | 72.7 | 90.2 | 76.2 |
+
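MRR@10 (mean reciprocal rank with a cutoff of 10), the metric reported in the table above, can be sketched as follows; the helper name and toy data are ours, not part of the Mr. TyDi tooling:

```python
def mrr_at_10(ranked_relevance):
    # ranked_relevance: one list per query of 0/1 relevance flags,
    # in the order the system ranked the documents.
    total = 0.0
    for flags in ranked_relevance:
        for rank, rel in enumerate(flags[:10], start=1):
            if rel:
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break                # only the first relevant document counts
    return total / len(ranked_relevance)

# Two toy queries: first relevant hit at rank 1 and at rank 4.
print(mrr_at_10([[1, 0, 0], [0, 0, 0, 1, 0]]))  # (1/1 + 1/4) / 2 = 0.625
```

A query whose first relevant document falls outside the top 10 contributes 0, which is why the cutoff matters.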
+ ## MTEB Benchmark Evaluation

 Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results
 on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).

+ ## Support for Sentence Transformers
+
+ Below is an example of usage with `sentence_transformers`.
+ ```python
+ from sentence_transformers import SentenceTransformer
+ model = SentenceTransformer('intfloat/multilingual-e5-base')
+ input_texts = [
+     'query: how much protein should a female eat',
+     'query: 南瓜的家常做法',
+     "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
+     "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法:1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法:1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
+ ]
+ embeddings = model.encode(input_texts, normalize_embeddings=True)
+ ```
+
+ Package requirements:
+
+ `pip install sentence_transformers~=2.2.2`
+
+ Contributors: [michaelfeil](https://huggingface.co/michaelfeil)
+ ## FAQ
+
+ **1. Do I need to add the prefix "query: " and "passage: " to input texts?**
+
+ Yes, this is how the model was trained; otherwise you will see a performance degradation.
+
+ Here are some rules of thumb:
+ - Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
+ - Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
+ - Use the "query: " prefix if you want to use embeddings as features, for example in linear-probing classification or clustering.
+
+ **2. Why are my reproduced results slightly different from those reported in the model card?**
+
+ Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.
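The prefix rules above can be sketched with a small helper (hypothetical, not part of the model's API; the texts are from the usage example earlier in the card):

```python
def with_prefix(texts, kind):
    # kind is "query" or "passage"; prepend the matching E5 prefix.
    assert kind in ("query", "passage")
    return [f"{kind}: {t}" for t in texts]

# Asymmetric retrieval: queries and passages get different prefixes.
queries = with_prefix(["how much protein should a female eat"], "query")
passages = with_prefix(["Protein requirements vary by age and activity."], "passage")
print(queries[0])  # query: how much protein should a female eat

# Symmetric tasks (similarity, clustering): every text gets "query: ".
pair = with_prefix(["I like apples.", "Apples are my favorite."], "query")
```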
 ## Citation

 If you find our paper or models helpful, please consider citing as follows:
 
 ## Limitations

+ Long texts will be truncated to at most 512 tokens.