ldwang committed on
Commit
28fa9e5
•
1 Parent(s): 97d7bcd
Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -2605,12 +2605,15 @@ pipeline_tag: sentence-similarity
 ---
 
 
+ **Recommend switching to the newest bge-large-en-v1.5, which has a more reasonable similarity distribution and the same method of usage.**
+
 <h1 align="center">FlagEmbedding</h1>
 
 
 <h4 align="center">
 <p>
 <a href=#model-list>Model List</a> |
+ <a href=#frequently-asked-questions>FAQ</a> |
 <a href=#usage>Usage</a> |
 <a href="#evaluation">Evaluation</a> |
 <a href="#train">Train</a> |
@@ -2630,8 +2633,8 @@ And it also can be used in vector databases for LLMs.
 
 ************* 🌟**Updates**🌟 *************
 - 09/12/2023: New Release:
- - **New reranker model**: release a cross-encoder model bge-reranker-base, which is more powerful than embedding model. We recommend to use/fine-tune it to re-rank top-k documents returned by embedding models.
- - **update embedding model**: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
+ - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding model. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
+ - **Updated embedding model**: release `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
 - 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
 - 08/09/2023: BGE Models are integrated into **Langchain**, you can use it like [this](#using-langchain); C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
 - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
@@ -2663,7 +2666,7 @@ And it also can be used in vector databases for LLMs.
 
 \*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
 
- \**: To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
+ \**: Different from the embedding model, the reranker is a cross-encoder and cannot be used to generate embeddings. To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by other, simpler models.
 For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
 
 
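The retrieve-then-rerank flow described above can be sketched end to end. This is a self-contained toy (hand-made 2-d "embeddings" and a made-up reranker score table, not the real bge models); it shows the key property that the reranker only ever sees candidates the embedding model surfaces:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, k=100, final_n=3):
    """Stage 1: rank all docs by embedding similarity (cheap), keep top-k.
    Stage 2: re-score only those k with the expensive scorer, keep final_n."""
    top_k = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                   reverse=True)[:k]
    return sorted(top_k, key=rerank_score, reverse=True)[:final_n]

# Toy corpus: 2-d stand-in embeddings and a hypothetical cross-encoder score table.
docs = {"d1": (1.0, 0.0), "d2": (0.9, 0.1), "d3": (0.0, 1.0), "d4": (0.7, 0.3)}
reranker = {"d1": 0.2, "d2": 0.9, "d3": 0.8, "d4": 0.5}.get

result = retrieve_then_rerank((1.0, 0.0), docs, reranker, k=3, final_n=2)
assert result == ["d2", "d4"]
# Note: d3 has a high reranker score (0.8) but never reaches the reranker,
# because stage-1 retrieval ranked it outside the top-k.
```

The final comment is the practical caveat: a strong reranker cannot rescue documents the embedding model fails to retrieve, so k must be chosen generously.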
@@ -2675,7 +2678,7 @@ For example, use the bge embedding model to retrieve the top 100 relevant docume
 <!-- ### How to fine-tune bge embedding model? -->
 Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
 Some suggestions:
- - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#data-format), which can improve the retrieval performance.
+ - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve retrieval performance.
 - If you pre-train bge on your own data, the pre-trained model cannot be used directly to calculate similarity; it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker.
 
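The hard-negative mining linked above can be approximated in a few lines: take documents the current retriever itself ranks highly but that are not labeled positive. The function below is an illustrative sketch, not the repository's actual script; the sampling range and counts are assumed defaults:

```python
import random

def mine_hard_negatives(ranked_doc_ids, positives,
                        sample_range=(10, 210), num_negs=7, seed=0):
    """Pick negatives from documents the retriever ranks highly but that are
    not labeled positive -- 'hard' because the embedding model already
    confuses them with true answers. `sample_range` skips the very top ranks,
    which often contain unlabeled positives."""
    lo, hi = sample_range
    pool = [d for d in ranked_doc_ids[lo:hi] if d not in positives]
    rng = random.Random(seed)
    return rng.sample(pool, min(num_negs, len(pool)))

# Toy usage: a retriever ranking of 30 documents, best first.
ranked = [f"doc{i}" for i in range(30)]
negs = mine_hard_negatives(ranked, positives={"doc0", "doc12"},
                           sample_range=(10, 30), num_negs=5)
assert len(negs) == 5
assert "doc0" not in negs and "doc12" not in negs
assert all(n in ranked[10:30] for n in negs)
```

Random negatives, by contrast, are usually topically unrelated and teach the model very little, which is why mining from the retriever's own mistakes improves fine-tuning.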
@@ -2959,8 +2962,8 @@ Cross-encoder will perform full-attention over the input pair,
 which is more accurate than the embedding model (i.e., bi-encoder) but more time-consuming.
 Therefore, it can be used to re-rank the top-k documents returned by the embedding model.
 We train the cross-encoder on multilingual pair data.
- The data format is the same as embedding model, so you can fine-tune it easily following our example.
- More details pelease refer to [./FlagEmbedding/reranker/README.md](./FlagEmbedding/reranker/README.md)
+ The data format is the same as for the embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
+ For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
 
 
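The shared data format mentioned above is, per the linked FlagEmbedding fine-tune example, one JSON object per line with `query`, `pos`, and `neg` fields (treat the exact field names as an assumption if the repository has since changed). A minimal sketch of writing and round-tripping such a file:

```python
import json

# One training example per line: a query, positive passages, and mined negatives.
examples = [
    {"query": "what is a cross-encoder?",
     "pos": ["A cross-encoder scores a (query, passage) pair with full attention."],
     "neg": ["Paris is the capital of France.",
             "A bi-encoder embeds query and passage independently."]},
]

with open("toy_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Round-trip check: every line parses back into the same record.
with open("toy_train.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
assert loaded == examples
```

Because embedding and reranker fine-tuning consume the same records, one mined dataset can serve both training stages.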
  ## Contact
 