Update README.md
README.md
CHANGED
@@ -2762,7 +2762,7 @@ More details and evaluation scripts see [benchmark](benchmark/README.md).

 - **C-MTEB**:
-We create a benchmark C-MTEB for
+We create a benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks.
 Please refer to [benchmark](benchmark/README.md) for a detailed introduction.

 | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |

@@ -2783,15 +2783,15 @@ Please refer to [benchmark](benchmark/README.md) for a detailed introduction.

 ## Train
 This section will introduce the way we used to train the general embedding.
-The training scripts are in [flag_embedding](
-and we provide some examples to do [pre-train](examples/pretrain/
+The training scripts are in [flag_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/flag_embedding/baai_general_embedding/),
+and we provide some examples for [pre-training](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain/) and [fine-tuning](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).

 **1. RetroMAE Pre-train**
 We pre-train the model following the method [RetroMAE](https://github.com/staoxiao/RetroMAE),
 which shows promising improvements on retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
 The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
-In retromae, the mask ratio of encoder and decoder are 0.3, 0.5 respectively.
+In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
 We used the AdamW optimizer and the learning rate is 2e-5.
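
As a rough illustration of this asymmetric masking (a simplified sketch, not the actual RetroMAE implementation), the same token sequence is masked at two different ratios, one view for the encoder and a heavier one for the decoder; the helper below and its default mask-token id are assumptions made for the example:

```python
import torch

def random_mask(input_ids: torch.Tensor, mask_ratio: float, mask_token_id: int = 103) -> torch.Tensor:
    """Replace roughly `mask_ratio` of the tokens with the mask token (103 is BERT's [MASK] id)."""
    masked = input_ids.clone()
    masked[torch.rand(input_ids.shape) < mask_ratio] = mask_token_id
    return masked

# The same sequence is masked more lightly for the encoder than for the decoder.
input_ids = torch.randint(1000, 30000, (8, 128))        # toy batch of token ids
encoder_input = random_mask(input_ids, mask_ratio=0.3)  # encoder sees ~30% masked tokens
decoder_input = random_mask(input_ids, mask_ratio=0.5)  # decoder reconstructs from a ~50% mask
```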

 **Pre-training data**:

@@ -2808,49 +2808,36 @@ We used the AdamW optimizer and the learning rate is 2e-5.

 We fine-tune the model using a contrastive objective.
 The format of the input data is a triple `(query, positive, negative)`.
 Besides the negative in the triple, we also adopt an in-batch negatives strategy.
-We employ the cross-device negatives sharing method to
+We employ the cross-device negatives sharing method to share negatives among different GPUs,
 which can dramatically **increase the number of negatives**.
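
For illustration, a single training example in this triple format could look like the sketch below; the field names are hypothetical, not the repository's actual data schema:

```python
# A single fine-tuning example: a query, one relevant passage, and one hard negative.
# The field names are illustrative only; check the finetune example in the repository
# for the exact expected data format.
example = {
    "query": "what is the capital of France?",
    "positive": "Paris is the capital and most populous city of France.",
    "negative": "Berlin is the capital and largest city of Germany.",
}
```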

 We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are **65,535** negatives for each query in a batch).
 We used the AdamW optimizer and the learning rate is 1e-5.
 The temperature for contrastive loss is 0.01.
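
A minimal, single-device sketch of this objective is shown below; it normalizes the embeddings and scores every query against all positives and hard negatives in the batch (in the distributed setting the passage embeddings would additionally be gathered across GPUs, e.g. with `torch.distributed.all_gather`, before this step). It illustrates the idea and is not the repository's training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, pos_emb: torch.Tensor, neg_emb: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """InfoNCE-style loss with in-batch negatives (simplified, single-device sketch).

    q_emb, pos_emb, neg_emb: (B, d) embeddings of queries, their positives, and their hard negatives.
    For query i, pos_emb[i] is the positive; every other row of the passage matrix is a negative,
    which gives 2B - 1 negatives per query (65,535 when B = 32,768).
    """
    q = F.normalize(q_emb, dim=-1)
    passages = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)  # (2B, d), positives first
    scores = q @ passages.T / temperature                                 # (B, 2B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)                     # positive for query i is row i
    return F.cross_entropy(scores, labels)
```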

-For the version with `*-instrcution`, we add instruction to the query for retrieval task in the training.
-For
-For
+For the version with `*-instruction`, we add an instruction to the query for the retrieval task during training.
+For English, the instruction is `Represent this sentence for searching relevant passages: `;
+For Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
 In the evaluation, the instruction should be added for the sentence-to-passage retrieval task, but not for other tasks.
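
In practice this simply means prepending the instruction to the query text (and only to the query) before encoding. A sketch with `sentence-transformers` is shown below; the model name is a placeholder, not a specific released checkpoint:

```python
from sentence_transformers import SentenceTransformer

instruction = "Represent this sentence for searching relevant passages: "
queries = ["how much protein should a female eat"]
passages = ["As a general guideline, the average protein requirement for women ages 19 to 70 is 46 grams per day."]

model = SentenceTransformer("path-or-name-of-the-embedding-model")  # placeholder model id

# The instruction is prepended to queries only; passages are encoded as-is.
q_emb = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T  # cosine similarities, since the embeddings are normalized
```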

-The finetune script is accessible in this repository: [flag_embedding](
+The finetune script is accessible in this repository: [flag_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/flag_embedding/baai_general_embedding/README.md).
 You can easily finetune your model with it.

 **Training data**:

 - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.

-- For
+- For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE) and so on.

 **The data collection is to be released in the future.**

-## Schedule
-- [x] Chinese Massive Text Embedding Benchmark
-- [x] release baai-general-embedding models
-- [x] release codes for training
-- [ ] Training Datasets
-- [ ] Multilingual model
-- [ ] ...

 We will continually update the embedding models and training codes,
 hoping to promote the development of the embedding model community.

-## Contact
-If you have any question or suggestion related to this project, feel free to open an issue or pull a request.
-You also can email Shitao Xiao(stxiao@baai.ac.cn) and Zheng Liu(liuzheng@baai.ac.cn).

 ## License
-FlagEmbedding is licensed under [MIT License](
+FlagEmbedding is licensed under [MIT License](). The released models can be used for commercial purposes free of charge.