Commit 66c771d by izhx
1 Parent(s): 49b4cb7

Update README.md

Files changed (1)
  1. README.md +161 -0
README.md CHANGED
@@ -1,3 +1,164 @@
1
  ---
2
  license: bigscience-bloom-rail-1.0
3
  ---
4
+
5
+ # Model Card for udever-bloom
6
+
7
+ <!-- Provide a quick summary of what the model is/does. -->
8
+
9
+ `udever-bloom-560m` is finetuned from [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) via [BitFit](https://aclanthology.org/2022.acl-short.1/) on MS MARCO Passage Ranking, SNLI and MultiNLI data.
10
+ It is a universal embedding model that works across tasks and across both natural and programming languages.
11
+ (Technically, `udever` is `sgpt-bloom` with some minor improvements.)
12
+
13
+ <div align=center><img width="338" height="259" src="https://user-images.githubusercontent.com/26690193/277643721-cdb7f227-cae5-40e1-b6e1-a201bde00339.png" /></div>
14
+
15
+
16
+ ## Model Details
17
+
18
+ ### Model Description
19
+
20
+ - **Developed by:** Alibaba Group
21
+ - **Model type:** Transformer-based Language Model (decoder-only)
22
+ - **Language(s) (NLP):** Multiple; see [bloom training data](https://huggingface.co/bigscience/bloom-560m#training-data)
23
+ - **Finetuned from model:** [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m)
24
+
25
+ ### Model Sources
26
+
27
+ <!-- Provide the basic links for the model. -->
28
+
29
+ - **Repository:** [github.com/izhx/uni-rep](https://github.com/izhx/uni-rep)
30
+ - **Paper:** [Language Models are Universal Embedders](https://arxiv.org/pdf/2310.08232.pdf)
31
+
32
+
33
+
34
+ ## How to Get Started with the Model
35
+
36
+ Use the code below to get started with the model.
37
+
38
+ ```python
39
+
40
+ ```
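The card does not yet include usage code. What follows is a minimal sketch of computing embeddings with `transformers`, under two assumptions that this card does not confirm: the Hub id `izhx/udever-bloom-560m`, and mean pooling over non-padding tokens. The pooling and any special tokens used in training may differ, so check the repository before relying on it. The names `mean_pool`, `cosine`, and `embed` are illustrative only.

```python
import math

MODEL_ID = "izhx/udever-bloom-560m"  # assumed Hub id; not stated in this card


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens (assumed pooling strategy)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def embed(texts, model_id=MODEL_ID):
    """Encode texts into one vector each.

    torch and transformers are imported here so the helpers above
    stay usable without them installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # [batch, seq, dim]
    return [
        mean_pool(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]
```

Calling `embed` on a list of strings returns one vector per string, and any two of those vectors can be compared with `cosine`.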
41
+
42
+ ## Training Details
43
+
44
+ ### Training Data
45
+
46
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
47
+
48
+ - MS MARCO Passage Ranking, obtained as in the sentence-transformers script [train_bi-encoder_mnrl.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86)
49
+ - SNLI and MultiNLI, from [AllNLI.tsv.gz](https://sbert.net/datasets/AllNLI.tsv.gz)
50
+
51
+
52
+ ### Training Procedure
53
+
54
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
55
+
56
+ #### Preprocessing
57
+
58
+ MS MARCO hard negatives are those provided by the sentence-transformers script [train_bi-encoder_mnrl.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86).
59
+ Negatives for SNLI and MultiNLI are randomly sampled.
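Random negative sampling for the NLI pairs could look like the sketch below. The card only says the negatives are randomly sampled. The sampling pool, the exclusion of the anchor and positive, and the function name `sample_random_negatives` are assumptions made for illustration.

```python
import random


def sample_random_negatives(anchor, positive, pool, k, rng=None):
    """Draw up to k distinct sentences from pool, skipping the pair itself."""
    rng = rng or random.Random()
    # Exclude the anchor and its positive so they cannot appear as negatives.
    candidates = [s for s in pool if s != anchor and s != positive]
    return rng.sample(candidates, min(k, len(candidates)))
```

Passing a seeded `random.Random` makes the sampled negatives reproducible across runs.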
60
+
61
+
62
+ #### Training Hyperparameters
63
+
64
+ - **Training regime:** tf32, BitFit
65
+ - **Batch size:** 1024
66
+ - **Epochs:** 3
67
+ - **Optimizer:** AdamW
68
+ - **Learning rate:** 1e-4
69
+ - **Scheduler:** constant with warmup
70
+ - **Warmup:** 0.25 epoch
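Two of the settings above can be illustrated with a short sketch: BitFit, which trains bias terms only, and a constant schedule with warmup. The linear warmup shape, the choice to count every parameter whose name ends in `bias` (LayerNorm biases included), and the `steps_per_epoch` argument are assumptions, not details taken from this card.

```python
def bitfit_trainable(param_names):
    """Return the parameter names that stay trainable under BitFit (biases only)."""
    return [name for name in param_names if name.endswith("bias")]


def lr_at_step(step, steps_per_epoch, base_lr=1e-4, warmup_epochs=0.25):
    """Learning rate with linear warmup (assumed shape), then constant."""
    warmup_steps = max(1, int(warmup_epochs * steps_per_epoch))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

With a PyTorch model, the same selection rule would typically be applied by setting `p.requires_grad = name.endswith("bias")` while iterating over `model.named_parameters()`.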
71
+
72
+
73
+ ## Evaluation
74
+
75
+ ### Table 1: Massive Text Embedding Benchmark ([MTEB](https://arxiv.org/abs/2210.07316))
76
+
77
+ | MTEB | Avg. | Class. | Clust. | PairClass. | Rerank. | Retr. | STS | Summ. |
78
+ |-----------------------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------|
79
+ | #Datasets ➡️ | 56 | 12 | 11 | 3 | 4 | 15 | 10 | 1 |
80
+ ||
81
+ | bge-large-en-v1.5 | **64.23** | **75.97** | 46.08| **87.12** | **60.03** | **54.29** | 83.11| 31.61 |
82
+ | bge-base-en-v1.5 | 63.55| 75.53| 45.77| 86.55| 58.86| 53.25| 82.4| 31.07 |
83
+ | gte-large | 63.13| 73.33| **46.84** | 85| 59.13| 52.22| **83.35** | 31.66 |
84
+ | gte-base | 62.39| 73.01| 46.2| 84.57| 58.61| 51.14| 82.3| 31.17 |
85
+ | e5-large-v2 | 62.25| 75.24| 44.49| 86.03| 56.61| 50.56| 82.05| 30.19 |
86
+ | instructor-xl | 61.79| 73.12| 44.74| 86.62| 57.29| 49.26| 83.06| 32.32 |
87
+ | instructor-large | 61.59| 73.86| 45.29| 85.89| 57.54| 47.57| 83.15| 31.84 |
88
+ | e5-base-v2 | 61.5 | 73.84| 43.8| 85.73| 55.91| 50.29| 81.05| 30.28 |
89
+ | e5-large | 61.42| 73.14| 43.33| 85.94| 56.53| 49.99| 82.06| 30.97 |
90
+ | text-embedding-ada-002 (OpenAI API) | 60.99| 70.93| 45.9 | 84.89| 56.32| 49.25| 80.97| 30.8 |
91
+ | e5-base | 60.44| 72.63| 42.11| 85.09| 55.7 | 48.75| 80.96| 31.01 |
92
+ | SGPT-5.8B-msmarco | 58.93| 68.13| 40.34| 82 | 56.56| 50.25| 78.1 | 31.46 |
93
+ | sgpt-bloom-7b1-msmarco | 57.59| 66.19| 38.93| 81.9 | 55.65| 48.22| 77.74| **33.6** |
94
+ ||
95
+ | Udever-bloom-560m | 55.80| 68.04| 36.89| 81.05| 52.60| 41.19| 79.93| 32.06 |
96
+ | Udever-bloom-1b1 | 58.28| 70.18| 39.11| 83.11| 54.28| 45.27| 81.52| 31.10 |
97
+ | Udever-bloom-3b | 59.86| 71.91| 40.74| 84.06| 54.90| 47.67| 82.37| 30.62 |
98
+ | Udever-bloom-7b1 | 60.63 | 72.13| 40.81| 85.40| 55.91| 49.34| 83.01| 30.97 |
99
+
100
+
101
+ ### Table 2: [CodeSearchNet](https://arxiv.org/abs/1909.09436)
102
+
103
+ | CodeSearchNet | Go | Ruby | Python | Java | JS | PHP | Avg. |
104
+ |-|-|-|-|-|-|-|-|
105
+ | CodeBERT | 69.3 | 70.6 | 84.0 | 86.8 | 74.8 | 70.6 | 76.0 |
106
+ | GraphCodeBERT | 84.1 | 73.2 | 87.9 | 75.7 | 71.1 | 72.5 | 77.4 |
107
+ | cpt-code S | **97.7** | **86.3** | 99.8 | 94.0 | 86.0 | 96.7 | 93.4 |
108
+ | cpt-code M | 97.5 | 85.5 | **99.9** | **94.4** | **86.5** | **97.2** | **93.5** |
109
+ | sgpt-bloom-7b1-msmarco | 76.79 | 69.25 | 95.68 | 77.93 | 70.35 | 73.45 | 77.24 |
110
+ ||
111
+ | Udever-bloom-560m | 75.38 | 66.67 | 96.23 | 78.99 | 69.39 | 73.69 | 76.73 |
112
+ | Udever-bloom-1b1 | 78.76 | 72.85 | 97.67 | 82.77 | 74.38 | 78.97 | 80.90 |
113
+ | Udever-bloom-3b | 80.63 | 75.40 | 98.02 | 83.88 | 76.18 | 79.67 | 82.29 |
114
+ | Udever-bloom-7b1 | 79.37 | 76.59 | 98.38 | 84.68 | 77.49 | 80.03 | 82.76 |
115
+
116
+
117
+ ### Table 3: Chinese multi-domain retrieval ([Multi-CPR](https://dl.acm.org/doi/10.1145/3477495.3531736))
118
+
119
+ | | | |E-commerce | | Entertainment video | | Medical | |
120
+ |--|--|--|--|--|--|--|--|--|
121
+ | Model | Train | Backbone | MRR@10 | Recall@1k | MRR@10 | Recall@1k | MRR@10 | Recall@1k |
122
+ ||
123
+ | BM25 | - | - | 0.225 | 0.815 | 0.225 | 0.780 | 0.187 | 0.482 |
124
+ | Doc2Query | - | - | 0.239 | 0.826 | 0.238 | 0.794 | 0.210 | 0.505 |
125
+ | DPR-1 | In-Domain | BERT | 0.270 | 0.921 | 0.254 | 0.934 | 0.327 | 0.747 |
126
+ | DPR-2 | In-Domain | BERT-CT | 0.289 | **0.926** | 0.263 | **0.935** | 0.339 | **0.769** |
127
+ | text-embedding-ada-002 | General | GPT | 0.183 | 0.825 | 0.159 | 0.786 | 0.245 | 0.593 |
128
+ | sgpt-bloom-7b1-msmarco | General | BLOOM | 0.242 | 0.840 | 0.227 | 0.829 | 0.311 | 0.675 |
129
+ ||
130
+ | Udever-bloom-560m | General | BLOOM | 0.156 | 0.802 | 0.149 | 0.749 | 0.245 | 0.571 |
131
+ | Udever-bloom-1b1 | General | BLOOM | 0.244 | 0.863 | 0.208 | 0.815 | 0.241 | 0.557 |
132
+ | Udever-bloom-3b | General | BLOOM | 0.267 | 0.871 | 0.228 | 0.836 | 0.288 | 0.619 |
133
+ | Udever-bloom-7b1 | General | BLOOM | **0.296** | 0.889 | **0.267** | 0.907 | **0.343** | 0.705 |
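For readers unfamiliar with the two metrics in Table 3, the sketch below shows how MRR@k and Recall@k are commonly computed. It uses the standard definitions and is not the evaluation code behind the table. The input layout (one ranked list of ids and one set of relevant ids per query) is an assumption.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant id within the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings, relevant, k=1000):
    """Mean fraction of each query's relevant ids found within the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        total += len(set(ranked[:k]) & rel) / len(rel)
    return total / len(rankings)
```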
134
+
135
+ #### For more results, refer to Section 3 of the [paper](https://arxiv.org/pdf/2310.08232.pdf).
136
+
137
+
138
+
139
+ ## Technical Specifications
140
+
141
+ ### Model Architecture and Objective
142
+
143
+ - Model: [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m).
144
+ - Objective: Contrastive loss with hard negatives (refer to Section 2.2 of the [paper](https://arxiv.org/pdf/2310.08232.pdf)).
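As a rough illustration of the objective, the sketch below computes an InfoNCE-style contrastive loss for one query, one positive, and a list of negatives. It is the textbook form and may not match Section 2.2 of the paper exactly. The cosine scoring, the temperature of 0.05, and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """Negative log-softmax of the positive's score among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]
```

Hard negatives and in-batch negatives would both be passed in `negatives`. The loss approaches zero as the positive scores well above every negative.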
145
+
146
+
147
+ ### Compute Infrastructure
148
+
149
+ - Nvidia A100 SXM4 80GB.
150
+ - torch 2.0.0, transformers 4.29.2.
151
+
152
+
153
+ ## Citation
154
+
155
+ **BibTeX:**
156
+
157
+ ```BibTeX
158
+ @article{zhang2023language,
159
+ title={Language Models are Universal Embedders},
160
+ author={Zhang, Xin and Li, Zehan and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Zhang, Min},
161
+ journal={arXiv preprint arXiv:2310.08232},
162
+ year={2023}
163
+ }
164
+ ```