---
license: bigscience-bloom-rail-1.0
---

# Model Card for udever-bloom

<!-- Provide a quick summary of what the model is/does. -->

`udever-bloom-560m` is finetuned from [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) via [BitFit](https://aclanthology.org/2022.acl-short.1/) on MS MARCO Passage Ranking, SNLI and MultiNLI data.
It is a universal embedding model across tasks, natural and programming languages.
(Technically, `udever` is `sgpt-bloom` with some minor improvements.)

<div align=center><img width="338" height="259" src="https://user-images.githubusercontent.com/26690193/277643721-cdb7f227-cae5-40e1-b6e1-a201bde00339.png" /></div>


## Model Details

### Model Description

- **Developed by:** Alibaba Group
- **Model type:** Transformer-based Language Model (decoder-only)
- **Language(s) (NLP):** Multiple; see the [BLOOM training data](https://huggingface.co/bigscience/bloom-560m#training-data)
- **Finetuned from model:** [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [github.com/izhx/uni-rep](https://github.com/izhx/uni-rep)
- **Paper:** [Language Models are Universal Embedders](https://arxiv.org/pdf/2310.08232.pdf)


## How to Get Started with the Model

Use the code below to get started with the model.
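
The snippet below is a minimal sketch, not the official example: it assumes the checkpoint is published as `izhx/udever-bloom-560m` and uses last-token pooling over the decoder's hidden states; the exact prompt format and pooling used in the paper may differ, so see [github.com/izhx/uni-rep](https://github.com/izhx/uni-rep) for the reference implementation.

```python
# Minimal sketch (assumed repo id and pooling; see the uni-rep repository for the official usage).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "izhx/udever-bloom-560m"  # assumption: adjust to the actual HF repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()


@torch.no_grad()
def encode(texts):
    """Embed a list of strings with last-token pooling and L2 normalization."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                   # (B, T, H)
    # Index of the last non-padding token, valid for left or right padding.
    last = batch["attention_mask"].cumsum(dim=1).argmax(dim=1)  # (B,)
    emb = hidden[torch.arange(hidden.size(0)), last]             # (B, H)
    return torch.nn.functional.normalize(emb, p=2, dim=1)


sentences = ["A man is eating food.", "A man is eating a piece of bread."]
embeddings = encode(sentences)
print(embeddings @ embeddings.T)  # cosine similarity matrix
```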

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- MS MARCO Passage Ranking, retrieved by the [sentence-transformers script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86)
- SNLI and MultiNLI ([AllNLI.tsv.gz](https://sbert.net/datasets/AllNLI.tsv.gz))


### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

MS MARCO hard negatives are those provided by the [sentence-transformers script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py#L86).
Negatives for SNLI and MultiNLI are randomly sampled.
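
As an illustration of the random-negative step, the sketch below builds (anchor, positive, random negative) triplets from `AllNLI.tsv.gz`; the column names and the exact sampling scheme are assumptions for illustration, not the released preprocessing code.

```python
# Illustrative random-negative sampling for NLI pairs (assumed TSV columns).
import csv
import gzip
import random

random.seed(42)

pairs = []   # (premise, entailed hypothesis) -> (anchor, positive)
pool = []    # all hypotheses, used as the pool for random negatives

with gzip.open("AllNLI.tsv.gz", "rt", encoding="utf8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["split"] != "train":
            continue
        pool.append(row["sentence2"])
        if row["label"] == "entailment":
            pairs.append((row["sentence1"], row["sentence2"]))

# Attach one randomly sampled negative to every (anchor, positive) pair.
triplets = [(anchor, pos, random.choice(pool)) for anchor, pos in pairs]
print(len(triplets), triplets[0])
```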

#### Training Hyperparameters

- **Training regime:** tf32, BitFit
- **Batch size:** 1024
- **Epochs:** 3
- **Optimizer:** AdamW
- **Learning rate:** 1e-4
- **Scheduler:** constant with warmup
- **Warmup:** 0.25 epoch
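
A sketch of how these settings could be wired together with `torch` and `transformers` is shown below; BitFit is approximated by unfreezing only bias parameters, and the step counts are placeholders, so this is an illustration under stated assumptions rather than the released training code.

```python
# Hypothetical setup mirroring the hyperparameters above (not the released training code).
import torch
from transformers import AutoModel, get_constant_schedule_with_warmup

model = AutoModel.from_pretrained("bigscience/bloom-560m")

# BitFit: train only bias terms, freeze all other weights.
bitfit_params = []
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")
    if param.requires_grad:
        bitfit_params.append(param)

optimizer = torch.optim.AdamW(bitfit_params, lr=1e-4)

# Constant learning rate after 0.25 epoch of warmup; steps_per_epoch depends on
# the dataset size and the 1024 batch size, so this number is a placeholder.
steps_per_epoch = 1000
scheduler = get_constant_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.25 * steps_per_epoch)
)

# tf32 matmuls on Ampere GPUs (the "tf32" training regime above).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```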


## Evaluation

### Table 1: Massive Text Embedding Benchmark [MTEB](https://arxiv.org/abs/2210.07316)

| MTEB | Avg. | Class. | Clust. | PairClass. | Rerank. | Retr. | STS | Summ. |
|------|------|--------|--------|------------|---------|-------|-----|-------|
| #Datasets ➡️ | 56 | 12 | 11 | 3 | 4 | 15 | 10 | 1 |
||
| bge-large-en-v1.5 | **64.23** | **75.97** | 46.08 | **87.12** | **60.03** | **54.29** | 83.11 | 31.61 |
| bge-base-en-v1.5 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
| gte-large | 63.13 | 73.33 | **46.84** | 85 | 59.13 | 52.22 | **83.35** | 31.66 |
| gte-base | 62.39 | 73.01 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 |
| e5-large-v2 | 62.25 | 75.24 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 |
| instructor-xl | 61.79 | 73.12 | 44.74 | 86.62 | 57.29 | 49.26 | 83.06 | 32.32 |
| instructor-large | 61.59 | 73.86 | 45.29 | 85.89 | 57.54 | 47.57 | 83.15 | 31.84 |
| e5-base-v2 | 61.5 | 73.84 | 43.8 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 |
| e5-large | 61.42 | 73.14 | 43.33 | 85.94 | 56.53 | 49.99 | 82.06 | 30.97 |
| text-embedding-ada-002 (OpenAI API) | 60.99 | 70.93 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 |
| e5-base | 60.44 | 72.63 | 42.11 | 85.09 | 55.7 | 48.75 | 80.96 | 31.01 |
| SGPT-5.8B-msmarco | 58.93 | 68.13 | 40.34 | 82 | 56.56 | 50.25 | 78.1 | 31.46 |
| sgpt-bloom-7b1-msmarco | 57.59 | 66.19 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | **33.6** |
||
| Udever-bloom-560m | 55.80 | 68.04 | 36.89 | 81.05 | 52.60 | 41.19 | 79.93 | 32.06 |
| Udever-bloom-1b1 | 58.28 | 70.18 | 39.11 | 83.11 | 54.28 | 45.27 | 81.52 | 31.10 |
| Udever-bloom-3b | 59.86 | 71.91 | 40.74 | 84.06 | 54.90 | 47.67 | 82.37 | 30.62 |
| Udever-bloom-7b1 | 60.63 | 72.13 | 40.81 | 85.40 | 55.91 | 49.34 | 83.01 | 30.97 |


### Table 2: [CodeSearchNet](https://arxiv.org/abs/1909.09436)

| CodeSearchNet | Go | Ruby | Python | Java | JS | PHP | Avg. |
|-|-|-|-|-|-|-|-|
| CodeBERT | 69.3 | 70.6 | 84.0 | 86.8 | 74.8 | 70.6 | 76.0 |
| GraphCodeBERT | 84.1 | 73.2 | 87.9 | 75.7 | 71.1 | 72.5 | 77.4 |
| cpt-code S | **97.7** | **86.3** | 99.8 | 94.0 | 86.0 | 96.7 | 93.4 |
| cpt-code M | 97.5 | 85.5 | **99.9** | **94.4** | **86.5** | **97.2** | **93.5** |
| sgpt-bloom-7b1-msmarco | 76.79 | 69.25 | 95.68 | 77.93 | 70.35 | 73.45 | 77.24 |
||
| Udever-bloom-560m | 75.38 | 66.67 | 96.23 | 78.99 | 69.39 | 73.69 | 76.73 |
| Udever-bloom-1b1 | 78.76 | 72.85 | 97.67 | 82.77 | 74.38 | 78.97 | 80.90 |
| Udever-bloom-3b | 80.63 | 75.40 | 98.02 | 83.88 | 76.18 | 79.67 | 82.29 |
| Udever-bloom-7b1 | 79.37 | 76.59 | 98.38 | 84.68 | 77.49 | 80.03 | 82.76 |


### Table 3: Chinese multi-domain retrieval, [Multi-CPR](https://dl.acm.org/doi/10.1145/3477495.3531736)

| Model | Train | Backbone | E-commerce MRR@10 | E-commerce Recall@1k | Entertainment video MRR@10 | Entertainment video Recall@1k | Medical MRR@10 | Medical Recall@1k |
|--|--|--|--|--|--|--|--|--|
| BM25 | - | - | 0.225 | 0.815 | 0.225 | 0.780 | 0.187 | 0.482 |
| Doc2Query | - | - | 0.239 | 0.826 | 0.238 | 0.794 | 0.210 | 0.505 |
| DPR-1 | In-Domain | BERT | 0.270 | 0.921 | 0.254 | 0.934 | 0.327 | 0.747 |
| DPR-2 | In-Domain | BERT-CT | 0.289 | **0.926** | 0.263 | **0.935** | 0.339 | **0.769** |
| text-embedding-ada-002 | General | GPT | 0.183 | 0.825 | 0.159 | 0.786 | 0.245 | 0.593 |
| sgpt-bloom-7b1-msmarco | General | BLOOM | 0.242 | 0.840 | 0.227 | 0.829 | 0.311 | 0.675 |
||
| Udever-bloom-560m | General | BLOOM | 0.156 | 0.802 | 0.149 | 0.749 | 0.245 | 0.571 |
| Udever-bloom-1b1 | General | BLOOM | 0.244 | 0.863 | 0.208 | 0.815 | 0.241 | 0.557 |
| Udever-bloom-3b | General | BLOOM | 0.267 | 0.871 | 0.228 | 0.836 | 0.288 | 0.619 |
| Udever-bloom-7b1 | General | BLOOM | **0.296** | 0.889 | **0.267** | 0.907 | **0.343** | 0.705 |

#### More results can be found in Section 3 of the [paper](https://arxiv.org/pdf/2310.08232.pdf).
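
Results like those in Table 1 can be reproduced with the [MTEB](https://github.com/embeddings-benchmark/mteb) toolkit, which only needs an object exposing `encode(list_of_sentences) -> np.ndarray`. The sketch below shows that pattern, reusing the `encode` helper from the quick-start section; the task selection and wrapper details are illustrative, and the `mteb` API may differ across versions.

```python
# Illustrative MTEB run (not the exact evaluation script behind Table 1).
# Assumes the `encode` helper defined in the quick-start snippet above.
import numpy as np
from mteb import MTEB


class UdeverWrapper:
    """Minimal interface expected by MTEB: encode(sentences) -> np.ndarray."""

    def encode(self, sentences, batch_size=32, **kwargs):
        chunks = [
            encode(sentences[i:i + batch_size]).cpu().numpy()
            for i in range(0, len(sentences), batch_size)
        ]
        return np.concatenate(chunks, axis=0)


evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
evaluation.run(UdeverWrapper(), output_folder="results/udever-bloom-560m")
```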


## Technical Specifications

### Model Architecture and Objective

- Model: [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m).
- Objective: Contrastive loss with hard negatives (see Section 2.2 of the [paper](https://arxiv.org/pdf/2310.08232.pdf)); a generic sketch is shown below.
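
The sketch below illustrates such a loss in its common InfoNCE-style form: a cross-entropy over one positive and one mined hard negative per query, with the other in-batch candidates acting as additional negatives. The temperature and the exact negative handling are assumptions, not necessarily those of Section 2.2.

```python
# Generic contrastive loss with in-batch and hard negatives (illustrative).
import torch
import torch.nn.functional as F


def contrastive_loss(q, p, n, temperature=0.05):
    """q: (B, H) queries, p: (B, H) positives, n: (B, H) hard negatives."""
    q = F.normalize(q, dim=1)
    cand = F.normalize(torch.cat([p, n], dim=0), dim=1)  # (2B, H) candidate pool
    logits = q @ cand.T / temperature                    # (B, 2B) similarities
    labels = torch.arange(q.size(0), device=q.device)    # i-th positive for i-th query
    return F.cross_entropy(logits, labels)


# Example with random tensors standing in for encoder outputs.
B, H = 4, 1024
loss = contrastive_loss(torch.randn(B, H), torch.randn(B, H), torch.randn(B, H))
print(loss.item())
```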


### Compute Infrastructure

- Nvidia A100 SXM4 80GB.
- torch 2.0.0, transformers 4.29.2.


## Citation

**BibTeX:**

```BibTeX
@article{zhang2023language,
  title={Language Models are Universal Embedders},
  author={Zhang, Xin and Li, Zehan and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Zhang, Min},
  journal={arXiv preprint arXiv:2310.08232},
  year={2023}
}
```