LouisCastricato
committed on
Commit e0c4448 • 1 Parent(s): a4ee477
Update model with weights trained on deduplicated `The Pile` (#1)
- [update] Release model weights from deduped Pile training (04653c1139332fc0745f4c0290f25ccb0e8d090b)
- [fix] Make `Training Data` section header-2 (7664ba3e78aae9c3b90389cb3241fbcb8e9f0b69)
- README.md +47 -15
- config.json +1 -1
- pytorch_model.bin +1 -1
README.md
CHANGED
@@ -6,15 +6,18 @@ datasets:
|
|
6 |
- pile
|
7 |
metrics:
|
8 |
- nDCG@10
|
|
|
9 |
---
|
10 |
|
11 |
# Carptriever-1
|
12 |
|
13 |
-
|
|
|
14 |
|
15 |
Carptriever-1 is a `bert-large-uncased` retrieval model trained with contrastive learning via a momentum contrastive (MoCo) mechanism following the work of G. Izacard et al. in ["Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning"](https://arxiv.org/abs/2112.09118).
|
16 |
|
17 |
-
|
|
|
18 |
|
19 |
```python
|
20 |
from transformers import AutoTokenizer, AutoModel
|
@@ -51,11 +54,13 @@ for sentence, score in sentence_score_pairs:
|
|
51 |
print(f"\nSentence: {sentence}\nScore: {score:.4f}")
|
52 |
```
|
53 |
|
54 |
-
# Training data
|
55 |
|
56 |
-
|
|
|
|
|
57 |
|
58 |
-
|
|
|
59 |
|
60 |
The model was trained on 32 A100 GPUs (40GB each) for approximately 100 hours with the following configuration:
|
61 |
|
@@ -73,24 +78,51 @@ The model was trained on 32 40GB A100 for approximately 100 hours with the follo
|
|
73 |
- `momentum = 0.999`
|
74 |
- `temperature = 0.05`
|
75 |
|
76 |
-
# Evaluation results
|
77 |
|
78 |
-
|
|
|
|
|
79 |
|
80 |
-
|
81 |
-
|------------------------------------------|-------------|---------|------------|----------|------------------|----------|------|---------|-------------|-------|-------------|---------|---------|-------|---------------|---------|
|
82 |
-
| Contriever* | 35.97 | 20.6 | 27.4 | 31.7 | 25.4 | 48.1 | 24.5 | 37.9 | 19.3 | 83.5 | 28.4 | 29.2 | 14.9 | 68.2 | 15.5 | 64.9 |
|
83 |
-
| Carptriever-1 | 34.29 | 18.81 | **46.5** | 28.9 | 21.1 | 39.01 | 20.2 | 33.4 | 17.3 | 80.6 | 25.4 | 23.6 | 14.9 | 59.6 | **18.7** | **66.4** |
|
84 |
|
85 |
-
86 |
|
87 |
Note that degradation in performance, relative to the Contriever model, was expected given the much broader diversity of our training dataset. We plan on addressing this in future updates with architectural improvements and view Carptriever-1 as our first iteration in the exploratory phase towards better language-embedding models.
|
88 |
|
89 |
-
# Appreciation
|
90 |
|
91 |
-
92 |
|
93 |
-
|
94 |
|
95 |
```bibtex
|
96 |
@misc{izacard2021contriever,
|
|
|
6 |
- pile
|
7 |
metrics:
|
8 |
- nDCG@10
|
9 |
+
- MRR
|
10 |
---
|
11 |
|
12 |
# Carptriever-1
|
13 |
|
14 |
+
|
15 |
+
## Model description
|
16 |
|
17 |
Carptriever-1 is a `bert-large-uncased` retrieval model trained with contrastive learning via a momentum contrastive (MoCo) mechanism following the work of G. Izacard et al. in ["Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning"](https://arxiv.org/abs/2112.09118).
|
18 |
|
19 |
+
|
20 |
+
## How to use
|
21 |
|
22 |
```python
|
23 |
from transformers import AutoTokenizer, AutoModel
|
|
|
54 |
print(f"\nSentence: {sentence}\nScore: {score:.4f}")
|
55 |
```
|
56 |
|
|
|
57 |
|
58 |
+
## Training data
|
59 |
+
|
60 |
+
Carptriever-1 is pre-trained on a de-duplicated subset of [The Pile](https://pile.eleuther.ai/), a large and diverse dataset created by EleutherAI for language model training. This subset was created through a [MinHash LSH](http://ekzhu.com/datasketch/lsh.html) process using a threshold of `0.87`.
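To make the thresholding step concrete, here is a minimal pure-Python sketch of MinHash-based near-duplicate filtering. It is an illustration, not the pipeline used for Carptriever-1: the shingle size, the number of hash functions, the helper names, and the brute-force comparison against every kept document are all assumptions. A real MinHash LSH index buckets signatures into bands instead of comparing all pairs. Only the `0.87` threshold comes from the description above.

```python
import hashlib

NUM_PERM = 128     # assumed number of hash functions (not stated in the model card)
THRESHOLD = 0.87   # similarity threshold quoted in the model card


def shingles(text, n=5):
    """Return the set of word n-grams of a document (n=5 is an assumption)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=NUM_PERM):
    """Keep the minimum hash value of the set under each seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=THRESHOLD):
    """Keep a document only if it is below the threshold against all kept ones."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Calling `deduplicate(list_of_strings)` returns the documents in their original order with near-duplicates of earlier documents dropped.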
|
61 |
|
62 |
+
|
63 |
+
## Training procedure
|
64 |
|
65 |
The model was trained on 32 A100 GPUs (40GB each) for approximately 100 hours with the following configuration:
|
66 |
|
|
|
78 |
- `momentum = 0.999`
|
79 |
- `temperature = 0.05`
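
As a rough illustration of where the `momentum` and `temperature` values enter a MoCo-style setup, the sketch below shows a momentum update of a key encoder's parameters and a temperature-scaled contrastive (InfoNCE) loss. It is a pure-Python toy under stated assumptions, not the training code: the function names, the flat parameter lists, and the example vectors are hypothetical.

```python
import math

MOMENTUM = 0.999     # value listed in the configuration above
TEMPERATURE = 0.05   # value listed in the configuration above


def momentum_update(key_params, query_params, m=MOMENTUM):
    """Move key-encoder parameters slowly toward the query encoder:
    key <- m * key + (1 - m) * query, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, positive, negatives, temperature=TEMPERATURE):
    """Cross-entropy of selecting the positive among positive + negatives,
    with similarities divided by the temperature."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]
```

A momentum close to 1 means the key encoder changes only slightly per step, and a small temperature sharpens the softmax over similarities, so the loss is near zero only when the positive clearly outscores every negative.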
|
80 |
|
|
|
81 |
|
82 |
+
## Evaluation results
|
83 |
+
|
84 |
+
#### [BEIR: Benchmarking IR](https://github.com/beir-cellar/beir)
|
85 |
|
86 |
+
We report the following BEIR scores as measured in normalized discounted cumulative gain (nDCG@10):
|
|
|
|
|
|
|
87 |
|
88 |
+
| Model         | Avg   | MSMARCO | TREC-Covid | NFCorpus | NaturalQuestions | HotpotQA | FiQA | ArguAna | Touché-2020 | Quora | CQAdupstack | DBPedia | Scidocs | Fever | Climate-fever | Scifact  |
|---------------|-------|---------|------------|----------|------------------|----------|------|---------|-------------|-------|-------------|---------|---------|-------|---------------|----------|
| Contriever*   | 35.97 | 20.6    | 27.4       | 31.7     | 25.4             | 48.1     | 24.5 | 37.9    | 19.3        | 83.5  | 28.40       | 29.2    | 14.9    | 68.20 | 15.5          | 64.9     |
| Carptriever-1 | 34.54 | 18.83   | **52.2**   | 28.5     | 21.1             | 39.4     | 23.2 | 31.7    | 15.2        | 81.3  | 26.88       | 25.4    | 14.2    | 57.36 | **17.9**      | 64.9     |
|
92 |
+
|
93 |
+
\* Results are taken from the Contriever [GitHub repository](https://github.com/facebookresearch/contriever).
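
For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query, using the common linear-gain formulation. It is not BEIR's evaluation code: the function names and the input layout (graded relevance labels in ranked order, plus all judged labels for the query) are assumptions made for illustration. The table values appear to be on a 0-100 scale, whereas this sketch returns a value between 0 and 1.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results.
    The result at 1-based rank r contributes rel / log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved documents, in ranked order.
    judged_relevances: relevance labels of every judged document for the query,
                       used to build the ideal ranking.
    """
    ideal = sorted(judged_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A benchmark-level score is then the mean of the per-query values.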
|
94 |
|
95 |
Note that degradation in performance, relative to the Contriever model, was expected given the much broader diversity of our training dataset. We plan on addressing this in future updates with architectural improvements and view Carptriever-1 as our first iteration in the exploratory phase towards better language-embedding models.
|
96 |
|
|
|
97 |
|
98 |
+
#### [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://arxiv.org/pdf/1909.09436.pdf)
|
99 |
+
|
100 |
+
We provide results on the CodeSearchNet benchmark, measured in Mean Reciprocal Rank (MRR), following the code search procedure outlined in Section 3.3 of Neelakantan et al.'s ["Text and Code Embeddings by Contrastive Pre-Training"](https://arxiv.org/pdf/2201.10005.pdf).
|
101 |
+
|
102 |
+
`Candidate Size = 1,000`
|
103 |
+
|
104 |
+
| Model           | Avg   | Python | Go    | Ruby  | PHP   | Java  | JS    |
|-----------------|-------|--------|-------|-------|-------|-------|-------|
| Carptriever-1   | 60.24 | 65.85  | 63.29 | 62.1  | 59.1  | 55.52 | 55.55 |
| Contriever      | 49.39 | 54.81  | 58.9  | 55.19 | 38.46 | 44.89 | 44.09 |
|
108 |
+
|
109 |
+
|
110 |
+
`Candidate Size = 10,000`
|
111 |
+
|
112 |
+
| Model           | Avg   | Python | Go    | Ruby  | PHP   | Java  | JS    |
|-----------------|-------|--------|-------|-------|-------|-------|-------|
| Carptriever-1   | 48.59 | 55.98  | 43.18 | 56.06 | 45.62 | 46.04 | 44.66 |
| Contriever      | 37    | 45.43  | 36.08 | 48.07 | 25.59 | 32.89 | 31.44 |
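
Mean Reciprocal Rank can be sketched as follows. This is an illustrative computation only, not the evaluation script credited in the acknowledgements: the function names and the `(ranked_ids, correct_id)` input layout are assumptions. In a setup with a fixed candidate size, each ranked list would hold the correct code snippet plus distractors, ordered by embedding similarity to the query.

```python
def reciprocal_rank(ranked_ids, correct_id):
    """Return 1 / rank of the correct candidate (1-based), or 0.0 if it is absent."""
    for rank, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id == correct_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average the reciprocal rank over (ranked_ids, correct_id) pairs."""
    if not rankings:
        return 0.0
    return sum(reciprocal_rank(ids, correct) for ids, correct in rankings) / len(rankings)
```

Larger candidate pools contain more distractors, which makes a top rank harder to reach and is consistent with the lower scores in the `Candidate Size = 10,000` table.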
|
116 |
+
|
117 |
+
|
118 |
+
## Acknowledgements
|
119 |
+
|
120 |
+
This work would not have been possible without the compute support of [Stability AI](https://stability.ai/).
|
121 |
+
|
122 |
+
Thank you to Louis Castricato for research guidance and Reshinth Adithyan for creating the CodeSearchNet evaluation script.
|
123 |
+
|
124 |
|
125 |
+
## Citations
|
126 |
|
127 |
```bibtex
|
128 |
@misc{izacard2021contriever,
|
config.json
CHANGED
@@ -20,7 +20,7 @@
|
|
20 |
"pooling": "average",
|
21 |
"position_embedding_type": "absolute",
|
22 |
"torch_dtype": "float32",
|
23 |
-
"transformers_version": "4.
|
24 |
"type_vocab_size": 2,
|
25 |
"use_cache": true,
|
26 |
"vocab_size": 30522
|
|
|
20 |
"pooling": "average",
|
21 |
"position_embedding_type": "absolute",
|
22 |
"torch_dtype": "float32",
|
23 |
+
"transformers_version": "4.21.3",
|
24 |
"type_vocab_size": 2,
|
25 |
"use_cache": true,
|
26 |
"vocab_size": 30522
|
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 1336496437
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:63f84de125685fd11067227632f0be12e124cfde031d4b47460e2220b8ee5f84
|
3 |
size 1336496437
|