batterydata committed
Commit e675b6d
1 Parent(s): 76be348

Update README.md

Files changed (1)
  1. README.md +27 -27
README.md CHANGED
@@ -30,6 +30,33 @@ This way, the model learns an inner representation of the English language that
  useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
  classifier using the features produced by the BERT model as inputs.
 
+ ## Training data
+ 
+ The BatteryOnlyBERT model was pretrained on the full text of battery papers only. The paper corpus contains 1.87B tokens from a total of 400,366 battery research papers published between 2000 and June 2021 by the Royal Society of Chemistry (RSC), Elsevier, and Springer. The list of DOIs can be found on [GitHub](https://github.com/ShuHuang/batterybert/blob/main/corpus.txt).
+ 
+ ## Training procedure
+ 
+ ### Preprocessing
+ 
+ The texts are lowercased and tokenized using WordPiece with a vocabulary size of 28,996. The inputs of the model are
+ then of the form:
+ 
+ ```
+ [CLS] Sentence A [SEP] Sentence B [SEP]
+ ```
+ 
+ The details of the masking procedure for each sentence are as follows:
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the remaining 10% of cases, the masked tokens are left as is.
+ 
+ ### Pretraining
+ 
+ 
+ The model was trained on 8 NVIDIA DGX A100 GPUs for 1,500,000 steps with a batch size of 256. The sequence length was limited to 512 tokens. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
+ learning rate warmup for 10,000 steps, and linear decay of the learning rate after.
+ 
  ## Intended uses & limitations
 
  You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
@@ -72,33 +99,6 @@ encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
  ```
 
- ## Training data
-
- The BatteryOnlyBERT model was pretrained on the full text of battery papers only. The paper corpus contains a total of 400,366 battery research papers that are published from 2000 to June 2021, from the publishers Royal Society of Chemistry (RSC), Elsevier, and Springer. The list of DOIs can be found at [Github](https://github.com/ShuHuang/batterybert/blob/main/corpus.txt).
-
- ## Training procedure
-
- ### Preprocessing
-
- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 28,996. The inputs of the model are
- then of the form:
-
- ```
- [CLS] Sentence A [SEP] Sentence B [SEP]
- ```
-
- The details of the masking procedure for each sentence are the following:
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- - In the 10% remaining cases, the masked tokens are left as is.
-
- ### Pretraining
-
-
- The model was trained on 8 NVIDIA DGX A100 GPUs for 1,500,000 steps with a batch size of 256. The sequence length was limited to 512 tokens. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
- learning rate warmup for 10,000 steps and linear decay of the learning rate after.
-
  ## Evaluation results
 
  Final loss: 1.0614.
 
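The hunk context above shows only the tail of the card's usage snippet. Below is a minimal self-contained sketch of that TensorFlow pattern; the checkpoint id `batterydata/batteryonlybert-uncased`, the `Auto*` class names, and the example sentence are assumptions, not taken from the diff.

```python
# Minimal sketch of the usage pattern whose last lines appear in the hunk
# context above. The checkpoint id and the Auto* classes are assumptions.
from transformers import AutoTokenizer, TFAutoModel

model_name = "batterydata/batteryonlybert-uncased"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)

text = "The cathode material shows good capacity retention after 500 cycles."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(output.last_hidden_state.shape)
```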
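The card notes that the raw model can be used for masked language modeling. A hedged fill-mask sketch, again assuming the `batterydata/batteryonlybert-uncased` checkpoint id and an invented example sentence:

```python
# Fill-mask with the raw (not fine-tuned) model; the checkpoint id and the
# sentence are assumptions for illustration only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="batterydata/batteryonlybert-uncased")
for prediction in fill_mask("The anode of a lithium-ion battery is typically made of [MASK]."):
    print(prediction["token_str"], prediction["score"])
```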
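The context lines in the first hunk mention training a standard classifier on the features produced by the BERT model. One possible reading of that, sketched with scikit-learn on top of pooled `[CLS]` embeddings; the checkpoint id, sentences, and labels are placeholders:

```python
# Hypothetical "BERT features as classifier inputs" sketch; the checkpoint id,
# sentences, and labels are placeholders, not from the card.
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, TFAutoModel

model_name = "batterydata/batteryonlybert-uncased"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)

sentences = [
    "The electrolyte decomposes above 4.5 V.",
    "The pouch cell retained 95% of its capacity after 1000 cycles.",
]
labels = [0, 1]  # placeholder sentence labels

encoded = tokenizer(sentences, padding=True, return_tensors="tf")
features = model(encoded).last_hidden_state[:, 0, :].numpy()  # [CLS] embeddings

classifier = LogisticRegression().fit(features, labels)
print(classifier.predict(features))
```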
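The masking procedure described in the card (15% of tokens selected; 80% replaced by `[MASK]`, 10% by a random token, 10% left unchanged) matches the generic dynamic-masking collator in `transformers`. The sketch below uses that collator, not the authors' actual pretraining code; the tokenizer id and the sentence are assumptions.

```python
# Generic illustration of the 15% / 80-10-10 masking rule via the standard
# Hugging Face collator; this is not the authors' pretraining code.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("batterydata/batteryonlybert-uncased")  # assumed id
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% of tokens are selected for masking
    return_tensors="tf",
)

# Of the selected tokens, 80% become [MASK], 10% become a random token, and
# 10% are left as is; unselected positions receive the ignore label -100.
batch = collator([tokenizer("LiFePO4 is a common cathode material.")])
print(batch["input_ids"])
print(batch["labels"])
```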
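The pretraining hyperparameters stated in the card (Adam, learning rate 1e-4, beta1 = 0.9, beta2 = 0.999, weight decay 0.01, 10,000 warmup steps with linear decay, 1,500,000 steps total) can be written out as a schedule sketch. The code below is a generic PyTorch illustration of that schedule, not the authors' training setup.

```python
# Generic sketch of the stated optimizer and learning-rate schedule; the
# dummy parameter is a stand-in for the model's parameters.
import torch
from transformers import get_linear_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder for model parameters
optimizer = torch.optim.AdamW(
    params,
    lr=1e-4,              # learning rate from the card
    betas=(0.9, 0.999),   # beta_1 and beta_2 from the card
    weight_decay=0.01,    # weight decay from the card
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,       # linear warmup for 10,000 steps
    num_training_steps=1_500_000,  # then linear decay over the remaining steps
)
```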