Update Preprocessing section
Browse files
README.md
CHANGED
@@ -93,7 +93,7 @@ The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset co
|
|
93 |
|
94 |
## Training procedure
|
95 |
### Preprocessing
|
96 |
-
The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to Pile of Law using the [HuggingFace WordPiece tokenizer](https://github.com/huggingface/tokenizers) and 3,000 randomly sampled legal terms from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The 80-10-10 masking, corruption, leave split, as in [BERT](https://arxiv.org/abs/1810.04805), is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the [LexNLP sentence segmenter](https://github.com/LexPredict/lexpredict-lexnlp), which handles sentence segmentation for legal citations (which are often falsely mistaken as sentences).
|
97 |
|
98 |
### Pretraining
|
99 |
The model was trained on a SambaNova cluster, with 8 RDUs, for 1.7 million steps. We used a smaller learning rate of 5e-6 and batch size of 128, to mitigate training instability, potentially due to the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with 512 length sequence lengths for all steps.
|
|
|
93 |
|
94 |
## Training procedure
|
95 |
### Preprocessing
|
96 |
+
The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to Pile of Law using the [HuggingFace WordPiece tokenizer](https://github.com/huggingface/tokenizers) and 3,000 randomly sampled legal terms from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The 80-10-10 masking, corruption, leave split, as in [BERT](https://arxiv.org/abs/1810.04805), is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the [LexNLP sentence segmenter](https://github.com/LexPredict/lexpredict-lexnlp), which handles sentence segmentation for legal citations (which are often falsely mistaken as sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences such that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
|
97 |
|
98 |
### Pretraining
|
99 |
The model was trained on a SambaNova cluster, with 8 RDUs, for 1.7 million steps. We used a smaller learning rate of 5e-6 and batch size of 128, to mitigate training instability, potentially due to the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with 512 length sequence lengths for all steps.
|