zlucia commited on
Commit
b389a58
·
1 Parent(s): 4794aa8

Update Preprocessing section

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -93,7 +93,7 @@ The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset co
93
 
94
  ## Training procedure
95
  ### Preprocessing
96
- The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to Pile of Law using the [HuggingFace WordPiece tokenizer](https://github.com/huggingface/tokenizers) and 3,000 randomly sampled legal terms from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The 80-10-10 masking, corruption, leave split, as in [BERT](https://arxiv.org/abs/1810.04805), is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the [LexNLP sentence segmenter](https://github.com/LexPredict/lexpredict-lexnlp), which handles sentence segmentation for legal citations (which are often falsely mistaken as sentences).
97
 
98
  ### Pretraining
99
  The model was trained on a SambaNova cluster, with 8 RDUs, for 1.7 million steps. We used a smaller learning rate of 5e-6 and batch size of 128, to mitigate training instability, potentially due to the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with 512 length sequence lengths for all steps.
 
93
 
94
  ## Training procedure
95
  ### Preprocessing
96
+ The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to Pile of Law using the [HuggingFace WordPiece tokenizer](https://github.com/huggingface/tokenizers) and 3,000 randomly sampled legal terms from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The 80-10-10 masking, corruption, leave split, as in [BERT](https://arxiv.org/abs/1810.04805), is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the [LexNLP sentence segmenter](https://github.com/LexPredict/lexpredict-lexnlp), which handles sentence segmentation for legal citations (which are often falsely mistaken as sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences such that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
97
 
98
  ### Pretraining
99
  The model was trained on a SambaNova cluster, with 8 RDUs, for 1.7 million steps. We used a smaller learning rate of 5e-6 and batch size of 128, to mitigate training instability, potentially due to the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with 512 length sequence lengths for all steps.