---
language: en
pipeline_tag: fill-mask
license: cc-by-sa-4.0
tags:
- legal
model-index:
- name: lexlms/legal-roberta-large
  results: []
widget:
- text: "The applicant submitted that her husband was subjected to treatment amounting to <mask> whilst in the custody of police."
- text: "This <mask> Agreement is between General Motors and John Murray."
- text: "Establishing a system for the identification and registration of <mask> animals and regarding the labelling of beef and beef products."
- text: "Because the Court granted <mask> before judgment, the Court effectively stands in the shoes of the Court of Appeals and reviews the defendants’ appeals."
datasets:
- lexlms/lex_files
---

# LexLM large

This model was obtained by continuing the pre-training of RoBERTa large (https://huggingface.co/roberta-large) on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles).
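Since the card declares the `fill-mask` pipeline tag, a minimal usage sketch may help. It assumes the checkpoint is published under the `lexlms/legal-roberta-large` identifier listed in the metadata above and that the `transformers` library is installed; the helper names `load_fill_mask`, `format_predictions`, and `main` are illustrative, not part of any released code.

```python
MODEL_ID = "lexlms/legal-roberta-large"  # assumed Hub identifier, taken from the card metadata


def load_fill_mask(model_id=MODEL_ID):
    """Build a fill-mask pipeline (downloads the checkpoint on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("fill-mask", model=model_id)


def format_predictions(predictions):
    """Turn fill-mask pipeline output into (token, score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # RoBERTa-style tokens often carry a leading space, so strip it for display.
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked]


def main():
    """Print the top candidates for one of the widget examples."""
    fill_mask = load_fill_mask()
    text = "This <mask> Agreement is between General Motors and John Murray."
    for token, score in format_predictions(fill_mask(text, top_k=5)):
        print(f"{token}\t{score}")
```

Calling `main()` downloads the model weights and prints five candidate tokens with their scores; no particular output is shown here because it depends on the checkpoint.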

## Model description

LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
* We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
* We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
* We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate (Wettig et al., 2022) for the base/large models, respectively.
* We use a sentence sampler with exponential smoothing of the sub-corpora sampling rate, following Conneau et al. (2019), since the proportion of tokens varies widely across sub-corpora and we aim to preserve per-corpus capacity (avoid overfitting).
* We consider mixed-cased models, similar to all recently developed large PLMs.
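The exponentially smoothed sampling mentioned in the list above can be sketched as follows. In the formulation of Conneau et al. (2019), each sub-corpus with raw token share `p_i` is sampled with probability proportional to `p_i ** alpha`. The value of `alpha` is not stated in this card, so `alpha=0.5` below is an assumption, and the sub-corpus names and token counts are hypothetical.

```python
def smoothed_sampling_rates(token_counts, alpha=0.5):
    """Return per-sub-corpus sampling probabilities proportional to p_i ** alpha.

    alpha=0.5 is an assumed value; the model card does not specify it.
    """
    total = sum(token_counts.values())
    # Raw rates: share of all tokens contributed by each sub-corpus.
    raw = {name: count / total for name, count in token_counts.items()}
    # Exponentiate, then renormalise; alpha < 1 flattens the distribution,
    # so small sub-corpora are sampled more often than their raw share.
    powered = {name: p ** alpha for name, p in raw.items()}
    norm = sum(powered.values())
    return {name: value / norm for name, value in powered.items()}


# Hypothetical token counts, for illustration only.
counts = {"large_corpus": 9_000_000, "small_corpus": 1_000_000}
rates = smoothed_sampling_rates(counts, alpha=0.5)
print(rates)
```

With these made-up counts, the smaller sub-corpus rises from a 10% raw share to roughly a 25% sampling rate, which is the up-weighting effect the smoothing is meant to achieve.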

## Intended uses & limitations

More information needed

## Training and evaluation data

The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lexfiles). For evaluation results, please see our paper "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis* et al., 2023).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- training_steps: 1000000
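The derived quantities in the list above follow from simple arithmetic, sketched below. The effective batch size is the per-device batch times the number of devices times the accumulation steps, and the warmup length is the warmup ratio times the total steps. The `lr_at` function is a hypothetical helper that assumes the common linear-warmup-then-cosine-decay shape; the exact scheduler implementation used for training is not described in this card.

```python
import math

# Values copied from the hyperparameter list.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 8
EVAL_BATCH_SIZE = 8
NUM_DEVICES = 8
GRAD_ACCUM_STEPS = 4
TRAINING_STEPS = 1_000_000
WARMUP_RATIO = 0.05

# Effective batch sizes across all devices.
total_train_batch = TRAIN_BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS
total_eval_batch = EVAL_BATCH_SIZE * NUM_DEVICES
warmup_steps = round(WARMUP_RATIO * TRAINING_STEPS)


def lr_at(step):
    """Learning rate at a given step, assuming linear warmup then cosine decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    progress = (step - warmup_steps) / (TRAINING_STEPS - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch, total_eval_batch, warmup_steps)
```

Under these assumptions the peak learning rate of 0.0001 is reached at the end of warmup and decays to zero at the final step.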

### Training results

| Training Loss | Epoch | Step    | Validation Loss |
|:-------------:|:-----:|:-------:|:---------------:|
| 1.1322        | 0.05  | 50000   | 0.8690          |
| 1.0137        | 0.1   | 100000  | 0.8053          |
| 1.0225        | 0.15  | 150000  | 0.7951          |
| 0.9912        | 0.2   | 200000  | 0.7786          |
| 0.976         | 0.25  | 250000  | 0.7648          |
| 0.9594        | 0.3   | 300000  | 0.7550          |
| 0.9525        | 0.35  | 350000  | 0.7482          |
| 0.9152        | 0.4   | 400000  | 0.7343          |
| 0.8944        | 0.45  | 450000  | 0.7245          |
| 0.893         | 0.5   | 500000  | 0.7216          |
| 0.8997        | 1.02  | 550000  | 0.6843          |
| 0.8517        | 1.07  | 600000  | 0.6687          |
| 0.8544        | 1.12  | 650000  | 0.6624          |
| 0.8535        | 1.17  | 700000  | 0.6565          |
| 0.8064        | 1.22  | 750000  | 0.6523          |
| 0.7953        | 1.27  | 800000  | 0.6462          |
| 0.8051        | 1.32  | 850000  | 0.6386          |
| 0.8148        | 1.37  | 900000  | 0.6383          |
| 0.8004        | 1.42  | 950000  | 0.6408          |
| 0.8031        | 1.47  | 1000000 | 0.6314          |


### Framework versions

- Transformers 4.20.0
- Pytorch 1.12.0+cu102
- Datasets 2.7.0
- Tokenizers 0.12.0

## Citation

[*Ilias Chalkidis\*, Nicolas Garneau\*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard.*
*LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.*
*2023. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.*](https://arxiv.org/abs/2305.07507)
```
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and 
              Garneau*, Nicolas and
              Goanta, Catalina and 
              Katz, Daniel Martin and 
              Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}
```