umarbutler committed on
Commit 365a854
• 1 Parent(s): 5065887

Update README.md

Files changed (1):
  1. README.md +121 -44

README.md CHANGED
@@ -1,52 +1,129 @@
  ---
- license: mit
  base_model: gpt2-xl
  tags:
  - generated_from_trainer
  model-index:
  - name: open-australian-legal-llm
- results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # open-australian-legal-llm
-
- This model is a fine-tuned version of [gpt2-xl](https://huggingface.co/gpt2-xl) on an unknown dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 4.255e-05
- - train_batch_size: 3
- - eval_batch_size: 3
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - num_epochs: 1
-
- ### Training results
-
-
- ### Framework versions
-
- - Transformers 4.36.0
- - Pytorch 2.0.1
- - Datasets 2.15.0
- - Tokenizers 0.15.0
  ---
+ language:
+ - en
+ license: apache-2.0
+ library_name: transformers
  base_model: gpt2-xl
  tags:
+ - law
+ - legal
+ - australia
  - generated_from_trainer
+ datasets:
+ - umarbutler/open-australian-legal-corpus
+ widget:
+ - text: 'Under the Crimes Act'
+ - text: 'A restraint of trade is'
+ - text: 'Section 51 of the Constitution provides'
+ - text: "'Unsatisfactory professional conduct' includes"
+ metrics:
+ - perplexity
  model-index:
  - name: open-australian-legal-llm
+   results:
+   - task:
+       type: text-generation
+       name: Text generation
+     dataset:
+       type: umarbutler/open-australian-legal-qa
+       name: Open Australian Legal QA
+       split: train
+       revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
+     metrics:
+     - type: perplexity
+       value: 8.015031389864035
+       name: Perplexity
+     source:
+       name: lmppl
+       url: https://github.com/asahi417/lmppl
  ---
 
+ # Open Australian Legal LLM ⚖️
+ The Open Australian Legal LLM is the largest open source language model trained on Australian law.
+
+ The model's size, at over 1.5 billion parameters, and the richness and quality of its training data, comprising roughly 70,000 laws, regulations and decisions across six Australian jurisdictions from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), make it well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including text generation, text completion and question answering.
+
+ To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).
+
+ ## Usage 👩‍💻
+ The code snippet below demonstrates just one of the many ways in which the model may be accessed:
+ ```python
+ >>> from transformers import pipeline, set_seed
+
+ >>> set_seed(42) # We set a seed for reproducibility.
+ >>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-llm')
+
+ >>> response = generator('Section 51 of the Constitution provides', max_length=55)
+ >>> print(response[0]['generated_text'])
+ ```
+
+ ## Creation 🧪
+ The following cleaning procedures were applied to all 218,340 laws, regulations and decisions in version 4.2.0 of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus):
+ 1. Non-breaking spaces were replaced with regular spaces;
+ 2. Carriage returns followed by newlines were replaced with newlines;
+ 3. Whitespace was removed from lines comprised entirely of whitespace;
+ 4. Newlines and whitespace preceding newlines were removed from the end of texts;
+ 5. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
+ 6. Spaces and tabs were removed from the end of lines.
+
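The cleaning and filtering steps above can be sketched in Python as follows. This is a minimal illustration of the procedure described, not the actual script used to build the corpus; in particular, the corpus used XXH3 128-bit hashes for deduplication, for which `hashlib.md5` is only a stand-in here:

```python
import hashlib
import re

def clean(text: str) -> str:
    """Apply the six cleaning steps described above to one document."""
    text = text.replace('\xa0', ' ')                          # 1. non-breaking spaces -> spaces
    text = text.replace('\r\n', '\n')                         # 2. CRLF -> LF
    text = re.sub(r'^[ \t]+$', '', text, flags=re.MULTILINE)  # 3. empty whitespace-only lines
    text = re.sub(r'\s+\Z', '', text)                         # 4. trailing whitespace and newlines
    text = re.sub(r'\A(\s*\n)+', '', text)                    # 5. leading newlines
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)   # 6. trailing spaces/tabs per line
    return text

def deduplicate(texts: list[str], min_chars: int = 128) -> list[str]:
    """Drop texts under `min_chars` characters and exact duplicates."""
    seen, kept = set(), []
    for text in texts:
        digest = hashlib.md5(text.encode()).digest()  # stand-in for XXH3-128
        if len(text) >= min_chars and digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

print(clean('Short title\xa0and commencement \r\n\t\n'))
```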
+ After cleaning, texts with fewer than 128 characters and those with duplicate XXH3 128-bit hashes were removed, leaving 218,207 documents. These documents were then used to pretrain a [GPT2](https://huggingface.co/gpt2-xl)-like tokenizer, after which they were split into blocks 512-tokens-long, with the tokenizer's end-of-sequence token ('<|endoftext|>') being used as a delimiter as well as to pad the end of the final block. An attention mask was applied to the end-of-sequence tokens used as padding, barring the first such token. The resulting blocks were subsequently randomly shuffled and split into a training dataset of 1,966,867 chunks and a validation dataset of 218,541 chunks.
+
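The blocking scheme can be illustrated with plain token-id lists. This is a hedged sketch only: the real pipeline used the GPT2-style tokenizer trained above and 512-token blocks, while the block length here is shortened for readability:

```python
def make_blocks(token_ids: list[int], block_size: int, eos_id: int):
    """Split a token stream into fixed-size blocks, padding the final block
    with the end-of-sequence token. Padding EOS tokens are masked out of the
    attention mask, except the first one (as described above)."""
    blocks, masks = [], []
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        pad = block_size - len(block)
        # Attend to all real tokens and the first padding EOS token only.
        mask = [1] * len(block) + ([1] + [0] * (pad - 1) if pad else [])
        blocks.append(block + [eos_id] * pad)
        masks.append(mask)
    return blocks, masks

# Toy example: ten tokens, four-token blocks, GPT2's EOS id (50256).
blocks, masks = make_blocks(list(range(10)), block_size=4, eos_id=50256)
print(blocks[-1])  # [8, 9, 50256, 50256]
print(masks[-1])   # [1, 1, 1, 0]
```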
+ [GPT2-XL](https://huggingface.co/gpt2-xl) was used as a base model. Input embeddings for tokens shared between the vocabulary trained on the Corpus and that of [GPT2](https://huggingface.co/gpt2-xl) were preserved but moved to their new positions. Embeddings for tokens unique to the new vocabulary were set to the average of the base model's embedding weights.
+
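The embedding transfer can be sketched as below, with toy vocabularies and a tiny weight matrix standing in for GPT2-XL's actual embeddings:

```python
import numpy as np

def transfer_embeddings(old_embed: np.ndarray,
                        old_vocab: dict[str, int],
                        new_vocab: dict[str, int]) -> np.ndarray:
    """Build an embedding matrix for a new vocabulary: tokens shared with the
    old vocabulary keep their old vectors (moved to their new rows), while
    tokens unique to the new vocabulary get the mean of the old weights."""
    mean_vector = old_embed.mean(axis=0)
    new_embed = np.tile(mean_vector, (len(new_vocab), 1))
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            new_embed[new_id] = old_embed[old_vocab[token]]
    return new_embed

old_vocab = {'the': 0, 'law': 1, 'of': 2}
new_vocab = {'law': 0, 'statute': 1, 'the': 2}
old_embed = np.arange(12, dtype=float).reshape(3, 4)
new_embed = transfer_embeddings(old_embed, old_vocab, new_vocab)
print(new_embed)
```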
+ The model was trained with the following hyperparameters for the first 100,290 steps:
+
+ | Hyperparameter | Value |
+ | --- | --- |
+ | Sequence length | 512 |
+ | Epochs | 1 |
+ | Optimiser | AdamW |
+ | Learning rate | 1e-4 |
+ | Learning rate scheduler | Linear with warmup |
+ | Batch size | 6 |
+ | Weight decay | 0.01 |
+ | Warmup ratio | 0.06 |
+
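The 'linear with warmup' schedule in the table can be sketched as follows. This mirrors the standard linear-warmup, linear-decay schedule (as in Hugging Face's `get_linear_schedule_with_warmup`) rather than reproducing the actual training script, and uses an illustrative total step count:

```python
def linear_warmup_lr(step: int, total_steps: int,
                     peak_lr: float = 1e-4, warmup_ratio: float = 0.06) -> float:
    """Learning rate at a given step: a linear ramp from 0 to peak_lr over
    the warmup phase, then a linear decay back to 0 by the final step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 100_000  # illustrative, not the actual run length
print(linear_warmup_lr(3_000, total))   # mid-warmup: half the peak rate
print(linear_warmup_lr(6_000, total))   # end of warmup: the peak rate
print(linear_warmup_lr(53_000, total))  # halfway through the decay
```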
+ After training on two RTX A6000s for ~120,050 steps over a period of 91 hours, the [vast.ai](https://vast.ai) instance hosting the model crashed. Fortunately, a checkpoint had been saved at step 100,290 (~60% of an epoch), although the optimiser's state was mistakenly not downloaded. The model was subsequently moved to a new instance, where it was trained on an L40 for a further 133,711 steps (~40% of an epoch) with the following hyperparameters (changes emphasised):
+
+ | Hyperparameter | Value |
+ | --- | --- |
+ | Sequence length | 512 |
+ | Epochs | 1 |
+ | Optimiser | AdamW |
+ | Learning rate | *4.255e-5* |
+ | Learning rate scheduler | *Linear* |
+ | Batch size | *3* |
+ | Weight decay | 0.01 |
+ | Warmup ratio | *0.00* |
+
+ Naturally, as the optimiser's state had been lost, the model's learning rate descended more slowly than it had previously. Nevertheless, after completing an epoch of training, the model achieved a validation loss of 2.04.
+
+ ## Limitations 🚧
+ Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the biases of [GPT2-XL](https://huggingface.co/gpt2-xl).
+
+ One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material), as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).
+
+ Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.
+
+ ## Licence 📜
+ To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).
+
+ ## Citation 🔖
+ If you've relied on the model for your work, please cite:
+ ```bibtex
+ @misc{butler-2023-open-australian-legal-llm,
+     author = {Butler, Umar},
+     year = {2023},
+     title = {Open Australian Legal LLM},
+     publisher = {Hugging Face},
+     version = {1.0.0},
+     url = {https://huggingface.co/umarbutler/open-australian-legal-llm}
+ }
+ ```
+
+ ## Acknowledgements 🙏
+ In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.
+
+ The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.
+
+ The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [GPT2](https://huggingface.co/gpt2-xl), which the model was built atop.
+
+ Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.