---
language:
- en
license: apache-2.0
library_name: transformers
base_model: gpt2-xl
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
widget:
- text: 'Under the Crimes Act'
- text: 'A restraint of trade is'
- text: 'Section 51 of the Constitution provides'
- text: "'Unsatisfactory professional conduct' includes"
metrics:
- perplexity
model-index:
- name: open-australian-legal-llm
  results:
  - task:
      type: text-generation
      name: Text generation
    dataset:
      type: umarbutler/open-australian-legal-qa
      name: Open Australian Legal QA
      split: train
      revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
    metrics:
      - type: perplexity
        value: 8.015031389864035
        name: Perplexity
    source:
      name: lmppl
      url: https://github.com/asahi417/lmppl
---

# Open Australian Legal LLM ⚖️
The Open Australian Legal LLM is the largest open-source language model trained on Australian law.

The model has over 1.5 billion parameters. Its size, together with the richness and quality of its training data (roughly 70,000 laws, regulations and decisions across six Australian jurisdictions from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus)), makes it well suited to finetuning on a diverse range of downstream natural language processing tasks in the Australian legal domain, including text generation, text completion and question answering.

To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Usage 👩‍💻
The code snippet below demonstrates just one of the many ways in which the model may be accessed:
```python
>>> from transformers import pipeline, set_seed

>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-llm')

>>> response = generator('Section 51 of the Constitution provides', max_length=55)
>>> print(response[0]['generated_text'])
```

## Creation 🧪
The following cleaning procedures were applied to all 218,340 laws, regulations and decisions in version 4.2.0 of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus):
1. Non-breaking spaces were replaced with regular spaces;
1. Carriage returns followed by newlines were replaced with newlines;
1. Whitespace was removed from lines consisting entirely of whitespace;
1. Newlines and whitespace preceding newlines were removed from the end of texts;
1. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
1. Spaces and tabs were removed from the end of lines.
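
The cleaning steps above can be sketched roughly as follows. This is one reading of the list rather than the author's actual code: the function name `clean_text` and the exact regular expressions are assumptions made for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Apply the six cleaning steps in the order listed (illustrative sketch)."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace("\u00a0", " ")
    # 2. Replace carriage returns followed by newlines with newlines.
    text = text.replace("\r\n", "\n")
    # 3. Empty out lines that consist entirely of whitespace.
    text = re.sub(r"^[^\S\n]+$", "", text, flags=re.MULTILINE)
    # 4. Remove newlines, and whitespace preceding them, from the end of the text.
    text = re.sub(r"\s*\n\Z", "", text)
    # 5. Remove newlines, and whitespace succeeding them, from the start of the text.
    text = re.sub(r"\A\n\s*", "", text)
    # 6. Remove spaces and tabs from the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


sample = "\n \u00a0\n  Section 1\u00a0Title \t\r\n   \r\nBody text.  \n\n \n"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that, under this reading, step 5 also strips indentation that immediately follows a leading newline.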

After cleaning, texts with fewer than 128 characters and those with duplicate XXH3 128-bit hashes were removed, leaving 218,207 documents. These documents were then used to pretrain a [GPT2](https://huggingface.co/gpt2-xl)-like tokenizer, after which they were split into 512-token blocks, with the tokenizer's end-of-sequence token ('<|endoftext|>') used both as a delimiter and to pad the end of the final block. An attention mask was applied to the end-of-sequence tokens used as padding, barring the first such token. The resulting blocks were then randomly shuffled and split into a training dataset of 1,966,867 chunks and a validation dataset of 218,541 chunks.
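
A minimal sketch of the blocking step follows, assuming the tokenized documents are concatenated into a single stream before being cut into blocks. The function name `build_blocks` and the toy token IDs are hypothetical. The sketch treats the end-of-sequence token that closes the last document as the "first" such token, which stays attended, and masks only the padding that follows it.

```python
from typing import Dict, List


def build_blocks(
    token_docs: List[List[int]], eos_id: int, block_size: int = 512
) -> List[Dict[str, List[int]]]:
    """Join documents with an EOS delimiter and cut the stream into fixed-size blocks."""
    stream: List[int] = []
    for ids in token_docs:
        stream.extend(ids)
        stream.append(eos_id)  # EOS delimits the end of each document.

    blocks = []
    for start in range(0, len(stream), block_size):
        ids = stream[start:start + block_size]
        mask = [1] * len(ids)  # Real tokens (and delimiter EOS tokens) are attended.
        padding = block_size - len(ids)
        if padding:
            # Only the final block can be short: pad it with EOS and mask the padding.
            ids = ids + [eos_id] * padding
            mask = mask + [0] * padding
        blocks.append({"input_ids": ids, "attention_mask": mask})
    return blocks


# Toy example with a block size of 4 instead of 512.
blocks = build_blocks([[1, 2, 3], [4, 5]], eos_id=0, block_size=4)
for block in blocks:
    print(block)
```

In the toy example, the second block ends with one attended end-of-sequence token (the delimiter) followed by one masked end-of-sequence token (the padding).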

[GPT2-XL](https://huggingface.co/gpt2-xl) was used as a base model. Input embeddings for tokens shared between the vocabulary trained on the Corpus and that of [GPT2](https://huggingface.co/gpt2-xl) were preserved but moved to their new positions. Embeddings for unique tokens were set to the average embedding weights.
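
The embedding transfer described above can be illustrated with plain Python lists. This is a toy sketch, not the author's code: `remap_embeddings` is a hypothetical helper, the new vocabulary is assumed to use contiguous IDs starting at 0, and "average embedding weights" is assumed to mean the mean over all rows of the original embedding matrix.

```python
from typing import Dict, List


def remap_embeddings(
    old_vocab: Dict[str, int],
    old_embeddings: List[List[float]],
    new_vocab: Dict[str, int],
) -> List[List[float]]:
    """Build an embedding matrix for new_vocab from the rows of old_embeddings."""
    dim = len(old_embeddings[0])
    count = len(old_embeddings)
    # Mean embedding across the whole original vocabulary.
    mean = [sum(row[d] for row in old_embeddings) / count for d in range(dim)]

    # Start every new token at the mean, then overwrite the shared tokens.
    new_embeddings = [list(mean) for _ in range(len(new_vocab))]
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Shared token: keep its embedding, moved to its new position.
            new_embeddings[new_id] = list(old_embeddings[old_id])
    return new_embeddings


old_vocab = {"a": 0, "b": 1, "c": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = {"b": 0, "law": 1, "a": 2}  # "law" does not exist in the old vocabulary.

new_embeddings = remap_embeddings(old_vocab, old_embeddings, new_vocab)
print(new_embeddings)
```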

The model was trained with the following hyperparameters for the first 100,290 steps:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 1e-4 |
| Learning rate scheduler | Linear with warmup |
| Batch size | 6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
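
To make the "linear with warmup" schedule concrete, the sketch below computes the learning rate at a given step from the peak learning rate and warmup ratio in the table. The function name `linear_warmup_lr` is hypothetical, and `total_steps=1000` is a made-up value chosen for readability, since the planned total number of steps is not stated here.

```python
import math


def linear_warmup_lr(
    step: int, total_steps: int, peak_lr: float = 1e-4, warmup_ratio: float = 0.06
) -> float:
    """Learning rate that rises linearly to peak_lr, then decays linearly to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from the peak down to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


total_steps = 1000  # Hypothetical; warmup then lasts 60 steps.
for step in (0, 30, 60, 530, 1000):
    print(step, linear_warmup_lr(step, total_steps))
```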

After training on two RTX A6000s for \~120,050 steps over a period of 91 hours, the [vast.ai](https://vast.ai) instance hosting the model crashed. Fortunately, a checkpoint had been saved at step 100,290 (~60% of an epoch), although the optimiser's state was mistakenly not downloaded. The model was subsequently moved to a new instance where it was trained on an L40 for a further 133,711 steps (~40% of an epoch) with the following hyperparameters (changes emphasised):
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | *4.255e-5* |
| Learning rate scheduler | *Linear* |
| Batch size | *3* |
| Weight decay | 0.01 |
| Warmup ratio | *0.00* |

Naturally, as the optimiser state had been lost, the model's learning rate descended more slowly than it had previously. Nevertheless, after completing an epoch of training, the model achieved a validation loss of 2.04.
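
For readers more used to perplexity than loss, the short sketch below converts one into the other. It assumes the reported validation loss is the mean per-token cross-entropy in nats, in which case perplexity is simply its exponential. The result describes the validation split only, and is not directly comparable to the perplexity in the model card metadata, which was measured on a different dataset.

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


validation_perplexity = perplexity_from_loss(2.04)
print(round(validation_perplexity, 2))  # 7.69
```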

## Limitations 🚧
Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the same biases as [GPT2-XL](https://huggingface.co/gpt2-xl).

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.

## Licence 📜
To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-llm,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal LLM},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-llm}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [GPT2](https://huggingface.co/gpt2-xl), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.