Add usage examples
README.md (CHANGED)

---
language:
- en
datasets:
- pile-of-law/pile-of-law
pipeline_tag: fill-mask
---

Pretrained model on English language legal and administrative text using the [RoBERTa](https://arxiv.org/abs/1907.11692) pretraining objective.

## Model description

Pile of Law BERT large is a transformers model with the [BERT large model (uncased)](https://huggingface.co/bert-large-uncased) architecture pretrained on the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law), a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining.

## Intended uses & limitations

You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since this model was pretrained on an English language legal and administrative text corpus, legal downstream tasks will likely be more in-domain for this model.

## How to use

You can use the model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.6343119740486145,
  'token': 1151,
  'token_str': 'appeal'},
 {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.10488124936819077,
  'token': 3542,
  'token_str': 'objection'},
 {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.0708756372332573,
  'token': 1999,
  'token_str': 'application'},
 {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.02558572217822075,
  'token': 3677,
  'token_str': 'example'},
 {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.013266939669847488,
  'token': 1347,
  'token_str': 'action'}]
```

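The pipeline hides the scoring step. If you want the raw predictions yourself, the following is a minimal sketch (not part of the original card) that reproduces the fill-mask scores with `BertForMaskedLM`; the top-5 loop and variable names are illustrative.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertForMaskedLM.from_pretrained('pile-of-law/legalbert-large-1.7M-1')

text = ("An [MASK] is a request made after a trial by a party that has lost on one "
        "or more issues that a higher court review the decision to determine if it was correct.")
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and turn the logits at that position into probabilities
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_positions[0]].softmax(dim=-1)

# print the five highest-scoring fillers, mirroring the 'score'/'token_str' fields above
top = probs.topk(5)
for score, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens(token_id), round(score, 4))
```
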
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

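As noted under Intended uses & limitations, the model can also be fine-tuned for a downstream task. The block below is a minimal, hypothetical sketch of fine-tuning for sequence classification with the `Trainer` API; it is not part of the original card, and the toy dataset, label count, and hyperparameters are placeholders you would replace with a real labeled legal corpus.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
# num_labels=2 is a placeholder for a hypothetical binary legal classification task;
# the classification head is newly initialized and must be trained
model = AutoModelForSequenceClassification.from_pretrained(
    'pile-of-law/legalbert-large-1.7M-1', num_labels=2)

# toy examples standing in for a real labeled dataset
train_data = Dataset.from_dict({
    "text": ["The motion to dismiss is granted.", "The appeal is denied."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="legalbert-finetuned",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```
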
## Limitations and bias

Please see Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.

This model can have biased predictions. In the following example, where the model is used with a pipeline for masked language modeling, the model predicts a higher score for "black" than for "white" as the race descriptor of the criminal.

```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("The clerk described the robber as a “thin [MASK] male, about six foot tall, wearing a gray hoodie, blue jeans", targets=["black", "white"])

[{'sequence': 'the clerk described the robber as a thin black male, about six foot tall, wearing a gray hoodie, blue jeans',
  'score': 0.0013972163433209062,
  'token': 4311,
  'token_str': 'black'},
 {'sequence': 'the clerk described the robber as a thin white male, about six foot tall, wearing a gray hoodie, blue jeans',
  'score': 0.0009401230490766466,
  'token': 4249,
  'token_str': 'white'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, etc. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
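
If you want to inspect the pretraining corpus itself, the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law) dataset is hosted on the Hugging Face Hub. A minimal sketch for streaming a single subset is shown below; the subset name is an illustrative assumption, so check the dataset card for the available configurations.

```python
from datasets import load_dataset

# stream one subset rather than downloading the full ~256GB corpus;
# "r_legaladvice" is an assumed configuration name -- see the dataset card for the full list.
# Depending on your datasets version, you may also need to pass trust_remote_code=True.
dataset = load_dataset("pile-of-law/pile-of-law", "r_legaladvice",
                       split="train", streaming=True)

# print the first streamed example to inspect its fields
for example in dataset.take(1):
    print(example)
```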