Add usage examples
README.md (CHANGED)

---
language:
- en
datasets:
- pile-of-law/pile-of-law
pipeline_tag: fill-mask
---

Pretrained model on English language legal and administrative text using the [RoBERTa](https://arxiv.org/abs/1907.11692) pretraining objective.

## Model description

Pile of Law BERT large is a transformers model with the [BERT large model (uncased)](https://huggingface.co/bert-large-uncased) architecture pretrained on the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law), a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining.

## Intended uses & limitations

You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since this model was pretrained on an English language legal and administrative text corpus, legal downstream tasks will likely be more in-domain for this model.

## How to use

You can use the model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.6343119740486145,
  'token': 1151,
  'token_str': 'appeal'},
 {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.10488124936819077,
  'token': 3542,
  'token_str': 'objection'},
 {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.0708756372332573,
  'token': 1999,
  'token_str': 'application'},
 {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.02558572217822075,
  'token': 3677,
  'token_str': 'example'},
 {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.013266939669847488,
  'token': 1347,
  'token_str': 'action'}]
```

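The pipeline hides the scoring step. If you want the raw predictions yourself, the following is a minimal sketch (not part of the original card) that reproduces the fill-mask scores with `BertForMaskedLM`; the top-5 loop and variable names are illustrative.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertForMaskedLM.from_pretrained('pile-of-law/legalbert-large-1.7M-1')

text = ("An [MASK] is a request made after a trial by a party that has lost on one "
        "or more issues that a higher court review the decision to determine if it was correct.")
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and turn the logits at that position into probabilities
mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_positions[0]].softmax(dim=-1)

# print the five highest-scoring fillers, mirroring the 'score'/'token_str' fields above
top = probs.topk(5)
for score, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(tokenizer.convert_ids_to_tokens(token_id), round(score, 4))
```
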
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

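As noted under Intended uses & limitations, the model can also be fine-tuned for a downstream task. The block below is a minimal, hypothetical sketch of fine-tuning for sequence classification with the `Trainer` API; it is not part of the original card, and the toy dataset, label count, and hyperparameters are placeholders you would replace with a real labeled legal corpus.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
# num_labels=2 is a placeholder for a hypothetical binary legal classification task;
# the classification head is newly initialized and must be trained
model = AutoModelForSequenceClassification.from_pretrained(
    'pile-of-law/legalbert-large-1.7M-1', num_labels=2)

# toy examples standing in for a real labeled dataset
train_data = Dataset.from_dict({
    "text": ["The motion to dismiss is granted.", "The appeal is denied."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="legalbert-finetuned",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()
```
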
## Limitations and bias

Please see Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.

This model can have biased predictions. In the following example, where the model is used with a pipeline for masked language modeling, the model predicts a higher score for "black" than for "white" as the race descriptor of the criminal.

```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("The clerk described the robber as a “thin [MASK] male, about six foot tall, wearing a gray hoodie, blue jeans", targets=["black", "white"])

[{'sequence': 'the clerk described the robber as a thin black male, about six foot tall, wearing a gray hoodie, blue jeans',
  'score': 0.0013972163433209062,
  'token': 4311,
  'token_str': 'black'},
 {'sequence': 'the clerk described the robber as a thin white male, about six foot tall, wearing a gray hoodie, blue jeans',
  'score': 0.0009401230490766466,
  'token': 4249,
  'token_str': 'white'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, etc. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
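
If you want to inspect the pretraining corpus itself, the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law) dataset is hosted on the Hugging Face Hub. A minimal sketch for streaming a single subset is shown below; the subset name is an illustrative assumption, so check the dataset card for the available configurations.

```python
from datasets import load_dataset

# stream one subset rather than downloading the full ~256GB corpus;
# "r_legaladvice" is an assumed configuration name -- see the dataset card for the full list.
# Depending on your datasets version, you may also need to pass trust_remote_code=True.
dataset = load_dataset("pile-of-law/pile-of-law", "r_legaladvice",
                       split="train", streaming=True)

# print the first streamed example to inspect its fields
for example in dataset.take(1):
    print(example)
```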