Update README.md
README.md
---
language:
- pt
thumbnail: Portuguese T5 for the Legal Domain
tags:
- transformers
license: mit
pipeline_tag: summarization
---

[![INESC-ID](https://www.inesc-id.pt/wp-content/uploads/2019/06/INESC-ID-logo_01.png)](https://www.inesc-id.pt/projects/PR07005/)

[![A Semantic Search System for Supremo Tribunal de Justiça](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/_static/logo.png)](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/)

Work developed as part of [Project IRIS](https://www.inesc-id.pt/projects/PR07005/).

Thesis: [A Semantic Search System for Supremo Tribunal de Justiça](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/)

# stjiris/t5-portuguese-legal-summarization

T5 model fine-tuned from the `unicamp-dl/ptt5-base-portuguese-vocab` checkpoint.

It was trained on jurisprudence documents and their corresponding summaries.

## Usage (HuggingFace transformers)
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Hub ID from this card's title; a local folder such as "t5_summ_model" also works
model_checkpoint = "stjiris/t5-portuguese-legal-summarization"

# Run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t5_model = T5ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
t5_tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

# Text to summarize; the "summarize: " prefix tells T5 which task to perform
preprocess_text = "These are some big words and text and words and text, again, that we want to summarize"
t5_prepared_text = "summarize: " + preprocess_text

tokenized_text = t5_tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

# Summarize with beam search
summary_ids = t5_model.generate(
    tokenized_text,
    num_beams=4,
    no_repeat_ngram_size=2,
    min_length=512,
    max_length=1024,
    early_stopping=True,
)

output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text: \n", output)
```
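
The snippet above fixes `min_length=512`, so generation is forced to produce at least 512 tokens, which may be more than a short input needs. The following is a minimal sketch of a reusable wrapper that exposes those bounds, assuming a model and tokenizer already loaded as shown above. `summarize` is a hypothetical helper name that is not part of this repository, and its default values are simply copied from the snippet.

```python
def summarize(text, model, tokenizer, device="cpu",
              min_length=512, max_length=1024):
    """Summarize `text` with an already loaded T5 model and tokenizer.

    Hypothetical helper for illustration; not part of the original model card.
    """
    # T5 selects the task through a text prefix
    prompt = "summarize: " + text.strip()
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Same beam-search settings as the usage snippet, with adjustable bounds
    summary_ids = model.generate(
        input_ids,
        num_beams=4,
        no_repeat_ngram_size=2,
        min_length=min_length,
        max_length=max_length,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

For example, `summarize(preprocess_text, t5_model, t5_tokenizer, device=device, min_length=64, max_length=256)` would reuse the objects loaded above with shorter bounds. The values 64 and 256 are arbitrary illustrations, not tuned settings.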

## Citing & Authors

### Contributions
[@rufimelo99](https://github.com/rufimelo99)

If you use this work, please cite:

```bibtex
@inproceedings{MeloSemantic,
  author = {Melo, Rui and Santos, Pedro Alexandre and Dias, Jo{\~a}o},
  title = {A {Semantic} {Search} {System} for {Supremo} {Tribunal} de {Justi}{\c c}a},
}

@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}
```