---
language:
- pt
thumbnail: Portuguese T5 for the Legal Domain
tags:
- transformers
license: mit
pipeline_tag: summarization
---


[![INESC-ID](https://www.inesc-id.pt/wp-content/uploads/2019/06/INESC-ID-logo_01.png)](https://www.inesc-id.pt/projects/PR07005/)

[![A Semantic Search System for Supremo Tribunal de Justiça](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/_static/logo.png)](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/)

Work developed as part of [Project IRIS](https://www.inesc-id.pt/projects/PR07005/).

Thesis: [A Semantic Search System for Supremo Tribunal de Justiça](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/)

# stjiris/t5-portuguese-legal-summarization

A T5 model fine-tuned from the `unicamp-dl/ptt5-base-portuguese-vocab` checkpoint.

The model was trained on a collection of jurisprudence documents paired with their summaries.


## Usage (HuggingFace transformers)
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_checkpoint = "stjiris/t5-portuguese-legal-summarization"
t5_model = T5ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
t5_tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

# Text to summarize, prepended with the T5 task prefix "summarize: "
preprocess_text = "These are some big words and text and words and text, again, that we want to summarize"
t5_prepared_text = "summarize: " + preprocess_text

tokenized_text = t5_tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

# Summarize with beam search
summary_ids = t5_model.generate(tokenized_text,
                                num_beams=4,
                                no_repeat_ngram_size=2,
                                min_length=512,
                                max_length=1024,
                                early_stopping=True)

output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text: \n", output)

```
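
Legal documents may be longer than the model can comfortably take in a single pass. The following is a minimal sketch of one way to prepare such inputs, assuming a simple word-based split is acceptable. The helper names `chunk_words` and `build_prompts` and the 300-word default are hypothetical choices for illustration; they are not part of this model or of the transformers library. Each resulting prompt could then be tokenized and passed to `generate` as in the snippet above.

```python
from typing import List

# Same task prefix as in the usage snippet above
TASK_PREFIX = "summarize: "


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Words are whatever str.split() yields (whitespace-separated tokens).
    The 300-word default is an illustrative assumption, not a model limit.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def build_prompts(text: str, max_words: int = 300) -> List[str]:
    """Prefix every chunk with the summarization task prefix."""
    return [TASK_PREFIX + chunk for chunk in chunk_words(text, max_words)]


if __name__ == "__main__":
    # Five words split into chunks of two words each
    for prompt in build_prompts("a b c d e", max_words=2):
        print(prompt)
```

Splitting on words rather than tokens keeps the sketch free of model dependencies; a token-based split using the tokenizer would track the model's real input length more closely.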

## Citing & Authors

### Contributions
[@rufimelo99](https://github.com/rufimelo99)

If you use this work, please cite:

```bibtex
@inproceedings{MeloSemantic,
	author = {Melo, Rui and Santos, Professor Pedro Alexandre and Dias, Professor Jo{\~ a}o},
	title = {A {Semantic} {Search} {System} for {Supremo} {Tribunal} de {Justi}{\c c}a},
}

@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}

```