metadata
language:
- pt
thumbnail: Portuguese T5 for the Legal Domain
tags:
- transformers
license: mit
pipeline_tag: summarization
Work developed as part of Project IRIS.
Thesis: A Semantic Search System for Supremo Tribunal de Justiça
stjiris/t5-portuguese-legal-summarization
A T5 model fine-tuned from the “unicamp-dl/ptt5-base-portuguese-vocab” checkpoint.
We trained this model on a collection of jurisprudence documents and their corresponding summaries.
Usage (HuggingFace transformers)
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Model checkpoint (Hub ID or path to a local folder)
model_checkpoint = "stjiris/t5-portuguese-legal-summarization"

# Use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
t5_model = T5ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
t5_tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

# Placeholder input; prepend the T5 task prefix before tokenizing
preprocess_text = "These are some big words and text and words and text, again, that we want to summarize"
t5_prepared_text = "summarize: " + preprocess_text
# print("original text preprocessed: \n", preprocess_text)
tokenized_text = t5_tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

# Summarize with beam search
summary_ids = t5_model.generate(
    tokenized_text,
    num_beams=4,
    no_repeat_ngram_size=2,
    min_length=512,
    max_length=1024,
    early_stopping=True,
)
output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n\nSummarized text: \n", output)
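For repeated use, the steps above can be wrapped in small helpers. The following is a minimal sketch, not part of the original card: the names `build_prompt` and `summarize`, the whitespace normalization, and the `max_input_tokens=512` truncation limit are all assumptions made for illustration and may need adjusting for real rulings.

```python
def build_prompt(text: str, prefix: str = "summarize: ") -> str:
    """Collapse runs of whitespace and prepend the T5 task prefix.

    The whitespace cleanup is an assumed preprocessing step, not one
    documented by the model authors.
    """
    return prefix + " ".join(text.split())


def summarize(text, model, tokenizer, device="cpu",
              max_input_tokens=512, **generate_kwargs):
    """Summarize `text` with an already loaded T5 model and tokenizer.

    `max_input_tokens` is a hypothetical truncation limit; any extra
    keyword arguments are forwarded to `model.generate`.
    """
    # Tokenize the prefixed text, truncating overly long inputs
    inputs = tokenizer(
        build_prompt(text),
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    ).to(device)
    # Generate the summary token IDs and decode them back to text
    summary_ids = model.generate(**inputs, **generate_kwargs)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With `t5_model`, `t5_tokenizer`, and `device` set up as in the usage example, a call could look like `summarize(text, t5_model, t5_tokenizer, device=device, num_beams=4, no_repeat_ngram_size=2, early_stopping=True)`.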
Citing & Authors
Contributions
If you use this work, please cite:
@inproceedings{MeloSemantic,
author = {Melo, Rui and Santos, Professor Pedro Alexandre and Dias, Professor Jo{\~ a}o},
title = {A {Semantic} {Search} {System} for {Supremo} {Tribunal} de {Justi}{\c c}a},
}
@article{ptt5_2020,
title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
journal={arXiv preprint arXiv:2008.09144},
year={2020}
}