---
language:
- pt
thumbnail: Portuguese T5 for the Legal Domain
tags:
- transformers
license: mit
pipeline_tag: summarization
---

[![INESC-ID](https://www.inesc-id.pt/wp-content/uploads/2019/06/INESC-ID-logo_01.png)](https://www.inesc-id.pt/projects/PR07005/)

[![A Semantic Search System for Supremo Tribunal de Justiça](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/_static/logo.png)](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/)

Work developed as part of [Project IRIS](https://www.inesc-id.pt/projects/PR07005/).

Thesis: [A Semantic Search System for Supremo Tribunal de Justiça](https://rufimelo99.github.io/SemanticSearchSystemForSTJ/)

# stjiris/t5-portuguese-legal-summarization

A T5 model fine-tuned from the [unicamp-dl/ptt5-base-portuguese-vocab](https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab) checkpoint.

The model was trained on court rulings (jurisprudence) paired with their corresponding summaries.

## Usage (HuggingFace Transformers)

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# load the model and tokenizer from the Hugging Face Hub
model_checkpoint = "stjiris/t5-portuguese-legal-summarization"
t5_model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
t5_tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

# run on GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t5_model.to(device)

text = "These are some big words and text and words and text, again, that we want to summarize"
t5_prepared_text = "summarize: " + text

tokenized_text = t5_tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

# summarize with beam search
summary_ids = t5_model.generate(tokenized_text,
                                num_beams=4,
                                no_repeat_ngram_size=2,
                                min_length=512,
                                max_length=1024,
                                early_stopping=True)

output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text:\n", output)
```
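
For quick experiments, the model can also be loaded through the high-level `pipeline` API. This is a minimal sketch rather than part of the original card: the generation parameters simply mirror the example above, and the `summarize:` prefix is added explicitly since a fine-tuned checkpoint may not carry task-specific defaults.

```python
# Minimal sketch (not from the original card): summarization via the
# high-level pipeline API; generation kwargs are forwarded to generate().
from transformers import pipeline

summarizer = pipeline("summarization",
                      model="stjiris/t5-portuguese-legal-summarization")

text = "summarize: " + "These are some big words and text and words and text, again, that we want to summarize"

result = summarizer(text, num_beams=4, no_repeat_ngram_size=2, max_length=1024)
print(result[0]["summary_text"])
```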

## Citing & Authors

### Contributions
[@rufimelo99](https://github.com/rufimelo99)

If you use this work, please cite:

```bibtex
@inproceedings{MeloSemantic,
    author = {Melo, Rui and Santos, Pedro Alexandre and Dias, Jo{\~a}o},
    title = {A {Semantic} {Search} {System} for {Supremo} {Tribunal} de {Justi\c{c}a}},
}

@article{ptt5_2020,
    title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
    author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
    journal={arXiv preprint arXiv:2008.09144},
    year={2020}
}
```