eduagarcia committed · Commit a197b92 (parent: fd8f5e0): Update

README.md CHANGED
- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]
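
A minimal usage sketch with the Hugging Face `transformers` library; the repository id `eduagarcia/RoBERTaLexPT-base` and the example sentence are assumptions, not part of the original card:

```python
# Minimal sketch: masked-token prediction with RoBERTaLexPT-base.
# The repository id is assumed from the model name; adjust if it differs.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa-style checkpoints use "<mask>" as the mask token.
for pred in fill_mask("O réu foi condenado ao pagamento de <mask> por danos morais."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```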

## Evaluation

The model was evaluated on the ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to assess the quality and performance of language models in the Portuguese legal domain.

Macro F1-score (%) of several models on the PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://arxiv.org/abs/2110.15709) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| RoBERTaTimbau-base | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| RoBERTaLexPT-base | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |

In summary, RoBERTaLexPT consistently achieves top effectiveness on legal NLP tasks despite its base size. With sufficient pre-training data, it can surpass overparameterized models; the results highlight the importance of domain-diverse training data over sheer model scale.
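
For orientation, the sketch below shows how word-level NER labels can be aligned to this model's sub-word tokens when fine-tuning on a PortuLex-style task (e.g. LeNER); the repository id, label subset, and example sentence are assumptions, and the snippet does not reproduce the reported scores:

```python
# Sketch: preparing RoBERTaLexPT-base for a PortuLex-style NER task.
# Repository id, label set and example are illustrative assumptions,
# not the configuration behind the scores reported above.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PESSOA", "I-PESSOA", "B-ORGANIZACAO", "I-ORGANIZACAO"]  # toy subset

tokenizer = AutoTokenizer.from_pretrained("eduagarcia/RoBERTaLexPT-base")
model = AutoModelForTokenClassification.from_pretrained(
    "eduagarcia/RoBERTaLexPT-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Align word-level tags to sub-word tokens: label the first sub-token of each
# word and mask the rest with -100 so they are ignored by the loss.
def encode(words, word_labels):
    enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512)
    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:
            aligned.append(-100)               # special tokens
        elif word_id != previous:
            aligned.append(word_labels[word_id])
        else:
            aligned.append(-100)               # continuation sub-token
        previous = word_id
    enc["labels"] = aligned
    return enc

example = encode(["Tribunal", "de", "Justiça"], [3, 4, 4])
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))
print(example["labels"])
```

From there, the encoded examples can be passed to the `transformers` `Trainer` with a token-classification data collator and scored with macro F1, which is the metric reported in the table above.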
## Training Details

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a dedicated vocabulary for each pre-training corpus.
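
A corpus-specific vocabulary of this kind can be trained with a few lines of the `tokenizers` API; the file name, vocabulary size, and special tokens below are assumed RoBERTa-style defaults rather than the exact settings used:

```python
# Sketch: training a byte-level BPE vocabulary on one pre-training corpus
# with HuggingFace Tokenizers.  Paths, vocab size and special tokens are
# assumed RoBERTa-style defaults, not the card's exact configuration.
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_corpus.txt"],          # one plain-text dump per corpus (assumed name)
    vocab_size=50_265,                     # RoBERTa-base default; the actual size may differ
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("tokenizer-legalpt", exist_ok=True)
tokenizer.save_model("tokenizer-legalpt")  # writes vocab.json and merges.txt
```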
#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2,048 sequences, each containing a maximum of 512 tokens.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup followed by a linear decay of the learning rate.

For the other hyperparameters, we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):

| **Hyperparameter**  | **RoBERTa-base** |
|---------------------|------------------|
| ...                 | ...              |
| AdamW $$\beta_2$$   | 0.98             |
| Gradient clipping   | 0.0              |
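
For orientation, the recipe above maps roughly onto a Hugging Face `Trainer` configuration as sketched below (62,500 steps × 2,048 sequences × at most 512 tokens corresponds to at most about 65.5 billion tokens). Values marked as placeholders are assumptions, not settings from this card:

```python
# Sketch: the stated pretraining recipe expressed as transformers objects.
# Only the values taken from the card are authoritative; the rest are
# placeholders for illustration.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("eduagarcia/RoBERTaLexPT-base")  # assumed repo id

# Masked language modeling with 15% of the input tokens masked (from the card).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="roberta-lex-pt-pretraining",  # placeholder
    max_steps=62_500,                         # 62,500 optimization steps (from the card)
    per_device_train_batch_size=64,           # placeholder; the global batch is 2,048 sequences
    gradient_accumulation_steps=4,            # placeholder; scale with device count to reach 2,048
    lr_scheduler_type="linear",               # linear decay after a linear warmup (from the card)
    warmup_steps=6_000,                       # placeholder warmup length
    learning_rate=6e-4,                       # placeholder; not stated in this excerpt
    adam_beta2=0.98,                          # from the hyperparameter table
    max_grad_norm=0.0,                        # gradient clipping 0.0, i.e. disabled (from the table)
)
```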
## Citation

```
@InProceedings{garcia2024_roberlexpt,
  author    = "Garcia, Eduardo A. S.
               and Silva, N{\'a}dia F. F.
               and Siqueira, Felipe
               and Gomes, Juliana R. S.
               and Albuquerque, Hidelberg O.
               and Souza, Ellen
               and Lima, Eliomar
               and De Carvalho, André",
  title     = "RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
  booktitle = "Computational Processing of the Portuguese Language",
  year      = "2024",
  publisher = "Association for Computational Linguistics"
}
```

## Acknowledgment