eduagarcia committed · Commit a197b92 (parent: fd8f5e0): Update

README.md CHANGED
- **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
- **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
- **Repository:** https://github.com/eduagarcia/roberta-legal-portuguese
- **Paper:** [Coming soon]
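
A minimal usage sketch with the Hugging Face `transformers` library; the repository id `eduagarcia/RoBERTaLexPT-base` and the example sentence are assumptions, not part of the original card:

```python
# Minimal sketch: masked-token prediction with RoBERTaLexPT-base.
# The repository id is assumed from the model name; adjust if it differs.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# RoBERTa-style checkpoints use "<mask>" as the mask token.
for pred in fill_mask("O réu foi condenado ao pagamento de <mask> por danos morais."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```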

## Evaluation

The model was evaluated on the ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to assess the quality and performance of language models in the Portuguese legal domain.

Macro F1-score (%) of several models on the PortuLex benchmark test splits:

| **Model** | **LeNER** | **UlyNER-PL** | **FGV-STF** | **RRIP** | **Average (%)** |
|----------------------------------------------------------------------------|-----------|-----------------|-------------|:---------:|-----------------|
| | | Coarse/Fine | Coarse | | |
| [BERTimbau-base](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 |
| [BERTimbau-large](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) | 88.64 | 87.77/84.74 | 79.71 | **83.79** | 84.60 |
| [Albertina-PT-BR-base](https://arxiv.org/abs/2305.06721) | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 |
| [Albertina-PT-BR-xlarge](https://arxiv.org/abs/2305.06721) | 90.09 | 88.36/**86.62** | 79.94 | 82.79 | 85.08 |
| [BERTikal-base](https://arxiv.org/abs/2110.15709) | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 |
| [JurisBERT-base](https://repositorio.ufms.br/handle/123456789/5119) | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 |
| [BERTimbauLAW-base](https://repositorio.ufms.br/handle/123456789/5119) | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 |
| [Legal-XLM-R-base](https://arxiv.org/abs/2306.02069) | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 |
| [Legal-XLM-R-large](https://arxiv.org/abs/2306.02069) | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 |
| [Legal-RoBERTa-PT-large](https://arxiv.org/abs/2306.02069) | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 |
| RoBERTaTimbau-base | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
| RoBERTaLegalPT-base | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
| RoBERTaLexPT-base | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |

In summary, RoBERTaLexPT consistently achieves top effectiveness on legal NLP tasks despite its base size. With sufficient pre-training data, it can surpass overparameterized models; the results highlight the importance of domain-diverse training data over sheer model scale.
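
For orientation, the sketch below shows how word-level NER labels can be aligned to this model's sub-word tokens when fine-tuning on a PortuLex-style task (e.g. LeNER); the repository id, label subset, and example sentence are assumptions, and the snippet does not reproduce the reported scores:

```python
# Sketch: preparing RoBERTaLexPT-base for a PortuLex-style NER task.
# Repository id, label set and example are illustrative assumptions,
# not the configuration behind the scores reported above.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PESSOA", "I-PESSOA", "B-ORGANIZACAO", "I-ORGANIZACAO"]  # toy subset

tokenizer = AutoTokenizer.from_pretrained("eduagarcia/RoBERTaLexPT-base")
model = AutoModelForTokenClassification.from_pretrained(
    "eduagarcia/RoBERTaLexPT-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Align word-level tags to sub-word tokens: label the first sub-token of each
# word and mask the rest with -100 so they are ignored by the loss.
def encode(words, word_labels):
    enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512)
    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:
            aligned.append(-100)               # special tokens
        elif word_id != previous:
            aligned.append(word_labels[word_id])
        else:
            aligned.append(-100)               # continuation sub-token
        previous = word_id
    enc["labels"] = aligned
    return enc

example = encode(["Tribunal", "de", "Justiça"], [3, 4, 4])
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))
print(example["labels"])
```

From there, the encoded examples can be passed to the `transformers` `Trainer` with a token-classification data collator and scored with macro F1, which is the metric reported in the table above.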
## Training Details

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a dedicated vocabulary for each pre-training corpus.
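
A corpus-specific vocabulary of this kind can be trained with a few lines of the `tokenizers` API; the file name, vocabulary size, and special tokens below are assumed RoBERTa-style defaults rather than the exact settings used:

```python
# Sketch: training a byte-level BPE vocabulary on one pre-training corpus
# with HuggingFace Tokenizers.  Paths, vocab size and special tokens are
# assumed RoBERTa-style defaults, not the card's exact configuration.
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_corpus.txt"],          # one plain-text dump per corpus (assumed name)
    vocab_size=50_265,                     # RoBERTa-base default; the actual size may differ
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("tokenizer-legalpt", exist_ok=True)
tokenizer.save_model("tokenizer-legalpt")  # writes vocab.json and merges.txt
```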
#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2,048 sequences, each containing a maximum of 512 tokens.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup followed by a linear decay of the learning rate.

For the other hyperparameters, we adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):

| **Hyperparameter**  | **RoBERTa-base** |
|---------------------|------------------|
| ...                 | ...              |
| AdamW $$\beta_2$$   | 0.98             |
| Gradient clipping   | 0.0              |
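
For orientation, the recipe above maps roughly onto a Hugging Face `Trainer` configuration as sketched below (62,500 steps × 2,048 sequences × at most 512 tokens corresponds to at most about 65.5 billion tokens). Values marked as placeholders are assumptions, not settings from this card:

```python
# Sketch: the stated pretraining recipe expressed as transformers objects.
# Only the values taken from the card are authoritative; the rest are
# placeholders for illustration.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("eduagarcia/RoBERTaLexPT-base")  # assumed repo id

# Masked language modeling with 15% of the input tokens masked (from the card).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="roberta-lex-pt-pretraining",  # placeholder
    max_steps=62_500,                         # 62,500 optimization steps (from the card)
    per_device_train_batch_size=64,           # placeholder; the global batch is 2,048 sequences
    gradient_accumulation_steps=4,            # placeholder; scale with device count to reach 2,048
    lr_scheduler_type="linear",               # linear decay after a linear warmup (from the card)
    warmup_steps=6_000,                       # placeholder warmup length
    learning_rate=6e-4,                       # placeholder; not stated in this excerpt
    adam_beta2=0.98,                          # from the hyperparameter table
    max_grad_norm=0.0,                        # gradient clipping 0.0, i.e. disabled (from the table)
)
```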
## Citation

```
@InProceedings{garcia2024_roberlexpt,
  author    = "Garcia, Eduardo A. S.
               and Silva, N{\'a}dia F. F.
               and Siqueira, Felipe
               and Gomes, Juliana R. S.
               and Albuquerque, Hidelberg O.
               and Souza, Ellen
               and Lima, Eliomar
               and De Carvalho, André",
  title     = "RoBERTaLexPT: A Legal RoBERTa Model pretrained with deduplication for Portuguese",
  booktitle = "Computational Processing of the Portuguese Language",
  year      = "2024",
  publisher = "Association for Computational Linguistics"
}
```

## Acknowledgment