asier-gutierrez committed on
Commit
0f27dde
1 Parent(s): 07bddc0

Create README.md

---
language:
- es
license: cc-by-4.0
tags:
- legal
- spanish
- language model
datasets:
- legal_ES+temu
metrics:
- ppl
---
# Spanish Legal-domain RoBERTa

Two main models have been created specifically for the Spanish language: the BETO model and a Spanish GPT-2. There is also multilingual BERT (mBERT), which is often used because it can sometimes perform better.

Both BETO and the Spanish GPT-2 were trained with rather limited resources: 4GB and 3GB of data, respectively. The data used to train these models may span various domains, but the amount is not enough to cover them all. Furthermore, training a domain-specific BERT-like model is preferable, as it effectively covers the domain vocabulary and understands the legal jargon. We present our models, trained on 9GB of data drawn specifically from the legal domain.
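The metadata above lists perplexity (ppl) as the evaluation metric. As a reminder of what that number means, perplexity is the exponential of the average per-token negative log-likelihood; a minimal illustrative sketch (not the authors' evaluation code):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 tokens, so its perplexity is 4.
print(round(perplexity([math.log(0.25)] * 10), 6))  # → 4.0
```

Lower perplexity on held-out legal text indicates the model assigns higher probability to in-domain language.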

## Citing
```
TBA
```

For more information, visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-legal-es).