mapama247 committed
Commit 04bdd72
1 Parent(s): 61184f9

Update README.md

Files changed (1):
  README.md +22 -23
README.md CHANGED
@@ -1,32 +1,28 @@
  ---
- language: ca
+ language:
+ - ca
  license: apache-2.0
  tags:
- - "catalan"
- - "masked-lm"
- - "distilroberta"
+ - catalan
+ - masked-lm
+ - distilroberta
  widget:
- - text: "El Català és una llengua molt <mask>."
- - text: "Salvador Dalí va viure a <mask>."
- - text: "La Costa Brava té les millors <mask> d'Espanya."
- - text: "El cacaolat és un batut de <mask>."
- - text: "<mask> és la capital de la Garrotxa."
- - text: "Vaig al <mask> a buscar bolets."
- - text: "Antoni Gaudí vas ser un <mask> molt important per la ciutat."
- - text: "Catalunya és una referència en <mask> a nivell europeu."
+ - text: El Català és una llengua molt <mask>.
+ - text: Salvador Dalí va viure a <mask>.
+ - text: La Costa Brava té les millors <mask> d'Espanya.
+ - text: El cacaolat és un batut de <mask>.
+ - text: <mask> és la capital de la Garrotxa.
+ - text: Vaig al <mask> a buscar bolets.
+ - text: Antoni Gaudí vas ser un <mask> molt important per la ciutat.
+ - text: Catalunya és una referència en <mask> a nivell europeu.
  ---

  # DistilRoBERTa-base-ca

- ## Overview
- - **Architecture:** DistilRoBERTa-base
- - **Language:** Catalan
- - **Task:** Fill-Mask
- - **Data:** Crawling
-
  ## Model description

  This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2).
+
  It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation
  from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).

@@ -38,18 +34,21 @@ This makes the model lighter and faster than the original, at the cost of a slig

  ### Training procedure

- This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.
+ This model has been trained using a technique known as Knowledge Distillation,
+ which is used to shrink networks to a reasonable size while minimizing the loss in performance.

- It basically consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
+ It basically consists in distilling a large language model (the teacher) into a more
+ lightweight, energy-efficient, and production-friendly model (the student).

- So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model. As a result, the student has lower inference time and the ability to run in commodity hardware.
+ So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
+ As a result, the student has lower inference time and the ability to run in commodity hardware.

  ### Training data

  The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:

  | Corpus | Size (GB) |
- |--------------------------|------------|
+ |--------------------------|-----------:|
  | Catalan Crawling | 13.00 |
  | RacoCatalá | 8.10 |
  | Catalan Oscar | 4.00 |
@@ -90,4 +89,4 @@ This is how it compares to its teacher when fine-tuned on the aforementioned dow
  | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 |
  | DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 |

- <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
+ <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca (no train set).
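
The widget prompts added in the front matter exercise the model's fill-mask task. For quick local testing, a minimal sketch with the `transformers` pipeline is shown below; the repository id is a placeholder, since the commit does not spell out the model's Hub id.

```python
# Minimal fill-mask sketch for one of the widget prompts.
# "your-org/distilroberta-base-ca" is a placeholder; substitute the real Hub repo id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="your-org/distilroberta-base-ca")

# Print the top predictions for the masked token.
for pred in fill_mask("El Català és una llengua molt <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```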
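The reflowed "Training procedure" paragraphs describe teacher-student Knowledge Distillation. As an illustration only (the card points to the DistilBERT distillation scripts for the actual recipe, which also adds a cosine embedding loss between teacher and student hidden states), the core training signal combines a temperature-softened KL term against the teacher's predictions with the usual masked-language-modelling loss. A schematic PyTorch sketch, with arbitrary example values for the temperature and loss weighting:

```python
# Schematic distillation loss (illustrative only; not the scripts linked in the card).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target KL term (teacher guidance) with the hard MLM loss."""
    # Soft targets: match the teacher's temperature-softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard masked-language-modelling cross-entropy.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )
    return alpha * soft + (1.0 - alpha) * hard
```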