---
language: ca
license: apache-2.0
tags:
- "catalan"
- "masked-lm"
- "distilroberta"
widget:
- text: "El Català és una llengua molt <mask>."
- text: "Salvador Dalí va viure a <mask>."
- text: "La Costa Brava té les millors <mask> d'Espanya."
- text: "El cacaolat és un batut de <mask>."
- text: "<mask> és la capital de la Garrotxa."
- text: "Vaig al <mask> a buscar bolets."
- text: "Antoni Gaudí vas ser un <mask> molt important per la ciutat."
- text: "Catalunya és una referència en <mask> a nivell europeu."
---

# DistilRoBERTa-base-ca

## Overview
- **Architecture:** DistilRoBERTa-base
- **Language:** Catalan
- **Task:** Fill-Mask
- **Data:** Crawling

## Model description

This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). 
It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation 
from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).

The resulting architecture consists of 6 layers, 768-dimensional embeddings, and 12 attention heads, 
for a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model. 
This makes the model lighter and faster than the original, at the cost of slightly lower performance.
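
As a quick illustration of fill-mask inference, here is a minimal sketch using the 🤗 Transformers `pipeline` API. The Hub ID below is an assumption used for illustration; substitute this model's actual repository path.

```python
from transformers import pipeline

# Hub ID assumed for illustration; replace with this model's actual repository path.
MODEL_ID = "projecte-aina/distilroberta-base-ca"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# The <mask> token is filled with the most likely Catalan candidates.
for prediction in fill_mask("El Català és una llengua molt <mask>."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.4f}")
```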

## Training

### Training procedure

This model was trained with knowledge distillation, a technique for shrinking networks to a manageable size while minimizing the loss in performance.

It consists of distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).

In this teacher-student setup, a relatively small student model is trained to mimic the behavior of a larger teacher model. As a result, the student has a lower inference time and can run on commodity hardware.
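
For intuition, the sketch below shows the core of a typical distillation objective in PyTorch: a temperature-scaled KL divergence between the student's and the teacher's output distributions, combined with the standard masked-language-modelling loss. This is a simplified illustration, not the exact loss used in the official distillation scripts; the temperature, weighting coefficients, and tensor shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=0.5, alpha_mlm=0.5):
    """Toy distillation objective: soft-target KL term + hard-target MLM term.

    The coefficients and temperature are illustrative placeholders, not the
    values used to train this model.
    """
    # Soft targets: match the teacher's temperature-softened distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: usual cross-entropy on masked tokens (labels = -100 elsewhere).
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha_ce * kl + alpha_mlm * mlm

# Dummy tensors only to show the expected shapes: (batch, seq_len, vocab_size).
student = torch.randn(2, 8, 50000)
teacher = torch.randn(2, 8, 50000)
labels = torch.full((2, 8), -100)
labels[:, 3] = 17  # pretend one position per sequence is masked
print(distillation_loss(student, teacher, labels))
```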

### Training data

The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:

| Corpus                   | Size (GB)  |
|--------------------------|------------|
| Catalan Crawling         | 13.00      |
| RacoCatalá               | 8.10       |
| Catalan Oscar            | 4.00       |
| CaWaC                    | 3.60       |
| Cat. General Crawling    | 2.50       |
| Wikipedia                | 1.10       |
| DOGC                     | 0.78       |
| Padicat                  | 0.63       |
| ACN                      | 0.42       |
| Nació Digital            | 0.42       |
| Cat. Government Crawling | 0.24       |
| Vilaweb                  | 0.06       |
| Catalan Open Subtitles   | 0.02       |
| Tweets                   | 0.02       |

## Evaluation

### Evaluation benchmark

This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:

| Dataset   | Task| Total   | Train  | Dev   | Test  |
|:----------|:----|:--------|:-------|:------|:------|
| AnCora    | NER | 13,581  | 10,628 | 1,427 | 1,526 |
| AnCora    | POS | 16,678  | 13,123 | 1,709 | 1,846 |
| STS-ca    | STS | 3,073   | 2,073  | 500   | 500   |
| TeCla     | TC  | 137,775 | 110,203| 13,786| 13,786|
| TE-ca     | RTE | 21,163  | 16,930 | 2,116 | 2,117 |
| CatalanQA | QA  | 21,427  | 17,135 | 2,157 | 2,135 |
| XQuAD-ca  | QA  |   -     |   -    |   -   | 1,189 |
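
As an illustration of how such fine-tuning is set up, the sketch below adds a sequence-classification head on top of the distilled checkpoint and runs a single optimizer step on a toy batch. The Hub ID, label set, example sentences, and hyperparameters are placeholders, not the configuration used for the reported results; real CLUB fine-tuning trains on the full task datasets for several epochs.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hub ID assumed for illustration; replace with this model's actual repository path.
MODEL_ID = "projecte-aina/distilroberta-base-ca"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# A randomly initialized classification head is added on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy classification batch with arbitrary labels, just to show the mechanics.
texts = ["El Barça va guanyar el partit.", "El Parlament aprova els pressupostos."]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss is computed internally
outputs.loss.backward()
optimizer.step()
print(f"toy training loss: {outputs.loss.item():.4f}")
```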

### Evaluation results

This is how the distilled model compares to its teacher when both are fine-tuned on the aforementioned downstream tasks:

|      Model  \  Task     |NER (F1)|POS (F1)|STS-ca (Comb.)|TeCla (Acc.)|TE-ca (Acc.)|CatalanQA (F1/EM)| XQuAD-ca <sup>1</sup> (F1/EM) | 
| ------------------------|:-------|:-------|:-------------|:-----------|:----------|:----------------|:------------------------------|
| RoBERTa-base-ca-v2      | 89.29  | 98.96  | 79.07        | 74.26      | 83.14     | 89.50/76.63     | 73.64/55.42                   |
| DistilRoBERTa-base-ca   | 87.88  | 98.83  | 77.26        | 73.20      | 76.00     | 84.07/70.77     | 62.93/45.08                   |

<sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.