mapama247 committed
Commit 04bdd72
1 Parent(s): 61184f9

Update README.md

Files changed (1):
  README.md +22 -23
README.md CHANGED
@@ -1,32 +1,28 @@
  ---
- language: ca
+ language:
+ - ca
  license: apache-2.0
  tags:
- - "catalan"
- - "masked-lm"
- - "distilroberta"
+ - catalan
+ - masked-lm
+ - distilroberta
  widget:
- - text: "El Català és una llengua molt <mask>."
- - text: "Salvador Dalí va viure a <mask>."
- - text: "La Costa Brava té les millors <mask> d'Espanya."
- - text: "El cacaolat és un batut de <mask>."
- - text: "<mask> és la capital de la Garrotxa."
- - text: "Vaig al <mask> a buscar bolets."
- - text: "Antoni Gaudí vas ser un <mask> molt important per la ciutat."
- - text: "Catalunya és una referència en <mask> a nivell europeu."
+ - text: El Català és una llengua molt <mask>.
+ - text: Salvador Dalí va viure a <mask>.
+ - text: La Costa Brava té les millors <mask> d'Espanya.
+ - text: El cacaolat és un batut de <mask>.
+ - text: <mask> és la capital de la Garrotxa.
+ - text: Vaig al <mask> a buscar bolets.
+ - text: Antoni Gaudí vas ser un <mask> molt important per la ciutat.
+ - text: Catalunya és una referència en <mask> a nivell europeu.
  ---

  # DistilRoBERTa-base-ca

- ## Overview
- - **Architecture:** DistilRoBERTa-base
- - **Language:** Catalan
- - **Task:** Fill-Mask
- - **Data:** Crawling
-
  ## Model description

  This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2).
+
  It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation
  from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).

@@ -38,18 +34,21 @@ This makes the model lighter and faster than the original, at the cost of a slig

  ### Training procedure

- This model has been trained using a technique known as Knowledge Distillation, which is used to shrink networks to a reasonable size while minimizing the loss in performance.
+ This model has been trained using a technique known as Knowledge Distillation,
+ which is used to shrink networks to a reasonable size while minimizing the loss in performance.

- It basically consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
+ It basically consists in distilling a large language model (the teacher) into a more
+ lightweight, energy-efficient, and production-friendly model (the student).

- So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model. As a result, the student has lower inference time and the ability to run in commodity hardware.
+ So, in a “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
+ As a result, the student has lower inference time and the ability to run in commodity hardware.

  ### Training data

  The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:

  | Corpus | Size (GB) |
- |--------------------------|------------|
+ |--------------------------|-----------:|
  | Catalan Crawling | 13.00 |
  | RacoCatalá | 8.10 |
  | Catalan Oscar | 4.00 |
@@ -90,4 +89,4 @@ This is how it compares to its teacher when fine-tuned on the aforementioned dow
  | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 |
  | DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 |

- <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca.
+ <sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca (no train set).
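
The widget prompts added in the front matter exercise the model's fill-mask task. For quick local testing, a minimal sketch with the `transformers` pipeline is shown below; the repository id is a placeholder, since the commit does not spell out the model's Hub id.

```python
# Minimal fill-mask sketch for one of the widget prompts.
# "your-org/distilroberta-base-ca" is a placeholder; substitute the real Hub repo id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="your-org/distilroberta-base-ca")

# Print the top predictions for the masked token.
for pred in fill_mask("El Català és una llengua molt <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```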
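The reflowed "Training procedure" paragraphs describe teacher-student Knowledge Distillation. As an illustration only (the card points to the DistilBERT distillation scripts for the actual recipe, which also adds a cosine embedding loss between teacher and student hidden states), the core training signal combines a temperature-softened KL term against the teacher's predictions with the usual masked-language-modelling loss. A schematic PyTorch sketch, with arbitrary example values for the temperature and loss weighting:

```python
# Schematic distillation loss (illustrative only; not the scripts linked in the card).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target KL term (teacher guidance) with the hard MLM loss."""
    # Soft targets: match the teacher's temperature-softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard masked-language-modelling cross-entropy.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )
    return alpha * soft + (1.0 - alpha) * hard
```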