Update README.md
README.md (changed)

<h3>Introduction</h3>

This model is a <b>lightweight</b> and uncased version of <b>BERT</b> <b>[1]</b> for the <b>Italian</b> language. With its <b>55M parameters</b> and <b>220MB</b> size, it's <b>50% lighter</b> than a typical monolingual BERT model. It is ideal when memory consumption and execution speed are critical, while maintaining high-quality results.

<h3>Model description</h3>

The model builds on the multilingual <b>DistilBERT</b> <b>[2]</b> model (from the HuggingFace team: [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)) as a starting point, focusing it on the Italian language while also turning it into an uncased model by modifying the embedding layer (as in <b>[3]</b>, but computing document-level frequencies over the <b>Wikipedia</b> dataset and setting a frequency threshold of 0.1%), which brings a considerable reduction in the number of parameters.
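
For illustration, this kind of frequency-based vocabulary reduction can be sketched with the transformers library. The code below is a simplified reconstruction, not the exact procedure used for this model: the toy corpus stands in for the Italian Wikipedia dump, and the tokenizer rebuild is omitted.

```python
# A minimal sketch, assuming document-level frequencies are computed over the
# lowercased Italian Wikipedia split; a toy corpus stands in for it here.
import torch
from collections import Counter
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

docs = ["il gatto dorme sul divano", "roma è la capitale d'italia"]  # stand-in corpus
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenizer(doc.lower())["input_ids"]))

threshold = 0.001 * len(docs)  # keep tokens appearing in >= 0.1% of documents
keep = sorted({i for i, c in doc_freq.items() if c >= threshold} | set(tokenizer.all_special_ids))

# Slice the input embeddings (and the MLM output head) down to the kept rows.
emb = model.distilbert.embeddings.word_embeddings.weight.data
new_emb = torch.nn.Embedding(len(keep), emb.shape[1])
new_emb.weight.data.copy_(emb[keep])
model.distilbert.embeddings.word_embeddings = new_emb
model.vocab_projector.weight.data = model.vocab_projector.weight.data[keep]
model.vocab_projector.bias.data = model.vocab_projector.bias.data[keep]
model.vocab_projector.out_features = len(keep)
model.config.vocab_size = len(keep)
# NOTE: the tokenizer must also be rebuilt so token ids map onto the new rows.
```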

To compensate for the deletion of cased tokens, which forces the model to rely on lowercase representations of words that were previously capitalized, the model has been further pre-trained on the Italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [4]</b> technique to make it more robust to the new uncased representations.
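
Whole word masking itself is available off the shelf in the transformers library; below is a minimal sketch of producing such masked batches (the example sentence is a placeholder, not the actual training data).

```python
# A minimal sketch of whole word masking: when one sub-token of a word is
# selected for masking, the collator masks all of that word's sub-tokens together.
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

examples = [tokenizer("la divina commedia è un poema di dante alighieri")]
batch = collator(examples)
print(batch["input_ids"][0])  # input ids with whole words masked
print(batch["labels"][0])     # -100 everywhere except the masked positions
```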

The resulting model has 55M parameters, a vocabulary of 13,832 tokens, and a size of 220MB, which makes it <b>50% lighter</b> than a typical monolingual BERT model and 20% lighter than a standard monolingual DistilBERT model.
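
For reference, parameter counts like these can be checked directly on any checkpoint:

```python
# Count the parameters of a checkpoint (here the multilingual starting point).
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-multilingual-cased")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```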
<h3>Training procedure</h3>

| Part of Speech Tagging | 97.48 | 97.29 | 97.37 |
| Named Entity Recognition | 89.29 | 89.84 | 89.53 |

The metrics have been computed at the token level and macro-averaged over the classes.
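
For reproducibility, here is a minimal sketch of token-level, macro-averaged scoring with scikit-learn (the label sequences are toy placeholders, not the evaluation data):

```python
# A minimal sketch of token-level, macro-averaged metrics with scikit-learn;
# the label sequences below are toy placeholders, not the evaluation data.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["NOUN", "VERB", "DET", "NOUN", "ADJ"]  # gold token labels
y_pred = ["NOUN", "VERB", "DET", "ADJ", "ADJ"]   # predicted token labels

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```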

<h3>Demo</h3>

You can try the model online (fine-tuned on named entity recognition) using this web app: https://huggingface.co/spaces/osiria/next-it-demo

<h3>Quick usage</h3>
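
A minimal, self-contained sketch of the fill-mask usage follows; only the `pipeline_mlm` construction line is taken from the original section, while MODEL_ID stands in for this model's repository id and the example sentence is purely illustrative.

```python
# A minimal fill-mask sketch; MODEL_ID stands in for this model's repository
# id, and the example sentence is purely illustrative.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

MODEL_ID = "..."  # replace with this model's repository id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
pipeline_mlm = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)

print(pipeline_mlm("roma è la [MASK] d'italia"))
```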

<h3>Limitations</h3>

This lightweight model is mainly trained on Wikipedia, so it's particularly suitable as an agile analyzer for large volumes of natively digital text from the World Wide Web, written in a correct and fluent form (like wikis, web pages, news, etc.). However, it may show limitations with chaotic text containing errors and slang expressions (like social media posts) or with domain-specific text (like medical, financial or legal content).

<h3>References</h3>