Update README.md
README.md (changed)

</div>

# Model Card for Minerva-7B-base-v1.0

Minerva is the first family of **LLMs pretrained from scratch on Italian**, developed by [Sapienza NLP](https://nlp.uniroma1.it)
in collaboration with [Future Artificial Intelligence Research (FAIR)](https://fondazione-fair.it/) and [CINECA](https://www.cineca.it/).
Notably, the Minerva models are truly open (data and model) Italian-English LLMs, with approximately half of the pretraining data
including Italian text.

* [Minerva LLMs - website](https://nlp.uniroma1.it/minerva/)

## Description

This is the model card for **Minerva-7B-base-v1.0**, a 7-billion-parameter model trained on almost 2.5 trillion tokens (1.14 trillion in Italian,
1.14 trillion in English, and 200 billion in code).

This model is part of the Minerva LLM family:

* [Minerva-7B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0)

## 🚨⚠️🚨 Bias, Risks, and Limitations 🚨⚠️🚨

*This section identifies foreseeable harms and misunderstandings.*

This is a foundation model, not subject to alignment. The model may:

We are aware of the biases and potentially problematic/toxic content that current pretrained large language models exhibit: more specifically, as probabilistic models of (Italian and English) languages, they reflect and amplify the biases of their training data.
For more information about this issue, please refer to our survey:

* [Biases in Large Language Models: Origins, Inventory, and Discussion](https://dl.acm.org/doi/full/10.1145/3597307)

## How to use Minerva with Hugging Face transformers
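
Below is a minimal sketch of the standard `transformers` causal-LM loading path; the Hub ID `sapienzanlp/Minerva-7B-base-v1.0`, the dtype, and the generation settings are illustrative assumptions, so adjust them to your hardware and use case.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sapienzanlp/Minerva-7B-base-v1.0"  # assumed Hub ID for this model card

# Load the tokenizer and the model; bfloat16 keeps the 7B weights at roughly 14 GB.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base (non-instruct) model: plain text completion, no chat template.
prompt = "La capitale dell'Italia è"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is a base model, it performs plain text continuation; prompts should be phrased as text to complete rather than as instructions.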

| [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.2612 |
| [arc challenge](https://huggingface.co/datasets/alexandrainst/m_arc) (5-shot) | 0.3268 | -->

**English** Data:
| Task | Accuracy |
| --- | --- |
| [arc challenge](allenai/ai2_arc) (5-shot) | 0.3319 |
| [arc easy](allenai/ai2_arc) (5-shot) | 0.6540 | -->

## Training Data

Minerva-7B-base-v1.0 is trained on 1.14T Italian tokens, 1.14T English tokens, and 200B code tokens.

The training data is a mixture of the following datasets:

| Dataset | Tokens | Language | Epochs |
| --- | --- | --- | --- |
| RedPajama-Data-V2 | 687,952,502,784 | Italian | 1.3 |
| CulturaX | 158,201,876,480 | Italian | 1.5 |
| Wikipedia | 1,265,135,616 | Italian | 1.0 |
| Gutenberg/Wikisource | 147,017,728 | Italian | 2.0 |
| EurLex | 1,647,013,888 | Italian | 1.0 |
| Gazzetta Ufficiale | 1,654,013,952 | Italian | 1.0 |
| FineWeb | 1,076,406,624,256 | English | 1.0 |
| Wikipedia | 5,259,501,568 | English | 1.0 |
| ArXiv | 33,231,106,048 | English | 1.0 |
| Gutenberg | 6,947,893,248 | English | 1.0 |
| StackExchange | 22,069,268,480 | English | 1.0 |
| The Stack V2 | 200,754,900,992 | Code | 1.0 |
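
Multiplying each dataset's token count by its number of epochs and grouping by language reproduces the headline totals above (roughly 1.14T Italian, 1.14T English, and 200B code tokens), which suggests those figures count tokens seen across repeated epochs. A short sketch of that arithmetic, with the values copied from the table:

```python
from collections import defaultdict

# (dataset, tokens, language, epochs) rows copied from the table above.
rows = [
    ("RedPajama-Data-V2", 687_952_502_784, "Italian", 1.3),
    ("CulturaX", 158_201_876_480, "Italian", 1.5),
    ("Wikipedia (it)", 1_265_135_616, "Italian", 1.0),
    ("Gutenberg/Wikisource", 147_017_728, "Italian", 2.0),
    ("EurLex", 1_647_013_888, "Italian", 1.0),
    ("Gazzetta Ufficiale", 1_654_013_952, "Italian", 1.0),
    ("FineWeb", 1_076_406_624_256, "English", 1.0),
    ("Wikipedia (en)", 5_259_501_568, "English", 1.0),
    ("ArXiv", 33_231_106_048, "English", 1.0),
    ("Gutenberg", 6_947_893_248, "English", 1.0),
    ("StackExchange", 22_069_268_480, "English", 1.0),
    ("The Stack V2", 200_754_900_992, "Code", 1.0),
]

totals = defaultdict(float)
for _, tokens, language, epochs in rows:
    totals[language] += tokens * epochs  # tokens actually seen during training

for language, total in totals.items():
    print(f"{language}: {total / 1e12:.2f}T tokens")
# Italian: ~1.14T, English: ~1.14T, Code: ~0.20T
```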

<!-- We have extracted some statistics on Italian (115B tokens) and English (210B tokens) documents from CulturaX on the selected sources:

*Proportion of number of tokens per domain (Italian)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_it.png?raw=true" alt="italian-tok-counts" border="0" width="1800px">

*Proportion of number of tokens per domain (English)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_en.png?raw=true" alt="english-tok-counts" border="0" width="1800px">
-->

## Tokenizer Fertility

The tokenizer fertility measures the average number of tokens produced per tokenized word.
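
As an illustration of the metric, a rough estimate can be obtained by dividing the number of tokens the tokenizer produces by a naive whitespace word count. The snippet below is a sketch under that assumption; the sample sentence is arbitrary, and the official fertility figures may use a different word-counting convention.

```python
from transformers import AutoTokenizer

# Arbitrary Italian sample; a large corpus sample gives a more stable estimate.
text = "La Repubblica riconosce e garantisce i diritti inviolabili della persona."

tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-7B-base-v1.0")

n_words = len(text.split())  # naive whitespace word count
n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])

print(f"fertility ≈ {n_tokens / n_words:.2f} tokens per word")
```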

* **Roberto Navigli:** project coordinator

### Special thanks for their support

* Giuseppe Fiameni, Nvidia
* Sergio Orlandini, CINECA

## Acknowledgments

This work was funded by the PNRR MUR project [PE0000013-FAIR](https://fondazione-fair.it).
We acknowledge the [CINECA](https://www.cineca.it) award "IscB_medit" under the ISCRA initiative for the availability of high-performance computing resources and support.