riccorl committed on
Commit b3d75ef
1 Parent(s): 842799c

Update README.md

Files changed (1)
  1. README.md +27 -7
README.md CHANGED
@@ -29,6 +29,7 @@ widget:
  </div>

  # Model Card for Minerva-7B-base-v1.0
+
  Minerva is the first family of **LLMs pretrained from scratch on Italian** developed by [Sapienza NLP](https://nlp.uniroma1.it)
  in collaboration with [Future Artificial Intelligence Research (FAIR)](https://fondazione-fair.it/) and [CINECA](https://www.cineca.it/).
  Notably, the Minerva models are truly-open (data and model) Italian-English LLMs, with approximately half of the pretraining data
@@ -37,6 +38,7 @@ including Italian text.
  * [Minerva LLMs - website](https://nlp.uniroma1.it/minerva/)

  ## Description
+
  This is the model card for **Minerva-7B-base-v1.0**, a 7 billion parameter model trained on almost 2.5 trillion tokens (1.14 trillion in Italian,
  1.14 trillion in English, and 200 billion in code).

@@ -48,6 +50,7 @@ This model is part of the Minerva LLM family:
  * [Minerva-7B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0)

  ## 🚨⚠️🚨 Bias, Risks, and Limitations 🚨⚠️🚨
+
  *This section identifies foreseeable harms and misunderstandings.*

  This is a foundation model, not subject to alignment. The model may:
@@ -65,6 +68,7 @@ This is a foundation model, not subject to alignment. The model may:

  We are aware of the biases and potential problematic/toxic content that current pretrained large language models exhibit: more specifically, as probabilistic models of (Italian and English) languages, they reflect and amplify the biases of their training data.
  For more information about this issue, please refer to our survey:
+
  * [Biases in Large Language Models: Origins, Inventory, and Discussion](https://dl.acm.org/doi/full/10.1145/3597307)

  ## How to use Minerva with Hugging Face transformers
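The hunk above ends at the "How to use Minerva with Hugging Face transformers" section, whose body is untouched by this commit and therefore not shown. As context, a minimal usage sketch, assuming the Hub id `sapienzanlp/Minerva-7B-base-v1.0` and a standard `transformers` causal-LM setup (the model card's own snippet may differ):

```python
# Minimal usage sketch (not taken from the commit); assumes the Hub id below
# and that `transformers`, `torch`, and `accelerate` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sapienzanlp/Minerva-7B-base-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: bf16 weights fit the available device
    device_map="auto",            # requires `accelerate`; drop on CPU-only setups
)

prompt = "La capitale dell'Italia è"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is a pretrained base model, plain-text continuation as above is the expected interaction pattern rather than chat-style prompting.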
@@ -138,7 +142,6 @@ All the reported benchmark data was already present in the LM-Evaluation-Harness
  | [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.2612 |
  | [arc challenge](https://huggingface.co/datasets/alexandrainst/m_arc) (5-shot) | 0.3268 | -->

-
  **English** Data:
  | Task | Accuracy |
  | --- | --- |
@@ -152,19 +155,35 @@ All the reported benchmark data was already present in the LM-Evaluation-Harness
  | [arc challenge](allenai/ai2_arc) (5-shot) | 0.3319 |
  | [arc easy](allenai/ai2_arc) (5-shot) | 0.6540 | -->

-
  ## Training Data

- <!-- Minerva-7B-base-v1.0 was trained on 1T Italian tokens and 1T English tokens sampled from CulturaX.
+ Minerva-7B-base-v1.0 is trained on 1.14T Italian tokens, 1.14T English tokens, and 200B code tokens.
+
+ The training data is a mixture of the following datasets:
+
+ | Dataset | Tokens | Language | Epochs |
+ | --- | --- | --- | --- |
+ | RedPajama-Data-V2 | 687,952,502,784 | Italian | 1.3 |
+ | CulturaX | 158,201,876,480 | Italian | 1.5 |
+ | Wikipedia | 1,265,135,616 | Italian | 1.0 |
+ | Gutenberg/Wikisource | 147,017,728 | Italian | 2.0 |
+ | EurLex | 1,647,013,888 | Italian | 1.0 |
+ | Gazzetta Ufficiale | 1,654,013,952 | Italian | 1.0 |
+ | FineWeb | 1,076,406,624,256 | English | 1.0 |
+ | Wikipedia | 5,259,501,568 | English | 1.0 |
+ | ArXiv | 33,231,106,048 | English | 1.0 |
+ | Gutenberg | 6,947,893,248 | English | 1.0 |
+ | StackExchange | 22,069,268,480 | English | 1.0 |
+ | The Stack V2 | 200,754,900,992 | Code | 1.0 |

- We have extracted some statistics on Italian (115B tokens) and English (210B tokens) documents from CulturaX on the selected sources:
+ <!-- We have extracted some statistics on Italian (115B tokens) and English (210B tokens) documents from CulturaX on the selected sources:

  *Proportion of number of tokens per domain (Italian)*
  <img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_it.png?raw=true" alt="italian-tok-counts" border="0" width="1800px">

  *Proportion of number of tokens per domain (English)*
- <img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_en.png?raw=true" alt="english-tok-counts" border="0" width="1800px"> -->
-
+ <img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_en.png?raw=true" alt="english-tok-counts" border="0" width="1800px">
+ -->
  ## Tokenizer Fertility

  The tokenizer fertility measures the average number of tokens produced per tokenized word.
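As a quick sanity check on the table added in the hunk above, the per-language budgets quoted in the new description (1.14T Italian, 1.14T English, 200B code tokens) are roughly recovered by weighting each dataset's token count by its number of epochs; the sketch below assumes that simple tokens-times-epochs accounting, which the commit itself does not spell out:

```python
# Sanity-check sketch (not from the model card): reproduce the per-language
# token budgets by weighting each dataset's token count by its epoch count.
mixture = [
    # (dataset, tokens, language, epochs) -- values copied from the added table
    ("RedPajama-Data-V2", 687_952_502_784, "Italian", 1.3),
    ("CulturaX", 158_201_876_480, "Italian", 1.5),
    ("Wikipedia (it)", 1_265_135_616, "Italian", 1.0),
    ("Gutenberg/Wikisource", 147_017_728, "Italian", 2.0),
    ("EurLex", 1_647_013_888, "Italian", 1.0),
    ("Gazzetta Ufficiale", 1_654_013_952, "Italian", 1.0),
    ("FineWeb", 1_076_406_624_256, "English", 1.0),
    ("Wikipedia (en)", 5_259_501_568, "English", 1.0),
    ("ArXiv", 33_231_106_048, "English", 1.0),
    ("Gutenberg", 6_947_893_248, "English", 1.0),
    ("StackExchange", 22_069_268_480, "English", 1.0),
    ("The Stack V2", 200_754_900_992, "Code", 1.0),
]

totals: dict[str, float] = {}
for _, tokens, language, epochs in mixture:
    totals[language] = totals.get(language, 0.0) + tokens * epochs

for language, total in totals.items():
    print(f"{language}: {total / 1e12:.2f}T tokens seen during training")
# Prints roughly 1.14T for Italian, 1.14T for English, and 0.20T for code.
```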
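The tokenizer-fertility definition kept as context in the hunk above (average number of tokens produced per word) can be estimated with a short sketch; it assumes whitespace-separated words as the denominator, which may differ from the methodology behind the numbers reported in the full model card:

```python
# Rough fertility estimate (tokens per word); whitespace word-splitting is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-7B-base-v1.0")

def fertility(texts: list[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words

sample_it = ["Minerva è una famiglia di modelli linguistici addestrati da zero sull'italiano."]
print(f"Fertility on the Italian sample: {fertility(sample_it):.2f}")
```

A lower fertility means fewer tokens per word, i.e. the tokenizer represents the language more compactly.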
@@ -195,10 +214,11 @@ Minerva-7B-base-v1.0 is a pretrained base model and, therefore, has no moderatio
  * **Roberto Navigli:** project coordinator

  ### Special thanks for their support
+
  * Giuseppe Fiameni, Nvidia
  * Sergio Orlandini, CINECA

  ## Acknowledgments

  This work was funded by the PNRR MUR project [PE0000013-FAIR](https://fondazione-fair.it).
- We acknowledge the [CINECA](https://www.cineca.it) award "IscB_medit" under the ISCRA initiative, for the availability of high performance computing resources and support.
+ We acknowledge the [CINECA](https://www.cineca.it) award "IscB_medit" under the ISCRA initiative, for the availability of high performance computing resources and support.