Update README.md
README.md (changed)

</div>

# Model Card for Minerva-7B-base-v1.0

Minerva is the first family of **LLMs pretrained from scratch on Italian**, developed by [Sapienza NLP](https://nlp.uniroma1.it)
in collaboration with [Future Artificial Intelligence Research (FAIR)](https://fondazione-fair.it/) and [CINECA](https://www.cineca.it/).
Notably, the Minerva models are truly open (data and model) Italian-English LLMs, with approximately half of the pretraining data
including Italian text.

* [Minerva LLMs - website](https://nlp.uniroma1.it/minerva/)

## Description

This is the model card for **Minerva-7B-base-v1.0**, a 7-billion-parameter model trained on almost 2.5 trillion tokens (1.14 trillion in Italian,
1.14 trillion in English, and 200 billion in code).

This model is part of the Minerva LLM family:

* [Minerva-7B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0)

## 🚨⚠️🚨 Bias, Risks, and Limitations 🚨⚠️🚨

*This section identifies foreseeable harms and misunderstandings.*

This is a foundation model, not subject to alignment. The model may:

We are aware of the biases and potentially problematic/toxic content that current pretrained large language models exhibit: more specifically, as probabilistic models of (Italian and English) languages, they reflect and amplify the biases of their training data.
For more information about this issue, please refer to our survey:

* [Biases in Large Language Models: Origins, Inventory, and Discussion](https://dl.acm.org/doi/full/10.1145/3597307)

## How to use Minerva with Hugging Face transformers
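
Below is a minimal sketch of the standard `transformers` causal-LM loading path; the Hub ID `sapienzanlp/Minerva-7B-base-v1.0`, the dtype, and the generation settings are illustrative assumptions, so adjust them to your hardware and use case.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sapienzanlp/Minerva-7B-base-v1.0"  # assumed Hub ID for this model card

# Load the tokenizer and the model; bfloat16 keeps the 7B weights at roughly 14 GB.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base (non-instruct) model: plain text completion, no chat template.
prompt = "La capitale dell'Italia è"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is a base model, it performs plain text continuation; prompts should be phrased as text to complete rather than as instructions.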

| [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.2612 |
| [arc challenge](https://huggingface.co/datasets/alexandrainst/m_arc) (5-shot) | 0.3268 | -->

**English** Data:
| Task | Accuracy |
| --- | --- |
| [arc challenge](allenai/ai2_arc) (5-shot) | 0.3319 |
| [arc easy](allenai/ai2_arc) (5-shot) | 0.6540 | -->

## Training Data

Minerva-7B-base-v1.0 is trained on 1.14T Italian tokens, 1.14T English tokens, and 200B code tokens.

The training data is a mixture of the following datasets:

| Dataset | Tokens | Language | Epochs |
| --- | --- | --- | --- |
| RedPajama-Data-V2 | 687,952,502,784 | Italian | 1.3 |
| CulturaX | 158,201,876,480 | Italian | 1.5 |
| Wikipedia | 1,265,135,616 | Italian | 1.0 |
| Gutenberg/Wikisource | 147,017,728 | Italian | 2.0 |
| EurLex | 1,647,013,888 | Italian | 1.0 |
| Gazzetta Ufficiale | 1,654,013,952 | Italian | 1.0 |
| FineWeb | 1,076,406,624,256 | English | 1.0 |
| Wikipedia | 5,259,501,568 | English | 1.0 |
| ArXiv | 33,231,106,048 | English | 1.0 |
| Gutenberg | 6,947,893,248 | English | 1.0 |
| StackExchange | 22,069,268,480 | English | 1.0 |
| The Stack V2 | 200,754,900,992 | Code | 1.0 |
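
Multiplying each dataset's token count by its number of epochs and grouping by language reproduces the headline totals above (roughly 1.14T Italian, 1.14T English, and 200B code tokens), which suggests those figures count tokens seen across repeated epochs. A short sketch of that arithmetic, with the values copied from the table:

```python
from collections import defaultdict

# (dataset, tokens, language, epochs) rows copied from the table above.
rows = [
    ("RedPajama-Data-V2", 687_952_502_784, "Italian", 1.3),
    ("CulturaX", 158_201_876_480, "Italian", 1.5),
    ("Wikipedia (it)", 1_265_135_616, "Italian", 1.0),
    ("Gutenberg/Wikisource", 147_017_728, "Italian", 2.0),
    ("EurLex", 1_647_013_888, "Italian", 1.0),
    ("Gazzetta Ufficiale", 1_654_013_952, "Italian", 1.0),
    ("FineWeb", 1_076_406_624_256, "English", 1.0),
    ("Wikipedia (en)", 5_259_501_568, "English", 1.0),
    ("ArXiv", 33_231_106_048, "English", 1.0),
    ("Gutenberg", 6_947_893_248, "English", 1.0),
    ("StackExchange", 22_069_268_480, "English", 1.0),
    ("The Stack V2", 200_754_900_992, "Code", 1.0),
]

totals = defaultdict(float)
for _, tokens, language, epochs in rows:
    totals[language] += tokens * epochs  # tokens actually seen during training

for language, total in totals.items():
    print(f"{language}: {total / 1e12:.2f}T tokens")
# Italian: ~1.14T, English: ~1.14T, Code: ~0.20T
```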

<!-- We have extracted some statistics on Italian (115B tokens) and English (210B tokens) documents from CulturaX on the selected sources:

*Proportion of number of tokens per domain (Italian)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_it.png?raw=true" alt="italian-tok-counts" border="0" width="1800px">

*Proportion of number of tokens per domain (English)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_en.png?raw=true" alt="english-tok-counts" border="0" width="1800px">
-->

## Tokenizer Fertility

The tokenizer fertility measures the average number of tokens produced per tokenized word.
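
As an illustration of the metric, a rough estimate can be obtained by dividing the number of tokens the tokenizer produces by a naive whitespace word count. The snippet below is a sketch under that assumption; the sample sentence is arbitrary, and the official fertility figures may use a different word-counting convention.

```python
from transformers import AutoTokenizer

# Arbitrary Italian sample; a large corpus sample gives a more stable estimate.
text = "La Repubblica riconosce e garantisce i diritti inviolabili della persona."

tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-7B-base-v1.0")

n_words = len(text.split())  # naive whitespace word count
n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])

print(f"fertility ≈ {n_tokens / n_words:.2f} tokens per word")
```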

* **Roberto Navigli:** project coordinator

### Special thanks for their support

* Giuseppe Fiameni, Nvidia
* Sergio Orlandini, CINECA

## Acknowledgments

This work was funded by the PNRR MUR project [PE0000013-FAIR](https://fondazione-fair.it).
We acknowledge the [CINECA](https://www.cineca.it) award "IscB_medit" under the ISCRA initiative for the availability of high-performance computing resources and support.