Update README.md
README.md (changed)
@@ -2,20 +2,20 @@
license: apache-2.0
language:
- en
- it
pipeline_tag: text-generation
---

![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png)

# Occiglot-7B-IT-EN

> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident).
>

**Occiglot-7B-IT-EN** is a generative language model with 7B parameters for Italian and English, trained by the [Occiglot Research Collective](https://occiglot.eu).
It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 113B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample.
Note that the model is a general-purpose base model and was not instruction-fine-tuned nor optimized for chat or other applications. We make an instruction-tuned variant available as [occiglot-7b-it-en-instruct](https://huggingface.co/occiglot/occiglot-7b-it-en-instruct).

This is the first release of an ongoing open research project for multilingual language models.
If you want to train a model for your own language or are working on evaluations, please contact us or join our [Discord server](https://discord.gg/wUpvYs4XvM). **We are open to collaborations!**

@@ -25,7 +25,7 @@ If you want to train a model for your own language or are working on evaluations

- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** English, Italian, and code.
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Compute resources:** [HessianAI's 42](https://hessian.ai/)
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting

@@ -39,20 +39,20 @@ set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-it-en')
>>> set_seed(42)
>>> generator("Salve, sono una modella linguistica,", max_length=40, num_return_sequences=1)
[{'generated_text': "Salve, sono una modella linguistica che può aiutarvi a tradurre testi tra l'italiano e l'inglese. Se mi inviate un testo in italiano"}]
```
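
The same generation can also be run without the pipeline wrapper. The snippet below is a minimal sketch using the standard `transformers` `AutoModelForCausalLM`/`AutoTokenizer` API; the dtype, device placement, and sampling settings are illustrative choices, not values taken from this model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "occiglot/occiglot-7b-it-en"

# Load tokenizer and model; bfloat16 roughly halves memory versus float32,
# and device_map="auto" (requires the accelerate package) places weights on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Same prompt as the pipeline example above.
inputs = tokenizer("Salve, sono una modella linguistica,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```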

## Dataset

The training data is the respective subset of the data used for [occiglot-7b-eu5](https://huggingface.co/occiglot/occiglot-7b-eu5), i.e. Italian plus English and Code.

The data distribution by language (estimated) is as follows:
- English: ~34%
- Code: ~13%
- Italian: ~52%

The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets).
The exact data configuration is [here](https://huggingface.co/occiglot/occiglot-7b-eu5/blob/main/lm-datasets-config.yml).