mbrack committed · commit 516f8a5 · verified · 1 parent: a068e05

Update README.md

Files changed (1): README.md (+10 −10)
README.md CHANGED
@@ -2,20 +2,20 @@
  license: apache-2.0
  language:
  - en
- - fr
+ - it
  pipeline_tag: text-generation
  ---
 
  ![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png)
 
- # Occiglot-7B-FR-EN
+ # Occiglot-7B-IT-EN
 
  > A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident).
  >
 
- **Occiglot-7B-FR-EN** is a generative language model with 7B parameters for French and English and trained by the [Occiglot Research Collective](https://ociglot.eu).
+ **Occiglot-7B-IT-EN** is a generative language model with 7B parameters for Italian and English, trained by the [Occiglot Research Collective](https://occiglot.eu).
  It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 113B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample.
- Note that the model is a general-purpose base model and was not instruction-fine-tuned nor optimized for chat or other applications. We make an instruction tuned variant available as [occiglot-7b-fr-en-instruct](https://huggingface.co/occiglot/occiglot-7b-fr-en-instruct)
+ Note that the model is a general-purpose base model and was not instruction-fine-tuned or optimized for chat or other applications. An instruction-tuned variant is available as [occiglot-7b-it-en-instruct](https://huggingface.co/occiglot/occiglot-7b-it-en-instruct).
 
  This is the first release of an ongoing open research project for multilingual language models.
  If you want to train a model for your own language or are working on evaluations, please contact us or join our [Discord server](https://discord.gg/wUpvYs4XvM). **We are open to collaborations!**
@@ -25,7 +25,7 @@ If you want to train a model for your own language or are working on evaluations
 
  - **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
  - **Model type:** Causal decoder-only transformer language model
- - **Languages:** English, German, and code.
+ - **Languages:** English, Italian, and code.
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
  - **Compute resources:** [HessianAI's 42](https://hessian.ai/)
  - **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting
@@ -39,20 +39,20 @@ set a seed for reproducibility:
 
  ```python
  >>> from transformers import pipeline, set_seed
- >>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-eu5')
+ >>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-it-en')
  >>> set_seed(42)
- >>> generator("Bonjour, Je suis un modèle linguistique,", max_length=40, num_return_sequences=1)
- [{'generated_text': "Bonjour, Je suis un modèle linguistique qui peut t'aider à traduire des textes entre le français et l'anglais. Si tu me donnes un texte en français"}]
+ >>> generator("Salve, sono un modello linguistico,", max_length=40, num_return_sequences=1)
+ [{'generated_text': "Salve, sono un modello linguistico che può aiutarvi a tradurre testi tra l'italiano e l'inglese. Se mi inviate un testo in italiano"}]
  ```
 
  ## Dataset
 
- The training data is the respective subset of the data used for [occiglot-7b-eu5](https://huggingface.co/occiglot/occiglot-7b-eu5), i.e. Spanish plus English and Code.
+ The training data is the respective subset of the data used for [occiglot-7b-eu5](https://huggingface.co/occiglot/occiglot-7b-eu5), i.e. Italian plus English and Code.
 
  The data distribution by language (estimated) is as follows:
  - English: ~34%
  - Code: ~13%
- - French: ~52%
+ - Italian: ~52%
 
  The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets).
  The exact data configuration is [here](https://huggingface.co/occiglot/occiglot-7b-eu5/blob/main/lm-datasets-config.yml).
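A minimal sketch of the same quick-start with the lower-level transformers API, assuming the repository id occiglot/occiglot-7b-it-en from the diff above; do_sample=True is an added assumption so that the fixed seed actually governs sampling, as it does inside the pipeline:

```python
# Sketch: the README's pipeline example expressed with explicit tokenizer/model
# calls. Repo id taken from the updated README; do_sample=True is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

tokenizer = AutoTokenizer.from_pretrained("occiglot/occiglot-7b-it-en")
model = AutoModelForCausalLM.from_pretrained("occiglot/occiglot-7b-it-en")

set_seed(42)  # fix the sampling seed for reproducibility, as in the README
inputs = tokenizer("Salve, sono un modello linguistico,", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```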
 
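Since the README gives both a 113B-token training budget and estimated language shares, a quick back-of-the-envelope breakdown, assuming those percentages apply to the full 113B additional tokens:

```python
# Rough token counts implied by the README's estimated distribution,
# assuming the shares apply to the 113B additional training tokens.
total_tokens = 113e9
shares = {"English": 0.34, "Code": 0.13, "Italian": 0.52}  # README estimates (~99%)
for source, share in shares.items():
    print(f"{source}: ~{share * total_tokens / 1e9:.0f}B tokens")
# -> English: ~38B, Code: ~15B, Italian: ~59B
```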