Update README.md
README.md (changed)
@@ -2,20 +2,20 @@
license: apache-2.0
language:
- en
- it
pipeline_tag: text-generation
---

![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png)

# Occiglot-7B-IT-EN

> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident).
>

**Occiglot-7B-IT-EN** is a generative language model with 7B parameters for Italian and English, trained by the [Occiglot Research Collective](https://occiglot.eu).
It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 113B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample.
Note that the model is a general-purpose base model and was not instruction-fine-tuned nor optimized for chat or other applications. We make an instruction-tuned variant available as [occiglot-7b-it-en-instruct](https://huggingface.co/occiglot/occiglot-7b-it-en-instruct).

This is the first release of an ongoing open research project for multilingual language models.
If you want to train a model for your own language or are working on evaluations, please contact us or join our [Discord server](https://discord.gg/wUpvYs4XvM). **We are open to collaborations!**

@@ -25,7 +25,7 @@ If you want to train a model for your own language or are working on evaluations

- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** English, Italian, and code.
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Compute resources:** [HessianAI's 42](https://hessian.ai/)
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting

@@ -39,20 +39,20 @@ set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-it-en')
>>> set_seed(42)
>>> generator("Salve, sono una modella linguistica,", max_length=40, num_return_sequences=1)
[{'generated_text': "Salve, sono una modella linguistica che può aiutarvi a tradurre testi tra l'italiano e l'inglese. Se mi inviate un testo in italiano"}]
```
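
The same generation can also be run without the pipeline wrapper. The snippet below is a minimal sketch using the standard `transformers` `AutoModelForCausalLM`/`AutoTokenizer` API; the dtype, device placement, and sampling settings are illustrative choices, not values taken from this model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "occiglot/occiglot-7b-it-en"

# Load tokenizer and model; bfloat16 roughly halves memory versus float32,
# and device_map="auto" (requires the accelerate package) places weights on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Same prompt as the pipeline example above.
inputs = tokenizer("Salve, sono una modella linguistica,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```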

## Dataset

The training data is the respective subset of the data used for [occiglot-7b-eu5](https://huggingface.co/occiglot/occiglot-7b-eu5), i.e. Italian plus English and Code.

The data distribution by language (estimated) is as follows:
- English: ~34%
- Code: ~13%
- Italian: ~52%

The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets).
The exact data configuration is [here](https://huggingface.co/occiglot/occiglot-7b-eu5/blob/main/lm-datasets-config.yml).