|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- es |
|
- de |
|
- fr |
|
- it |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png) |
|
|
|
# Occiglot-7B-EU5 |
|
|
|
> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident). |
|
> |
|
|
|
**Occiglot-7B-EU5** is a generative language model with 7B parameters supporting the top-5 EU languages (English, Spanish, French, German, and Italian) and trained by the [German Research Center for Artificial Intelligence (DFKI)](https://www.dfki.de/en/web). |
|
It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 293B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample. |
|
Note that the model is a general-purpose base model and was not instruction-fine-tuned nor optimized for chat or other applications. |
|
|
|
This is the first release of an ongoing open research project for multilingual language models. |
|
If you want to train a model for your own language or are working on evaluations, please contact us. **We are open to collaborations!**
|
|
|
|
|
### Model details |
|
|
|
- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
|
- **Model type:** Causal decoder-only transformer language model |
|
- **Languages:** English, Spanish, French, German, Italian, and code. |
|
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
- **Compute resources:** [HessianAI's 42](https://hessian.ai/) |
|
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting |
|
- **Research labs:** [SAINT](https://www.dfki.de/en/web/research/research-departments/foundations-of-systems-ai) and [SLT](https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology) |
|
|
|
### How to use |
|
|
|
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we |
|
set a seed for reproducibility: |
|
|
|
```python |
|
>>> from transformers import pipeline, set_seed |
|
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-eu5') |
|
>>> set_seed(42) |
|
>>> generator("Hallo, Ich bin ein Sprachmodell,", max_length=40, num_return_sequences=1) |
|
[{'generated_text': 'Hallo, Ich bin ein Sprachmodell, das dir bei der Übersetzung von Texten zwischen Deutsch und Englisch helfen kann. Wenn du mir einen Text in Deutsch'}] |
|
``` |
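
For finer control over generation, the model can also be loaded with the standard `transformers` auto classes. The snippet below is a minimal sketch; the dtype, device placement, and sampling parameters are illustrative assumptions rather than recommended settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (bfloat16 to reduce memory; device_map='auto' requires accelerate)
tokenizer = AutoTokenizer.from_pretrained('occiglot/occiglot-7b-eu5')
model = AutoModelForCausalLM.from_pretrained(
    'occiglot/occiglot-7b-eu5',
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

# French prompt: "The European Union is"
inputs = tokenizer("L'Union européenne est", return_tensors='pt').to(model.device)

# Sample a short continuation; temperature and max_new_tokens are illustrative
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```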
|
|
|
## Dataset |
|
|
|
The training data was split amongst the four target languages (de, es, fr, it), complemented by continued-training data in English and code.
|
|
|
The data distribution by language (estimated) is as follows; a rough token-count breakdown is sketched after the list:
|
- English: ~13% |
|
- Code: ~5% |
|
- German: ~20% |
|
- Spanish: ~20% |
|
- French: ~20% |
|
- Italian: ~20% |
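
Given the ~293B-token total and the estimated shares above, a back-of-the-envelope per-language token count can be computed as follows (the shares are estimates, so the resulting counts are rough):

```python
total_tokens = 293e9  # ~293B additional multilingual and code tokens

# Estimated language shares from the list above (they sum to ~98%, reflecting estimation noise)
shares = {
    'English': 0.13,
    'Code': 0.05,
    'German': 0.20,
    'Spanish': 0.20,
    'French': 0.20,
    'Italian': 0.20,
}

for name, share in shares.items():
    # e.g. German: 0.20 * 293e9 ≈ 58.6B tokens
    print(f"{name}: ~{share * total_tokens / 1e9:.1f}B tokens")
```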
|
|
|
The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets). |
|
The exact data config will be released soon. |
|
|
|
## Training settings |
|
|
|
- Continual pre-training on 128 x A100-80GB on [HessianAI's 42](https://hessian.ai/). |
|
- Framework: [Determined](https://www.determined.ai/) |
|
- Precision: bf16 |
|
- Optimizer: AdamW (lr: 0.00001, warmup_steps: 420) |
|
- Global batch size: 512 (with a block size of 8,192) split over 128 GPUs

- Learning-rate schedule: cosine annealing with warmup
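
For illustration, the optimizer and learning-rate schedule listed above can be expressed in plain PyTorch and `transformers`. This is a minimal sketch, not the actual Determined training configuration; the total number of steps is an estimate derived from the token and batch figures in this card:

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

# Continued pre-training starts from the Mistral-7B-v0.1 base model
model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1')

# AdamW with the learning rate from the settings above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Estimated total steps: ~293B tokens / (512 sequences * 8,192 tokens) ≈ 70k optimizer steps
num_training_steps = 70_000

# Cosine annealing with 420 warmup steps
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=420,
    num_training_steps=num_training_steps,
)
```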
|
|
|
|
|
## Tokenizer |
|
|
|
The tokenizer is unchanged from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
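
As a quick sanity check, the shared vocabulary can be verified by loading both tokenizers (a sketch; downloading the Mistral tokenizer may require accepting the model's terms on the Hub):

```python
from transformers import AutoTokenizer

occiglot_tok = AutoTokenizer.from_pretrained('occiglot/occiglot-7b-eu5')
mistral_tok = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')

# Both should expose the same 32,000-entry SentencePiece vocabulary
print(occiglot_tok.vocab_size, mistral_tok.vocab_size)
print(occiglot_tok.get_vocab() == mistral_tok.get_vocab())
```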
|
|
|
## Evaluation |
|
|
|
Preliminary evaluation results can be found below. |
|
Please note that the non-English results are based on partially machine-translated datasets and English prompts ([Belebele](https://huggingface.co/datasets/facebook/belebele) and [Okapi framework](https://github.com/nlp-uoregon/Okapi)) and should therefore be interpreted with caution, as they may be biased towards English model performance.
|
Currently, we are working on more suitable benchmarks for Spanish, French, German, and Italian. |
|
|
|
### All languages |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.5277 | 0.6825 | 0.7687 | 0.6287 | 0.6519 | |
|
| leo-mistral-hessianai-7b | 0.4614 | 0.6423 | 0.6524 | 0.5440 | 0.5750 | |
|
| Occiglot-7B-EU5 | 0.5083 | 0.7191 | 0.6758 | 0.5432 | 0.6116 | |
|
|
|
### English |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.6143 | 0.8344 | 0.8444 | 0.6351 | 0.7321 | |
|
| leo-mistral-hessianai-7b | 0.5213 | 0.7779 | 0.7356 | 0.5508 | 0.6464 | |
|
| Occiglot-7B-EU5 | 0.5307 | 0.7900 | 0.7267 | 0.5467 | 0.6485 | |
|
|
|
### German |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.4765 | 0.6101 | 0.7411 | 0.5274 | 0.5888 | |
|
| leo-mistral-hessianai-7b | 0.4739 | 0.6818 | 0.6900 | 0.4887 | 0.5836 | |
|
| Occiglot-7B-EU5 | 0.4944 | 0.6667 | 0.6467 | 0.4833 | 0.5728 | |
|
|
|
### Spanish |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.5256 | 0.6728 | 0.7478 | 0.5432 | 0.6224 | |
|
| leo-mistral-hessianai-7b | 0.4436 | 0.5970 | 0.6178 | 0.4359 | 0.5236 | |
|
| Occiglot-7B-EU5 | 0.5085 | 0.7255 | 0.6778 | 0.4997 | 0.6029 | |
|
|
|
### French |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.5244 | 0.6651 | 0.7744 | 0.5413 | 0.6263 | |
|
| leo-mistral-hessianai-7b | 0.4354 | 0.5967 | 0.6222 | 0.4326 | 0.5217 | |
|
| Occiglot-7B-EU5 | 0.5064 | 0.7125 | 0.6756 | 0.4959 | 0.5976 | |
|
|
|
### Italian |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.4979 | 0.6303 | 0.7356 | 0.5372 | 0.6002 | |
|
| leo-mistral-hessianai-7b | 0.4328 | 0.5580 | 0.5967 | 0.4311 | 0.5047 | |
|
| Occiglot-7B-EU5 | 0.5013 | 0.7008 | 0.6522 | 0.4949 | 0.5873 | |
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
The model training was supported by a compute grant at the [42 supercomputer](https://hessian.ai/) which is a central component in the development of [hessian AI](https://hessian.ai/), the [AI Innovation Lab](https://hessian.ai/infrastructure/ai-innovationlab/) ([HMWK](https://wissenschaft.hessen.de) & [HMinD](https://innen.hessen.de)) and the [AI Service Centers](https://hessian.ai/infrastructure/ai-service-centre/) ([BMBF](https://www.bmbf.de/bmbf/en/home/home_node.html)). |
|
The curation of the training data is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html) |
|
through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D). |
|
|
|
|
|
## License |
|
|
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
|
|
|
|
|