---
license: apache-2.0
language:
- en
- es
- de
- fr
- it
pipeline_tag: text-generation
---
![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png)
# Occiglot-7B-EU5
> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident).
>
**Occiglot-7B-EU5** is a generative language model with 7B parameters supporting the top-5 EU languages (English, Spanish, French, German, and Italian) and trained by the [German Research Center for Artificial Intelligence (DFKI)](https://www.dfki.de/en/web).
It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 293B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample.
Note that the model is a general-purpose base model; it has not been instruction-tuned or optimized for chat or other applications.
This is the first release of an ongoing open research project for multilingual language models.
If you want to train a model for your own language or are working on evaluations, please contact us. **We are open to collaborations!**
### Model details
- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** English, Spanish, French, German, Italian, and code.
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Compute resources:** [HessianAI's 42](https://hessian.ai/)
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting
- **Research labs:** [SAINT](https://www.dfki.de/en/web/research/research-departments/foundations-of-systems-ai) and [SLT](https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology)
### How to use
You can use this model directly with a pipeline for text generation. Since generation relies on some randomness, we
set a seed for reproducibility. The example below prompts the model in German ("Hello, I am a language model,"), and the model continues in German:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-eu5')
>>> set_seed(42)
>>> generator("Hallo, Ich bin ein Sprachmodell,", max_length=40, num_return_sequences=1)
[{'generated_text': 'Hallo, Ich bin ein Sprachmodell, das dir bei der Übersetzung von Texten zwischen Deutsch und Englisch helfen kann. Wenn du mir einen Text in Deutsch'}]
```
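Equivalently, you can load the model and tokenizer directly. A minimal sketch using the standard `transformers` API (the Spanish prompt and sampling parameters are illustrative, not part of the original card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("occiglot/occiglot-7b-eu5")
model = AutoModelForCausalLM.from_pretrained(
    "occiglot/occiglot-7b-eu5",
    torch_dtype=torch.bfloat16,  # bf16 matches the training precision
    device_map="auto",           # requires the `accelerate` package
)

# Spanish prompt: "The capital of Spain is"
inputs = tokenizer("La capital de España es", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```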
## Dataset
The training data was split among the four target languages (de, es, fr, it), with additional English and code data included for the continued training.
The data distribution by language (estimated) is as follows:
- English: ~13%
- Code: ~5%
- German: ~20%
- Spanish: ~20%
- French: ~20%
- Italian: ~20%
The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets).
The exact data config will be released soon.
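For a rough sense of scale, the estimated shares above can be converted into approximate per-language token counts out of the 293B-token budget. A small illustrative calculation (the shares are estimates, so the derived counts are too):

```python
TOTAL_TOKENS = 293e9  # 293B tokens of continued pretraining

# Estimated language shares from the list above.
shares = {"en": 0.13, "code": 0.05, "de": 0.20, "es": 0.20, "fr": 0.20, "it": 0.20}

for lang, share in shares.items():
    print(f"{lang}: ~{share * TOTAL_TOKENS / 1e9:.0f}B tokens")
# en: ~38B, code: ~15B, and roughly 59B each for de/es/fr/it
```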
## Training settings
- Continued pretraining on 128 x A100-80GB GPUs on [HessianAI's 42](https://hessian.ai/).
- Framework: [Determined](https://www.determined.ai/)
- Precision: bf16
- Optimizer: AdamW (lr: 0.00001, warmup_steps: 420)
- Global batch size: 512 sequences (block size: 8,192 tokens) split over 128 GPUs
- Learning-rate schedule: cosine annealing with warmup
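A minimal sketch of this optimizer and schedule setup in PyTorch/`transformers` (the step count is derived from 293B tokens / (512 × 8,192 tokens per step) and is an approximation; the model below is a hypothetical stand-in, not the actual training code):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # hypothetical stand-in for Mistral-7B-v0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# 293e9 tokens / (512 sequences * 8,192 tokens) ≈ 70k optimizer steps
num_training_steps = int(293e9 // (512 * 8192))
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=420,
    num_training_steps=num_training_steps,
)
```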
## Tokenizer
The tokenizer is unchanged from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
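Since the tokenizer is identical, both models produce the same token ids for any input. A quick sanity check (assuming both tokenizers are accessible on the Hub):

```python
from transformers import AutoTokenizer

occiglot_tok = AutoTokenizer.from_pretrained("occiglot/occiglot-7b-eu5")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Identical vocabularies imply identical tokenization.
assert occiglot_tok.get_vocab() == mistral_tok.get_vocab()
print(occiglot_tok.vocab_size)  # 32000 for the Mistral tokenizer
```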
## Evaluation
Preliminary evaluation results can be found below.
Please note that the non-English results are based on partially machine-translated datasets and English prompts ([Belebele](https://huggingface.co/datasets/facebook/belebele) and the [Okapi framework](https://github.com/nlp-uoregon/Okapi)) and should therefore be interpreted with caution; for example, they may be biased towards English model performance.
Currently, we are working on more suitable benchmarks for Spanish, French, German, and Italian.
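For reference, results of this kind can be reproduced with EleutherAI's lm-evaluation-harness. A rough sketch for the English benchmarks only (this is not the exact setup used here; task names, especially for the machine-translated non-English variants, vary across harness versions):

```python
import lm_eval  # EleutherAI lm-evaluation-harness, v0.4+

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=occiglot/occiglot-7b-eu5,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu"],  # add "belebele" if your version ships it
)
print(results["results"])
```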
### All languages
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1 | 0.5277 | 0.6825 | 0.7687 | 0.6287 | 0.6519 |
| leo-mistral-hessianai-7b | 0.4614 | 0.6423 | 0.6524 | 0.5440 | 0.5750 |
| Occiglot-7B-EU5 | 0.5083 | 0.7191 | 0.6758 | 0.5432 | 0.6116 |
### English
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1 | 0.6143 | 0.8344 | 0.8444 | 0.6351 | 0.7321 |
| leo-mistral-hessianai-7b | 0.5213 | 0.7779 | 0.7356 | 0.5508 | 0.6464 |
| Occiglot-7B-EU5 | 0.5307 | 0.7900 | 0.7267 | 0.5467 | 0.6485 |
### German
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1 | 0.4765 | 0.6101 | 0.7411 | 0.5274 | 0.5888 |
| leo-mistral-hessianai-7b | 0.4739 | 0.6818 | 0.6900 | 0.4887 | 0.5836 |
| Occiglot-7B-EU5 | 0.4944 | 0.6667 | 0.6467 | 0.4833 | 0.5728 |
### Spanish
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1 | 0.5256 | 0.6728 | 0.7478 | 0.5432 | 0.6224 |
| leo-mistral-hessianai-7b | 0.4436 | 0.5970 | 0.6178 | 0.4359 | 0.5236 |
| Occiglot-7B-EU5 | 0.5085 | 0.7255 | 0.6778 | 0.4997 | 0.6029 |
### French
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1 | 0.5244 | 0.6651 | 0.7744 | 0.5413 | 0.6263 |
| leo-mistral-hessianai-7b | 0.4354 | 0.5967 | 0.6222 | 0.4326 | 0.5217 |
| Occiglot-7B-EU5 | 0.5064 | 0.7125 | 0.6756 | 0.4959 | 0.5976 |
### Italian
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1 | 0.4979 | 0.6303 | 0.7356 | 0.5372 | 0.6002 |
| leo-mistral-hessianai-7b | 0.4328 | 0.5580 | 0.5967 | 0.4311 | 0.5047 |
| Occiglot-7B-EU5 | 0.5013 | 0.7008 | 0.6522 | 0.4949 | 0.5873 |
## Acknowledgements
The model training was supported by a compute grant at the [42 supercomputer](https://hessian.ai/), which is a central component in the development of [hessian AI](https://hessian.ai/), the [AI Innovation Lab](https://hessian.ai/infrastructure/ai-innovationlab/) ([HMWK](https://wissenschaft.hessen.de) & [HMinD](https://innen.hessen.de)) and the [AI Service Centers](https://hessian.ai/infrastructure/ai-service-centre/) ([BMBF](https://www.bmbf.de/bmbf/en/home/home_node.html)).
The curation of the training data is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D).
## License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)