---
license: apache-2.0
language:
- en
- es
- de
- fr
- it
pipeline_tag: text-generation
---

![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png)

# Occiglot-7B-EU5

> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident).

**Occiglot-7B-EU5** is a generative language model with 7B parameters supporting the top-5 EU languages (English, Spanish, French, German, and Italian) and trained by the [German Research Center for Artificial Intelligence (DFKI)](https://www.dfki.de/en/web).
It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 293B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample.
Note that the model is a general-purpose base model; it has been neither instruction-fine-tuned nor optimized for chat or other applications.

This is the first release of an ongoing open research project for multilingual language models. 
If you want to train a model for your own language or are working on evaluations, please contact us. **We are open to collaborations!**


### Model details

- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** English, Spanish, French, German, Italian, and code.
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Compute resources:** [HessianAI's 42](https://hessian.ai/)
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting
- **Research labs:** [SAINT](https://www.dfki.de/en/web/research/research-departments/foundations-of-systems-ai) and [SLT](https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology)

### How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we
set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-eu5')
>>> set_seed(42)
>>> generator("Hallo, Ich bin ein Sprachmodell,", max_length=40, num_return_sequences=1)
[{'generated_text': 'Hallo, Ich bin ein Sprachmodell, das dir bei der Übersetzung von Texten zwischen Deutsch und Englisch helfen kann. Wenn du mir einen Text in Deutsch'}]
```

## Dataset

The training data was split among the four target languages (de, es, fr, it), with additional English and code data included for continued training.

The data distribution by language (estimated) is as follows:
- English: ~13%
- Code: ~5%
- German: ~20%
- Spanish: ~20%
- French: ~20%
- Italian: ~20%
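
To make the shares concrete, the sketch below converts them into rough token counts. It assumes the estimated percentages apply to the full 293B-token budget mentioned in the introduction; that is an assumption for illustration, not a figure reported by the authors. The names `TOTAL_TOKENS`, `SHARES`, and `tokens_per_language` are hypothetical.

```python
# Rough token counts per language.
# ASSUMPTION: the estimated shares apply to the full 293B-token budget.
TOTAL_TOKENS = 293e9  # "293B tokens" from the model description

# Estimated shares copied from the list above.
SHARES = {
    "English": 0.13,
    "Code": 0.05,
    "German": 0.20,
    "Spanish": 0.20,
    "French": 0.20,
    "Italian": 0.20,
}


def tokens_per_language(total, shares):
    """Multiply the total token budget by each language's share."""
    return {name: total * share for name, share in shares.items()}


estimates = tokens_per_language(TOTAL_TOKENS, SHARES)
for name, count in estimates.items():
    print(f"{name:<8} ~{count / 1e9:.1f}B tokens")

# The listed shares are approximate and do not add up to exactly 100%.
covered = sum(SHARES.values())
print(f"Listed shares cover ~{covered:.0%} of the total")
```

Because the percentages are estimates, the resulting counts should be read as orders of magnitude rather than exact values.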

The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets). 
The exact data config will be released soon.

## Training settings

- Continual pre-training on 128 x A100-80GB on [HessianAI's 42](https://hessian.ai/). 
- Framework: [Determined](https://www.determined.ai/)
- Precision: bf16
- Optimizer: AdamW (lr: 0.00001, warmup_steps: 420)
- Global batch size: 512 sequences (block size of 8,192 tokens), split over 128 GPUs
- Learning rate schedule: cosine annealing with warmup
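
The settings above imply a few derived quantities, which the following sketch works out, together with one common form of a cosine schedule with linear warmup. This is not the authors' training code: the total step count is estimated from the 293B-token budget, the final learning rate `MIN_LR` is assumed to be zero, and the per-GPU figure assumes no gradient accumulation. The function `lr_at` is a hypothetical helper.

```python
import math

# Values taken from the settings list above.
GLOBAL_BATCH = 512    # sequences per optimizer step
BLOCK_SIZE = 8192     # tokens per sequence
NUM_GPUS = 128
PEAK_LR = 1e-5
WARMUP_STEPS = 420
TOTAL_TOKENS = 293e9  # from the model description

# ASSUMPTION: the schedule decays to zero; the card does not state a floor.
MIN_LR = 0.0

tokens_per_step = GLOBAL_BATCH * BLOCK_SIZE  # 4,194,304 tokens
# ASSUMPTION: no gradient accumulation, so each GPU sees this many sequences.
seqs_per_gpu = GLOBAL_BATCH // NUM_GPUS
# Estimated number of optimizer steps needed to consume the token budget.
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(f"tokens per step:   {tokens_per_step:,}")
print(f"sequences per GPU: {seqs_per_gpu}")
print(f"estimated steps:   {total_steps:,}")
for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With roughly 4.2M tokens per step, the 420 warmup steps cover well under 1% of the estimated run, so the schedule is dominated by the cosine decay phase.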


## Tokenizer

The tokenizer is unchanged from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).

## Evaluation

Preliminary evaluation results can be found below. 
Please note that the non-English results are based on partially machine-translated datasets and English prompts ([Belebele](https://huggingface.co/datasets/facebook/belebele) and [Okapi framework](https://github.com/nlp-uoregon/Okapi)) and should therefore be interpreted with caution; for example, they may be biased in favor of models with strong English performance.
Currently, we are working on more suitable benchmarks for Spanish, French, German, and Italian.

### All languages

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          |            0.5277 |        0.6825 |       0.7687 |   0.6287 |  0.6519 |
| leo-mistral-hessianai-7b |            0.4614 |        0.6423 |       0.6524 |   0.5440 |  0.5750 |
| Occiglot-7B-EU5          |            0.5083 |        0.7191 |       0.6758 |   0.5432 |  0.6116 |
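
The `avg` column appears to be the unweighted mean of the four benchmark scores in each row. The short check below recomputes it for the table above; the dictionary names are hypothetical and the numbers are copied from the table.

```python
from statistics import mean

# Scores from the "All languages" table:
# arc_challenge, hellaswag, belebele, mmlu
scores = {
    "Mistral-7B-v0.1": [0.5277, 0.6825, 0.7687, 0.6287],
    "leo-mistral-hessianai-7b": [0.4614, 0.6423, 0.6524, 0.5440],
    "Occiglot-7B-EU5": [0.5083, 0.7191, 0.6758, 0.5432],
}

# Values reported in the avg column.
reported_avg = {
    "Mistral-7B-v0.1": 0.6519,
    "leo-mistral-hessianai-7b": 0.5750,
    "Occiglot-7B-EU5": 0.6116,
}

computed_avg = {model: mean(values) for model, values in scores.items()}

for model, value in computed_avg.items():
    delta = abs(value - reported_avg[model])
    print(f"{model:<26} computed={value:.4f} "
          f"reported={reported_avg[model]:.4f} diff={delta:.5f}")
```

The recomputed values agree with the reported ones to within rounding, so the same calculation can be applied to the per-language tables.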

### English

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          |            0.6143 |        0.8344 |       0.8444 |   0.6351 |  0.7321 |
| leo-mistral-hessianai-7b |            0.5213 |        0.7779 |       0.7356 |   0.5508 |  0.6464 |
| Occiglot-7B-EU5          |            0.5307 |        0.7900 |       0.7267 |   0.5467 |  0.6485 |

### German

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          |            0.4765 |        0.6101 |       0.7411 |   0.5274 |  0.5888 |
| leo-mistral-hessianai-7b |            0.4739 |        0.6818 |       0.6900 |   0.4887 |  0.5836 |
| Occiglot-7B-EU5          |            0.4944 |        0.6667 |       0.6467 |   0.4833 |  0.5728 |

### Spanish

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          |            0.5256 |        0.6728 |       0.7478 |   0.5432 |  0.6224 |
| leo-mistral-hessianai-7b |            0.4436 |        0.5970 |       0.6178 |   0.4359 |  0.5236 |
| Occiglot-7B-EU5          |            0.5085 |        0.7255 |       0.6778 |   0.4997 |  0.6029 |

### French

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          |            0.5244 |        0.6651 |       0.7744 |   0.5413 |  0.6263 |
| leo-mistral-hessianai-7b |            0.4354 |        0.5967 |       0.6222 |   0.4326 |  0.5217 |
| Occiglot-7B-EU5          |            0.5064 |        0.7125 |       0.6756 |   0.4959 |  0.5976 |

### Italian

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          |            0.4979 |        0.6303 |       0.7356 |   0.5372 |  0.6002 |
| leo-mistral-hessianai-7b |            0.4328 |        0.5580 |       0.5967 |   0.4311 |  0.5047 |
| Occiglot-7B-EU5          |            0.5013 |        0.7008 |       0.6522 |   0.4949 |  0.5873 |



## Acknowledgements

The model training was supported by a compute grant at the [42 supercomputer](https://hessian.ai/), which is a central component in the development of [hessian AI](https://hessian.ai/), the [AI Innovation Lab](https://hessian.ai/infrastructure/ai-innovationlab/) ([HMWK](https://wissenschaft.hessen.de) & [HMinD](https://innen.hessen.de)) and the [AI Service Centers](https://hessian.ai/infrastructure/ai-service-centre/) ([BMBF](https://www.bmbf.de/bmbf/en/home/home_node.html)).
The curation of the training data is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)
through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D).


## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)