|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- es |
|
- de |
|
- fr |
|
- it |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png) |
|
|
|
# Occiglot-7B-EU5 |
|
|
|
> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident). |
|
> |
|
|
|
**Occiglot-7B-EU5** is a generative language model with 7B parameters supporting the top-5 EU languages (English, Spanish, French, German, and Italian) and trained by the [German Research Center for Artificial Intelligence (DFKI)](https://www.dfki.de/en/web). |
|
It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 293B tokens of additional multilingual and code data with a block size of 8,192 tokens per sample. |
|
Note that the model is a general-purpose base model and was not instruction-fine-tuned nor optimized for chat or other applications. |
|
|
|
This is the first release of an ongoing open research project for multilingual language models. |
|
If you want to train a model for your own language or are working on evaluations, please contact us. **We are open to collaborations!**
|
|
|
|
|
### Model details |
|
|
|
- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
|
- **Model type:** Causal decoder-only transformer language model |
|
- **Languages:** English, Spanish, French, German, Italian, and code. |
|
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
- **Compute resources:** [HessianAI's 42](https://hessian.ai/) |
|
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting |
|
- **Research labs:** [SAINT](https://www.dfki.de/en/web/research/research-departments/foundations-of-systems-ai) and [SLT](https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology) |
|
|
|
### How to use |
|
|
|
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we |
|
set a seed for reproducibility: |
|
|
|
```python |
|
>>> from transformers import pipeline, set_seed |
|
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-eu5') |
|
>>> set_seed(42) |
|
>>> generator("Hallo, Ich bin ein Sprachmodell,", max_length=40, num_return_sequences=1) |
|
[{'generated_text': 'Hallo, Ich bin ein Sprachmodell, das dir bei der Übersetzung von Texten zwischen Deutsch und Englisch helfen kann. Wenn du mir einen Text in Deutsch'}] |
|
``` |
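
For finer control over generation, the model can also be loaded with the standard `transformers` auto classes. The snippet below is a minimal sketch; the dtype, device placement, and sampling parameters are illustrative assumptions rather than recommended settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (bfloat16 to reduce memory; device_map='auto' requires accelerate)
tokenizer = AutoTokenizer.from_pretrained('occiglot/occiglot-7b-eu5')
model = AutoModelForCausalLM.from_pretrained(
    'occiglot/occiglot-7b-eu5',
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

# French prompt: "The European Union is"
inputs = tokenizer("L'Union européenne est", return_tensors='pt').to(model.device)

# Sample a short continuation; temperature and max_new_tokens are illustrative
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```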
|
|
|
## Dataset |
|
|
|
The training data was split amongst the four target languages (de, es, fr, it), complemented by continued-training data in English and code.
|
|
|
The data distribution by language (estimated) is as follows; a rough token-count breakdown is sketched after the list:
|
- English: ~13% |
|
- Code: ~5% |
|
- German: ~20% |
|
- Spanish: ~20% |
|
- French: ~20% |
|
- Italian: ~20% |
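
Given the ~293B-token total and the estimated shares above, a back-of-the-envelope per-language token count can be computed as follows (the shares are estimates, so the resulting counts are rough):

```python
total_tokens = 293e9  # ~293B additional multilingual and code tokens

# Estimated language shares from the list above (they sum to ~98%, reflecting estimation noise)
shares = {
    'English': 0.13,
    'Code': 0.05,
    'German': 0.20,
    'Spanish': 0.20,
    'French': 0.20,
    'Italian': 0.20,
}

for name, share in shares.items():
    # e.g. German: 0.20 * 293e9 ≈ 58.6B tokens
    print(f"{name}: ~{share * total_tokens / 1e9:.1f}B tokens")
```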
|
|
|
The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets). |
|
The exact data config will be released soon. |
|
|
|
## Training settings |
|
|
|
- Continual pre-training on 128 x A100-80GB on [HessianAI's 42](https://hessian.ai/). |
|
- Framework: [Determined](https://www.determined.ai/) |
|
- Precision: bf16 |
|
- Optimizer: AdamW (lr: 0.00001, warmup_steps: 420) |
|
- Global batch size: 512 (with a block size of 8,192) split over 128 GPUs

- Learning-rate schedule: cosine annealing with warmup
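
For illustration, the optimizer and learning-rate schedule listed above can be expressed in plain PyTorch and `transformers`. This is a minimal sketch, not the actual Determined training configuration; the total number of steps is an estimate derived from the token and batch figures in this card:

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

# Continued pre-training starts from the Mistral-7B-v0.1 base model
model = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-v0.1')

# AdamW with the learning rate from the settings above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Estimated total steps: ~293B tokens / (512 sequences * 8,192 tokens) ≈ 70k optimizer steps
num_training_steps = 70_000

# Cosine annealing with 420 warmup steps
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=420,
    num_training_steps=num_training_steps,
)
```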
|
|
|
|
|
## Tokenizer |
|
|
|
The tokenizer is unchanged from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
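
As a quick sanity check, the shared vocabulary can be verified by loading both tokenizers (a sketch; downloading the Mistral tokenizer may require accepting the model's terms on the Hub):

```python
from transformers import AutoTokenizer

occiglot_tok = AutoTokenizer.from_pretrained('occiglot/occiglot-7b-eu5')
mistral_tok = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')

# Both should expose the same 32,000-entry SentencePiece vocabulary
print(occiglot_tok.vocab_size, mistral_tok.vocab_size)
print(occiglot_tok.get_vocab() == mistral_tok.get_vocab())
```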
|
|
|
## Evaluation |
|
|
|
Preliminary evaluation results can be found below. |
|
Please note that the non-English results are based on partially machine-translated datasets and English prompts ([Belebele](https://huggingface.co/datasets/facebook/belebele) and [Okapi framework](https://github.com/nlp-uoregon/Okapi)) and should therefore be interpreted with caution, as they may be biased towards English model performance.
|
Currently, we are working on more suitable benchmarks for Spanish, French, German, and Italian. |
|
|
|
### All languages |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.5277 | 0.6825 | 0.7687 | 0.6287 | 0.6519 | |
|
| leo-mistral-hessianai-7b | 0.4614 | 0.6423 | 0.6524 | 0.5440 | 0.5750 | |
|
| Occiglot-7B-EU5 | 0.5083 | 0.7191 | 0.6758 | 0.5432 | 0.6116 | |
|
|
|
### English |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.6143 | 0.8344 | 0.8444 | 0.6351 | 0.7321 | |
|
| leo-mistral-hessianai-7b | 0.5213 | 0.7779 | 0.7356 | 0.5508 | 0.6464 | |
|
| Occiglot-7B-EU5 | 0.5307 | 0.7900 | 0.7267 | 0.5467 | 0.6485 | |
|
|
|
### German |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.4765 | 0.6101 | 0.7411 | 0.5274 | 0.5888 | |
|
| leo-mistral-hessianai-7b | 0.4739 | 0.6818 | 0.6900 | 0.4887 | 0.5836 | |
|
| Occiglot-7B-EU5 | 0.4944 | 0.6667 | 0.6467 | 0.4833 | 0.5728 | |
|
|
|
### Spanish |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.5256 | 0.6728 | 0.7478 | 0.5432 | 0.6224 | |
|
| leo-mistral-hessianai-7b | 0.4436 | 0.5970 | 0.6178 | 0.4359 | 0.5236 | |
|
| Occiglot-7B-EU5 | 0.5085 | 0.7255 | 0.6778 | 0.4997 | 0.6029 | |
|
|
|
### French |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.5244 | 0.6651 | 0.7744 | 0.5413 | 0.6263 | |
|
| leo-mistral-hessianai-7b | 0.4354 | 0.5967 | 0.6222 | 0.4326 | 0.5217 | |
|
| Occiglot-7B-EU5 | 0.5064 | 0.7125 | 0.6756 | 0.4959 | 0.5976 | |
|
|
|
### Italian |
|
|
|
| **model_name** | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** | |
|
|--------------------------|-------------------|---------------|--------------|----------|---------| |
|
| Mistral-7B-v0.1 | 0.4979 | 0.6303 | 0.7356 | 0.5372 | 0.6002 | |
|
| leo-mistral-hessianai-7b | 0.4328 | 0.5580 | 0.5967 | 0.4311 | 0.5047 | |
|
| Occiglot-7B-EU5 | 0.5013 | 0.7008 | 0.6522 | 0.4949 | 0.5873 | |
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
The model training was supported by a compute grant at the [42 supercomputer](https://hessian.ai/) which is a central component in the development of [hessian AI](https://hessian.ai/), the [AI Innovation Lab](https://hessian.ai/infrastructure/ai-innovationlab/) ([HMWK](https://wissenschaft.hessen.de) & [HMinD](https://innen.hessen.de)) and the [AI Service Centers](https://hessian.ai/infrastructure/ai-service-centre/) ([BMBF](https://www.bmbf.de/bmbf/en/home/home_node.html)). |
|
The curation of the training data is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html) |
|
through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D). |
|
|
|
|
|
## License |
|
|
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
|
|
|
|
|