malteos's picture
Add ggml-model-bloom-6b4-clp-german-f16.bin from bloomz.cpp converter. (#2)
15fda87
|
raw
history blame
No virus
1.38 kB
---
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
---
# BLOOM-CLP German (6.4B parameters)
This is a monolingual German language model trained using the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method based on [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1).
You can try out the model at [European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/20825/try%20out/).
## Training dataset
- ca. 50B German tokens
- Web-crawled content from the German subset [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
- Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
- Both Web-crawled datasets are deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
- German court decisions from [Open Legal Data](http://openlegaldata.io/)
## Code
- [BigScience's Megatron-Deepspeed fork](https://github.com/bigscience-workshop/Megatron-DeepSpeed)
## Hardware
- 32xA100-40GB GPUs
- 12.5 days
- [Tensorboard logs](https://huggingface.co/malteos/bloom-6b4-clp-german-logs/tensorboard)
## Evaluation
TBA (see paper)