---
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
---

# BLOOM-CLP German (6.4B parameters)

This is a monolingual German language model trained with the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method, starting from [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1).

You can try out the model at [European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/20825/try%20out/).
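The model can also be loaded locally with the `transformers` library (the `library_name` declared in the metadata above). A minimal sketch, assuming the repository id is `malteos/bloom-6b4-clp-german` (inferred from the Tensorboard logs link below; substitute the actual model id if it differs):

```python
# Minimal text-generation sketch using the Hugging Face transformers pipeline.
# NOTE: the model id below is an assumption inferred from the logs repository
# name; replace it with the actual Hub id if needed.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="malteos/bloom-6b4-clp-german",
)

output = generator(
    "Die Hauptstadt von Deutschland ist",  # "The capital of Germany is"
    max_new_tokens=20,
    do_sample=False,  # greedy decoding for a deterministic sketch
)
print(output[0]["generated_text"])
```

Note that a 6.4B-parameter model needs roughly 13 GB of memory in float16; pass `torch_dtype=torch.float16` and `device_map="auto"` to the pipeline if you load it on a GPU.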

## Training dataset

- Approx. 50B German tokens
- Web-crawled content from the German subset of [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
- Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
- Both web-crawled datasets were deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
- German court decisions from [Open Legal Data](http://openlegaldata.io/)

## Code

- [BigScience's Megatron-Deepspeed fork](https://github.com/bigscience-workshop/Megatron-DeepSpeed)

## Hardware

- 32x A100-40GB GPUs
- 12.5 days of training
- [Tensorboard logs](https://huggingface.co/malteos/bloom-6b4-clp-german-logs/tensorboard)

## Evaluation

To be announced (see the CLP-Transfer paper).