---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---
|
|
|
# BübleLM |
|
|
|
|
|
<div align="center" style="margin-bottom: 2rem; margin-top: 2rem"> |
|
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" style="max-height: 450px; width: auto;"/> |
|
<h1 style="margin-top: 1rem;">BübleLM</h1> |
|
<p><em>A small German LM</em></p> |
|
</div> |
|
|
|
BübleLM is a German language model based on Gemma-2-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities. |
|
|
|
## Model Details |
|
|
|
- **Architecture**: Based on the Gemma-2-2B decoder-only architecture
|
- **Parameters**: 2 billion |
|
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary) |
|
- Fertility rate: 1.78 tokens per word |
|
- Optimized for German morphological structures |
|
- Trained on the same corpus as the model |
|
- **Context Length**: 8192 tokens |
|
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
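
Fertility here is the average number of tokens produced per word. The sketch below computes it for any tokenizer callable and uses the reported 1.78 to estimate how many German words fit into the 8192-token context. The `fertility` helper and the toy chunking tokenizer are illustrative assumptions, not the actual BübleLM tokenizer, and words are approximated by whitespace splitting.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words


def toy_tokenize(text: str, max_len: int = 6) -> List[str]:
    """Stand-in tokenizer (assumption): cuts each word into chunks of up to max_len characters."""
    return [w[i:i + max_len] for w in text.split() for i in range(0, len(w), max_len)]


sample = ["Berlin ist eine Stadt", "Donaudampfschifffahrt"]
toy_fertility = fertility(sample, toy_tokenize)
print(f"toy fertility: {toy_fertility:.2f} tokens per word")

# Rough capacity estimate from the figures reported above.
context_tokens = 8192
reported_fertility = 1.78
approx_words_in_context = int(context_tokens / reported_fertility)
print(f"about {approx_words_in_context} words fit into the context window")
```

To measure the real tokenizer, `toy_tokenize` could be swapped for the `tokenize` method of the tokenizer loaded via `AutoTokenizer.from_pretrained`.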
|
|
|
## Training Data |
|
|
|
Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
|
- Contemporary web content (OSCAR 2015-2023) |
|
- Legislative documents (EurLex, ParlaMint)
|
- News data (Tagesschau) |
|
- Wiki sources |
|
|
|
Data sampling weights: |
|
- Wikipedia: 4x |
|
- News/Parliamentary: 2x |
|
- Other sources: 1x |
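
One plausible reading of these weights is that each source is upsampled by its factor before mixing. The sketch below turns per-source token counts into effective mixture shares under that assumption. The token counts are hypothetical placeholders, not statistics from the training run, and `mixture_shares` is an illustrative helper.

```python
# Upsampling factors as listed above.
weights = {"wikipedia": 4, "news_parliamentary": 2, "other": 1}

# Hypothetical raw token counts per source (placeholders, not real statistics).
raw_tokens = {"wikipedia": 100, "news_parliamentary": 200, "other": 1000}


def mixture_shares(raw, weights):
    """Share of each source in the training mix after upsampling by its weight."""
    effective = {name: raw[name] * weights[name] for name in raw}
    total = sum(effective.values())
    return {name: count / total for name, count in effective.items()}


shares = mixture_shares(raw_tokens, weights)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these placeholder counts, Wikipedia grows from under 8% of the raw tokens to about 22% of the mix, which shows how a 4x weight can lift a small, high-quality source.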
|
|
|
## Performance |
|
|
|
Key zero-shot improvements over the Gemma-2-2B baseline:
|
- HellaSwag-DE: +71% (47.9% vs 28.0%) |
|
- ARC-DE: +41% (32.3% vs 22.9%) |
|
- Average zero-shot: +40% (35.8% vs 25.5%) |
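
The percentages above are relative gains, that is, the score difference divided by the baseline score. A minimal sketch that reproduces them from the reported accuracies (the `relative_gain` helper is an illustrative name):

```python
def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100


# (BübleLM score, Gemma-2-2B score) as reported above
results = {
    "HellaSwag-DE": (47.9, 28.0),
    "ARC-DE": (32.3, 22.9),
    "Average zero-shot": (35.8, 25.5),
}

gains = {task: round(relative_gain(s, b)) for task, (s, b) in results.items()}
for task, gain in gains.items():
    print(f"{task}: +{gain}%")
```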
|
|
|
→ BübleLM-2B outperforms the base Gemma-2-2B on every benchmark and beats other German models such as LLäMmlein-1B on most tasks.
|
|
|
<table class="model-comparison"> |
|
<thead> |
|
<tr> |
|
<th align="left">Model</th> |
|
<th align="center" colspan="2">ARC-DE</th> |
|
<th align="center" colspan="2">HellaSwag-DE</th> |
|
<th align="center">TruthfulQA-DE</th> |
|
<th align="center">Average</th> |
|
</tr> |
|
<tr> |
|
<th></th> |
|
<th align="center">0-shot</th> |
|
<th align="center">3-shot</th> |
|
<th align="center">0-shot</th> |
|
<th align="center">3-shot</th> |
|
<th align="center">0-shot</th> |
|
<th align="center">0-shot</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td><a href="https://huggingface.co/google/gemma-2-2b" target="_blank">Gemma-2-2B</a></td> |
|
<td align="center">22.9</td> |
|
<td align="center">23.1</td> |
|
<td align="center">28.0</td> |
|
<td align="center">27.6</td> |
|
<td align="center">25.5</td> |
|
<td align="center">25.5</td> |
|
</tr> |
|
<tr> |
|
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_120M" target="_blank">LLäMmlein-120M</a></td> |
|
<td align="center">24.7 ↑+8%</td> |
|
<td align="center">-</td> |
|
<td align="center">32.0 ↑+14%</td> |
|
<td align="center">-</td> |
|
<td align="center">25.0 ↓-2%</td> |
|
<td align="center">27.2 ↑+7%</td> |
|
</tr> |
|
<tr> |
|
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_1B" target="_blank">LLäMmlein-1B</a></td> |
|
<td align="center">30.0 ↑+31%</td> |
|
<td align="center">-</td> |
|
<td align="center"><strong>48.5</strong> ↑+73%</td> |
|
<td align="center">-</td> |
|
<td align="center">23.4 ↓-8%</td> |
|
<td align="center">34.0 ↑+33%</td> |
|
</tr> |
|
<tr> |
|
<td><a href="https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-2b" target="_blank">Sauerkraut-Gemma-2B</a></td> |
|
<td align="center">28.0 ↑+22%</td> |
|
<td align="center">34.6 ↑+50%</td> |
|
<td align="center">37.2 ↑+33%</td> |
|
<td align="center">44.1 ↑+60%</td> |
|
<td align="center"><strong>32.9</strong> ↑+29%</td> |
|
<td align="center">32.7 ↑+28%</td> |
|
</tr> |
|
<tr> |
|
<td><strong>BübleLM (Ours)</strong></td> |
|
<td align="center"><strong>32.3</strong> ↑+41%</td> |
|
<td align="center"><strong>35.2</strong> ↑+52%</td> |
|
<td align="center">47.9 ↑+71%</td> |
|
<td align="center"><strong>46.6</strong> ↑+69%</td> |
|
<td align="center">27.2 ↑+7%</td> |
|
<td align="center"><strong>35.8</strong> ↑+40%</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
*Performance is evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values are accuracy in percent; arrows indicate the relative change versus the Gemma-2-2B baseline. Best results are shown in bold.*
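
The Average column appears to be the unweighted mean of the three 0-shot scores. That is an assumption, but it reproduces every reported average, as the following sketch checks:

```python
# 0-shot scores from the table: (ARC-DE, HellaSwag-DE, TruthfulQA-DE)
zero_shot = {
    "Gemma-2-2B": (22.9, 28.0, 25.5),
    "LLäMmlein-120M": (24.7, 32.0, 25.0),
    "LLäMmlein-1B": (30.0, 48.5, 23.4),
    "Sauerkraut-Gemma-2B": (28.0, 37.2, 32.9),
    "BübleLM": (32.3, 47.9, 27.2),
}

# Averages as printed in the table.
reported_average = {
    "Gemma-2-2B": 25.5,
    "LLäMmlein-120M": 27.2,
    "LLäMmlein-1B": 34.0,
    "Sauerkraut-Gemma-2B": 32.7,
    "BübleLM": 35.8,
}

computed_average = {
    model: round(sum(scores) / len(scores), 1) for model, scores in zero_shot.items()
}
for model, avg in computed_average.items():
    print(f"{model}: computed {avg}, reported {reported_average[model]}")
```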
|
|
|
## Safety & Ethics |
|
|
|
### Toxicity |
|
- Perplexity: 52.97 on German TextDetox dataset |
|
- Toxic content appears more out-of-distribution than it does for the baseline
|
|
|
### Gender Bias |
|
- Evaluated using perplexity differences between traditional and gender-inclusive forms |
|
- Slight preference for gender-inclusive language (not statistically significant) |
|
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61) |
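
Both evaluations rest on perplexity, the exponential of the mean negative log-likelihood per token. The sketch below shows the arithmetic behind a perplexity difference. It assumes ∆PPL is computed as PPL(inclusive) minus PPL(traditional), which matches the negative sign reported above but is not confirmed by the source; the log-probabilities are made-up values, not outputs of BübleLM.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up natural-log token probabilities for two variants of one sentence
# (illustrative only, not model outputs).
traditional = [-2.0, -3.0, -4.0]
inclusive = [-2.0, -3.0, -3.0, -2.0]

# Assumed sign convention: inclusive minus traditional.
delta_ppl = perplexity(inclusive) - perplexity(traditional)
print(f"delta PPL = {delta_ppl:.2f}")
# A negative delta means the gender-inclusive variant is less surprising to the model.
```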
|
|
|
|
|
## Usage |
|
|
|
**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates. |
|
|
|
Make sure the `sentencepiece` package is installed, since the tokenizer depends on it:

```bash
pip install sentencepiece
```

```python
from transformers import pipeline

# Load the model into a text-generation pipeline and complete a German prefix
pipe = pipeline("text-generation", model="flair/bueble-lm-2b")
pipe("Ich bin")
```
|
|
|
Or with the full model API:
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",           # place weights on the available device(s)
    torch_dtype=torch.bfloat16,  # bfloat16 weights to reduce memory use
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
# Move inputs to the model's device instead of hard-coding "cuda"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
|
|
|
For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets. |
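
Without fine-tuning, a base model is typically steered by framing the task as a text prefix to continue, for example with a few solved examples in front of the open question. A minimal sketch of such a few-shot prompt builder follows; the `build_few_shot_prompt` helper, the `Frage:`/`Antwort:` layout, and the example pairs are illustrative assumptions, not a format the model was trained on.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format solved examples plus an open query as one plain-text prefix."""
    blocks = [f"Frage: {q}\nAntwort: {a}" for q, a in examples]
    # Leave the final answer open so the model continues from here.
    blocks.append(f"Frage: {query}\nAntwort:")
    return "\n\n".join(blocks)


examples = [
    ("Was ist die Hauptstadt von Frankreich?", "Paris"),
    ("Was ist die Hauptstadt von Italien?", "Rom"),
]
prompt = build_few_shot_prompt(examples, "Was ist die Hauptstadt von Deutschland?")
print(prompt)
```

The resulting string can be passed to `pipe(prompt)` or tokenized for `model.generate` in the same way as the plain prefixes in the examples above.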
|
|
|
|
|
## Limitations |
|
|
|
- Limited vocabulary size (20k tokens) compared to multilingual models (256k for Gemma)
|
- Performance may vary on specialized domains not well-represented in training data |
|
- Higher fertility rate (1.78) due to smaller vocabulary size |
|
- Inherits base limitations from Gemma architecture |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{delobelle2024buble, |
|
title={BübleLM: A small German LM}, |
|
author={Delobelle, Pieter and Akbik, Alan and others}, |
|
year={2024} |
|
} |
|
``` |