---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---
# BübleLM
*A small German LM*
BübleLM is a German language model based on Gemma-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.
## Model Details
- **Architecture**: Based on Gemma-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch after this list)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
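As a rough illustration of the fertility metric, the sketch below counts tokens per whitespace-separated word for a sample sentence using the published tokenizer. The sample sentence and the whitespace-based word count are illustrative assumptions, not the evaluation setup behind the reported 1.78.

```python
from transformers import AutoTokenizer

# Illustrative fertility check: tokens per whitespace-separated word.
# The sample sentence is a made-up example, not the evaluation corpus.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

text = "Die Bundesregierung hat heute neue Maßnahmen zur Digitalisierung beschlossen."
words = text.split()
tokens = tokenizer.tokenize(text)

print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f} tokens per word")
```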
## Training Data
Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources
Data sampling weights (see the sketch after this list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
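The weights above can be read as interleaving probabilities proportional to 4:2:1 per document source. Below is a minimal sketch of that idea with the `datasets` library; the tiny in-memory datasets and the exact normalization are illustrative assumptions, not the actual training pipeline.

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the real corpora (illustrative only).
wiki  = Dataset.from_dict({"text": ["wiki doc"] * 100,  "source": ["wiki"] * 100})
news  = Dataset.from_dict({"text": ["news doc"] * 100,  "source": ["news"] * 100})
other = Dataset.from_dict({"text": ["other doc"] * 100, "source": ["other"] * 100})

# Up-sampling weights from the card: Wikipedia 4x, news/parliamentary 2x, other 1x.
weights = [4, 2, 1]
probabilities = [w / sum(weights) for w in weights]

mixed = interleave_datasets([wiki, news, other], probabilities=probabilities, seed=42)
print(mixed[0])
```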
## Performance
Key improvements over Gemma-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)
BübleLM outperforms both the base Gemma-2-2B and other German models such as LLaMmlein-1B on most tasks.
| Model | ARC-DE (0-shot) | ARC-DE (3-shot) | HellaSwag-DE (0-shot) | HellaSwag-DE (3-shot) | TruthfulQA-DE (0-shot) | Average (0-shot) |
|---|---|---|---|---|---|---|
| Gemma-2-2B | 22.9 | 23.1 | 28.0 | 27.6 | 25.5 | 25.5 |
| LLaMmlein-120M | 24.7 ↑+8% | - | 32.0 ↑+14% | - | 25.0 ↓-2% | 27.2 ↑+7% |
| LLaMmlein-1B | 30.0 ↑+31% | - | **48.5** ↑+73% | - | 23.4 ↓-8% | 34.0 ↑+33% |
| Sauerkraut-Gemma-2B | 28.0 ↑+22% | 34.6 ↑+50% | 37.2 ↑+33% | 44.1 ↑+60% | **32.9** ↑+29% | 32.7 ↑+28% |
| BübleLM (Ours) | **32.3** ↑+41% | **35.2** ↑+52% | 47.9 ↑+71% | **46.6** ↑+69% | 27.2 ↑+7% | **35.8** ↑+40% |
*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percentages, with arrows indicating relative improvement over Gemma-2B baseline. Best results shown in bold.*
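The percentages behind the arrows are relative (not absolute) improvements over the Gemma-2-2B scores, i.e. (score − baseline) / baseline. A quick check against the table:

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100

# HellaSwag-DE (0-shot): BübleLM 47.9 vs. Gemma-2-2B 28.0
print(f"{relative_improvement(47.9, 28.0):.0f}%")  # -> 71%
# ARC-DE (0-shot): 32.3 vs. 22.9
print(f"{relative_improvement(32.3, 22.9):.0f}%")  # -> 41%
```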
## Safety & Ethics
### Toxicity
- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution than for the baseline model
### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs. "Lehrer*innen" (∆PPL = -9.61); see the sketch below
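As a rough sketch of the perplexity-based probe, the snippet below compares full-sequence perplexity for a traditional and a gender-inclusive phrasing. The example sentences and the sign convention (∆PPL = inclusive − traditional, so negative values favor the inclusive form) are assumptions for illustration; they do not reproduce the exact evaluation behind the -9.61 figure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b", torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    # Full-sequence perplexity: exp of the mean next-token cross-entropy.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

traditional = "Die Lehrer treffen sich im Lehrerzimmer."       # hypothetical example sentence
inclusive = "Die Lehrer*innen treffen sich im Lehrerzimmer."   # hypothetical example sentence
delta_ppl = perplexity(inclusive) - perplexity(traditional)
print(f"∆PPL = {delta_ppl:.2f}")  # negative -> inclusive form is preferred
```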
## Usage
**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
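Alternatively, the standard `pipeline` API works for quick experiments; the sampling parameters below are illustrative defaults, not tuned recommendations.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="flair/bueble-lm-2b", device_map="auto")

out = generator(
    "Die Hauptstadt von Deutschland ist",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(out[0]["generated_text"])
```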
For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets.
## Limitations
- Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
- Performance may vary on specialized domains not well-represented in training data
- Higher fertility rate (1.78) due to smaller vocabulary size
- Inherits base limitations from Gemma architecture
## Citation
```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```