---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---
# BübleLM
*A small German LM*
BübleLM is a German language model based on Gemma-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.
## Model Details
- **Architecture**: Based on Gemma-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch after this list)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
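As a rough illustration of the fertility metric, the sketch below counts tokens per whitespace-separated word for a sample sentence using the published tokenizer. The sample sentence and the whitespace-based word count are illustrative assumptions, not the evaluation setup behind the reported 1.78.

```python
from transformers import AutoTokenizer

# Illustrative fertility check: tokens per whitespace-separated word.
# The sample sentence is a made-up example, not the evaluation corpus.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

text = "Die Bundesregierung hat heute neue Maßnahmen zur Digitalisierung beschlossen."
words = text.split()
tokens = tokenizer.tokenize(text)

print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f} tokens per word")
```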
## Training Data
Trained on 3.5B tokens from the Occiglot-FineWeb project, including:
- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlamInt)
- News data (Tagesschau)
- Wiki sources
Data sampling weights (see the sketch after this list):
- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x
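The weights above can be read as interleaving probabilities proportional to 4:2:1 per document source. Below is a minimal sketch of that idea with the `datasets` library; the tiny in-memory datasets and the exact normalization are illustrative assumptions, not the actual training pipeline.

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the real corpora (illustrative only).
wiki  = Dataset.from_dict({"text": ["wiki doc"] * 100,  "source": ["wiki"] * 100})
news  = Dataset.from_dict({"text": ["news doc"] * 100,  "source": ["news"] * 100})
other = Dataset.from_dict({"text": ["other doc"] * 100, "source": ["other"] * 100})

# Up-sampling weights from the card: Wikipedia 4x, news/parliamentary 2x, other 1x.
weights = [4, 2, 1]
probabilities = [w / sum(weights) for w in weights]

mixed = interleave_datasets([wiki, news, other], probabilities=probabilities, seed=42)
print(mixed[0])
```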
## Performance
Key improvements over Gemma-2B baseline:
- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)
BübleLM outperforms both the base Gemma-2-2B and other German models such as LLaMmlein-1B on most tasks.
| Model | ARC-DE (0-shot) | ARC-DE (3-shot) | HellaSwag-DE (0-shot) | HellaSwag-DE (3-shot) | TruthfulQA-DE (0-shot) | Average (0-shot) |
|---|---|---|---|---|---|---|
| Gemma-2-2B | 22.9 | 23.1 | 28.0 | 27.6 | 25.5 | 25.5 |
| LLaMmlein-120M | 24.7 ↑+8% | - | 32.0 ↑+14% | - | 25.0 ↓-2% | 27.2 ↑+7% |
| LLaMmlein-1B | 30.0 ↑+31% | - | **48.5** ↑+73% | - | 23.4 ↓-8% | 34.0 ↑+33% |
| Sauerkraut-Gemma-2B | 28.0 ↑+22% | 34.6 ↑+50% | 37.2 ↑+33% | 44.1 ↑+60% | **32.9** ↑+29% | 32.7 ↑+28% |
| BübleLM (Ours) | **32.3** ↑+41% | **35.2** ↑+52% | 47.9 ↑+71% | **46.6** ↑+69% | 27.2 ↑+7% | **35.8** ↑+40% |
*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percentages, with arrows indicating relative improvement over Gemma-2B baseline. Best results shown in bold.*
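The percentages behind the arrows are relative (not absolute) improvements over the Gemma-2-2B scores, i.e. (score − baseline) / baseline. A quick check against the table:

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100

# HellaSwag-DE (0-shot): BübleLM 47.9 vs. Gemma-2-2B 28.0
print(f"{relative_improvement(47.9, 28.0):.0f}%")  # -> 71%
# ARC-DE (0-shot): 32.3 vs. 22.9
print(f"{relative_improvement(32.3, 22.9):.0f}%")  # -> 41%
```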
## Safety & Ethics
### Toxicity
- Score: 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution than for the baseline model
### Gender Bias
- Evaluated using perplexity differences between traditional and gender-inclusive forms
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs. "Lehrer*innen" (∆PPL = -9.61); see the sketch below
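As a rough sketch of the perplexity-based probe, the snippet below compares full-sequence perplexity for a traditional and a gender-inclusive phrasing. The example sentences and the sign convention (∆PPL = inclusive − traditional, so negative values favor the inclusive form) are assumptions for illustration; they do not reproduce the exact evaluation behind the -9.61 figure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b", torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    # Full-sequence perplexity: exp of the mean next-token cross-entropy.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

traditional = "Die Lehrer treffen sich im Lehrerzimmer."       # hypothetical example sentence
inclusive = "Die Lehrer*innen treffen sich im Lehrerzimmer."   # hypothetical example sentence
delta_ppl = perplexity(inclusive) - perplexity(traditional)
print(f"∆PPL = {delta_ppl:.2f}")  # negative -> inclusive form is preferred
```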
## Usage
**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
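Alternatively, the standard `pipeline` API works for quick experiments; the sampling parameters below are illustrative defaults, not tuned recommendations.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="flair/bueble-lm-2b", device_map="auto")

out = generator(
    "Die Hauptstadt von Deutschland ist",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(out[0]["generated_text"])
```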
For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets.
## Limitations
- Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
- Performance may vary on specialized domains not well-represented in training data
- Higher fertility rate (1.78) due to smaller vocabulary size
- Inherits base limitations from Gemma architecture
## Citation
```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```