---
language:
- de
tags:
- german
- causal-lm
- text-generation
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# BübleLM
[BübleLM logo]

*A small German LM*

BübleLM is a German language model based on Gemma-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

## Model Details

- **Architecture**: Based on the Gemma-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch below)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
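As a quick sanity check on the fertility figure, the tokenizer can be loaded on its own and measured on any German text. The sketch below approximates fertility as tokens per whitespace-separated word on a single illustrative sentence; the reported 1.78 was measured over a larger corpus and may use a slightly different word segmentation.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

# Illustrative sentence; any German text works.
text = "Die Bundesregierung hat heute neue Maßnahmen zur Digitalisierung beschlossen."
tokens = tokenizer.tokenize(text)
words = text.split()

print(tokens)
print(f"Fertility: {len(tokens) / len(words):.2f} tokens per whitespace-separated word")
```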
## Training Data

Trained on 3.5B tokens from the Occiglot-FineWeb project, including:

- Contemporary web content (OSCAR 2015-2023)
- Legislative documents (EurLex, ParlaMint)
- News data (Tagesschau)
- Wiki sources

Data sampling weights:

- Wikipedia: 4x
- News/Parliamentary: 2x
- Other sources: 1x

## Performance

[INSERT FIGURE: Performance comparison across models]

Key improvements over the Gemma-2B baseline:

- HellaSwag-DE: +71% (47.9% vs 28.0%)
- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

BübleLM consistently outperforms both the base Gemma-2B and other German models such as LLaMmlein-1B across most tasks.

| Model | ARC-DE (0-shot) | ARC-DE (3-shot) | HellaSwag-DE (0-shot) | HellaSwag-DE (3-shot) | TruthfulQA-DE (0-shot) | Average (0-shot) |
|---|---|---|---|---|---|---|
| Gemma-2-2B | 22.9 | 23.1 | 28.0 | 27.6 | 25.5 | 25.5 |
| LLaMmlein-120M | 24.7 ↑+8% | - | 32.0 ↑+14% | - | 25.0 ↓-2% | 27.2 ↑+7% |
| LLaMmlein-1B | 30.0 ↑+31% | - | **48.5** ↑+73% | - | 23.4 ↓-8% | 34.0 ↑+33% |
| Sauerkraut-Gemma-2B | 28.0 ↑+22% | 34.6 ↑+50% | 37.2 ↑+33% | 44.1 ↑+60% | **32.9** ↑+29% | 32.7 ↑+28% |
| BübleLM (Ours) | **32.3** ↑+41% | **35.2** ↑+52% | 47.9 ↑+71% | **46.6** ↑+69% | 27.2 ↑+7% | **35.8** ↑+40% |

*Performance evaluated on German versions of ARC (knowledge-based QA), HellaSwag (commonsense reasoning), and TruthfulQA (truthfulness). Values show accuracy in percent, with arrows indicating relative improvement over the Gemma-2B baseline. Best results shown in bold.*
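ARC-DE and HellaSwag-DE are multiple-choice benchmarks; a common way to obtain zero-shot accuracy from a base LM is to score each answer option by its log-likelihood under the model and take the highest-scoring option as the prediction. The snippet below is a minimal sketch of that scheme with a made-up question; it is not the exact evaluation harness behind the table above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained("flair/bueble-lm-2b", torch_dtype=torch.bfloat16)
model.eval()

# Made-up multiple-choice item for illustration.
question = "Frage: Warum schwimmt Eis auf Wasser? Antwort:"
options = [
    " Weil Eis eine geringere Dichte als flüssiges Wasser hat.",
    " Weil Eis schwerer als Wasser ist.",
    " Weil Wasser bei null Grad sofort verdampft.",
]

@torch.no_grad()
def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
    # Assumes the prompt tokenization is a prefix of the joint tokenization,
    # which typically holds for SentencePiece when the option starts with a space.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_len - 1:].sum().item()

scores = [option_logprob(question, o) for o in options]
print("Predicted answer:", options[scores.index(max(scores))].strip())
```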
## Safety & Ethics

### Toxicity

- Score of 52.97 on the German TextDetox dataset
- Toxic content appears more out-of-distribution than for the baseline model

### Gender Bias

- Evaluated using perplexity differences between traditional and gender-inclusive forms
- Slight preference for gender-inclusive language (not statistically significant)
- Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)

## Usage

**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

For instruction-tuning experiments or chat applications, we recommend first fine-tuning the model on appropriate German instruction datasets.

## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
- Performance may vary on specialized domains not well represented in the training data
- Higher fertility rate (1.78) due to the smaller vocabulary size
- Inherits base limitations from the Gemma architecture

## Citation

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```