|
--- |
|
language: |
|
- de |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- deutsch |
|
- german |
|
- seedbox |
|
- mistral |
|
datasets: |
|
- seedboxai/multitask_german_examples_32k |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/645ded34a45b4182d7f5c385/oh7yRzqtRlDtdu8sJoAdV.jpeg) |
|
|
|
|
|
# KafkaLM-7B-German-V0.1 |
|
|
|
**KafkaLM 7B** is based on [leo-mistral-hessianai-7b](https://huggingface.co/LeoLM/leo-mistral-hessianai-7b), a Mistral 7B model further pre-trained on a large German dataset by Björn Plüster and LAION, and was fine-tuned on an ensemble of popular high-quality open-source instruction sets (translated from English to German).
|
|
|
KafkaLM 7b is a [Seedbox](https://huggingface.co/seedboxai) project trained by [Dennis Dickmann](https://huggingface.co/doubledsbv). |
|
|
|
**Why Kafka?** |
|
The models are proficient yet creative, and tend to push linguistic boundaries 😊
|
|
|
|
|
## Model Details |
|
|
|
The purpose of releasing the **KafkaLM series** is to contribute to the German AI community a set of fine-tuned LLMs that are easy to use in everyday applications across a variety of tasks.
|
|
|
The main goal was to provide LLMs proficient in German, especially to be used in German-speaking business contexts where English alone is not sufficient. |
|
|
|
|
|
### Dataset |
|
|
|
I used an 8k filtered version of the following dataset: [seedboxai/multitask_german_examples_32k](https://huggingface.co/datasets/seedboxai/multitask_german_examples_32k).
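The exact filtering procedure is not specified here; one plausible way to derive such a length-filtered subset looks roughly like the sketch below. The `train` split, the `text` column, and the interpretation of "8k" as a maximum token count per example are all assumptions for illustration and may differ from the actual dataset schema.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumptions: the dataset has a "train" split and a "text" column,
# and "8k" refers to a maximum token count per example.
ds = load_dataset("seedboxai/multitask_german_examples_32k", split="train")
tok = AutoTokenizer.from_pretrained("seedboxai/KafkaLM-7B-German-V0.1")

# Keep only examples that fit into an 8k-token context.
ds_8k = ds.filter(lambda ex: len(tok(ex["text"]).input_ids) <= 8192)
```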
|
|
|
### Prompt Format |
|
|
|
|
|
The model uses the following prompt format:
|
|
|
``` |
|
<|system|> |
|
Du bist ein freundlicher und hilfsbereiter KI-Assistent. Du beantwortest Fragen faktenorientiert und präzise, ohne dabei relevante Fakten auszulassen.</s> |
|
<|user|> |
|
Welche Möglichkeiten der energetischen Sanierung habe ich neben Solar und Energiespeicher?</s> |
|
<|assistant|> |
|
``` |
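Multi-turn conversations can presumably be encoded by chaining turns in the same way, each closed with `</s>` (an assumption extrapolated from the single-turn template above; verify against the tokenizer's chat template before relying on it):

```
<|system|>
Du bist ein freundlicher und hilfsbereiter KI-Assistent.</s>
<|user|>
(erste Frage)</s>
<|assistant|>
(erste Antwort)</s>
<|user|>
(Folgefrage)</s>
<|assistant|>
```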
|
|
|
### Inference |
|
|
|
Getting started with the model is straightforward:
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "seedboxai/KafkaLM-7B-German-V0.1"

# 4-bit loading keeps memory usage low; it requires a CUDA GPU and bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.unk_token
tokenizer.add_eos_token = False

device = "cuda" if torch.cuda.is_available() else "cpu"


def generate_prompt(user_input):
    """Wrap a user question in the KafkaLM prompt format."""
    sys_prompt = "Du bist ein freundlicher und hilfsbereiter KI-Assistent. Du beantwortest Fragen faktenorientiert und präzise, ohne dabei relevante Fakten auszulassen."

    prompt = f"<|system|>\n{sys_prompt.strip()}</s>\n"
    prompt += f"<|user|>\n{user_input.strip()}</s>\n"
    prompt += "<|assistant|>\n"
    return prompt.strip()


def evaluate(
    user_input,
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    num_beams=3,
    max_new_tokens=512,
    **kwargs,
):
    prompt = generate_prompt(user_input)

    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        no_repeat_ngram_size=3,
        do_sample=True,
        **kwargs,
    )

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )

    output = tokenizer.decode(generation_output.sequences[0])
    return output  # use output.split("<|assistant|>")[1].strip() to keep only the answer


print(evaluate("Wer ist eigentlich dieser Kafka?"))
```
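For a quick smoke test without the helper functions above, the generic `text-generation` pipeline also works. This is a minimal sketch reusing the model, tokenizer, and `generate_prompt` from the snippet above; the sampling parameters are illustrative, not tuned values.

```python
from transformers import pipeline

# Reuse the already loaded model and tokenizer from the snippet above.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = pipe(
    generate_prompt("Wer ist eigentlich dieser Kafka?"),
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(result[0]["generated_text"])
```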
|
|
|
## Disclaimer |
|
|
|
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model.

This model should be used only for research purposes. The original Mistral license and all restrictions of the datasets used to train this model apply.