---
base_model:
- avemio/German-RAG-PHI-3.5-MINI-4B-SFT-HESSIAN-AI
- avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI
base_model_relation: merge
language:
- en
- de
library_name: transformers
pipeline_tag: question-answering
tags:
- German
- RAG
- Retrieval
- Question-Answering
- Summarization
- Reasoning
license: mit
---

# German-RAG-PHI-3.5-MINI-4B-MERGED-HESSIAN-AI

<!-- Provide a quick summary of what the model is/does. -->

**German-RAG** (**G**erman **R**etrieval **A**ugmented **G**eneration) models are designed for the German-speaking market. They aim to enable innovation and AI solutions that drive German research collaboration in business-focused generative AI by 2025.


## Model Details

The core models released in this batch are the following:

| Model | Training Tokens |
|------|--------|
| [German-RAG-PHI-CPT](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-CPT-HESSIAN-AI) | 507.47 million |
| [German-RAG-PHI-SFT](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-SFT-HESSIAN-AI) | 2.03 billion |
| [German-RAG-PHI-ORPO](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI) | 2.0577 billion |

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Avemio AI Team
- **Supported by:** Hessian AI
- **Model type:** Transformer-style autoregressive language model
- **Language(s) (NLP):** German, English
- **License:** The code and model are released under MIT.
- **Contact:** [German-RAG@avemio.digital](mailto:German-RAG@avemio.digital)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Training Study:** [Training Study](https://avemio.digital/wp-content/uploads/2025/01/German-RAG-TRAINING-STUDY-Advancing-German-Language-AI-with-hessian-AI.pdf)
- **Repositories:** 
    - Training: [Colab-Notebook](https://colab.research.google.com/drive/18SH_aYLCnw1K7cRGOTTZ80y98V5Kquxb?usp=sharing)
    - Evaluation code: 
        - [German-RAG-LLM-HARD-BENCHMARK](https://github.com/avemio-digital/German-RAG-LLM-HARD-BENCHMARK.git)
        - [German-RAG-LLM-EASY-BENCHMARK](https://github.com/avemio-digital/German-RAG-LLM-EASY-BENCHMARK.git)
-  **Technical blog post:**
<!-- - **Press release:** TODO -->

## Merge Details
### Merge Method

This model was merged using the SLERP merge method.

### Models Merged

The following models were included in the merge:
* [avemio/German-RAG-PHI-3.5-MINI-4B-SFT-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-SFT-HESSIAN-AI)
* [avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI)

### Configuration

The following YAML configuration was used to produce this model:

```yaml
slices:
  - sources:
      - model: avemio/German-RAG-PHI-3.5-MINI-4B-SFT-HESSIAN-AI
        layer_range: [0, 32]
      - model: avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI
        layer_range: [0, 32]
merge_method: slerp
base_model: avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
```
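As a rough illustration of what the configuration above asks for, the following is a minimal NumPy sketch of spherical linear interpolation (SLERP) between two weight tensors, plus a helper that spreads a list of `t` anchor values across layer depth. This is not the merge tool's actual implementation: the function names `slerp` and `layer_t` are hypothetical, and the assumption that anchor values are interpolated linearly and evenly over the 32 layers is ours. In the usual reading of such a config, `t = 0` keeps the `base_model` weights (here ORPO) and `t = 1` takes the other model (here SFT).

```python
import numpy as np


def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between two weight tensors.

    t = 0 returns v0, t = 1 returns v1. Falls back to plain linear
    interpolation when the tensors are (almost) collinear.
    """
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)

    # Angle between the two tensors, measured on normalized copies.
    n0 = v0 / (np.linalg.norm(v0) + eps)
    n1 = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.sum(n0 * n1), -1.0, 1.0)

    if abs(dot) > 1.0 - 1e-6:
        # Nearly parallel: SLERP is numerically unstable, use LERP.
        return (1.0 - t) * v0 + t * v1

    theta = np.arccos(dot)
    sin_theta = np.sin(theta)
    w0 = np.sin((1.0 - t) * theta) / sin_theta
    w1 = np.sin(t * theta) / sin_theta
    return w0 * v0 + w1 * v1


def layer_t(anchors, layer_idx, num_layers):
    """Interpolate a list of anchor values across layer depth.

    Assumption: anchors are spaced evenly from the first to the last
    layer and interpolated linearly in between.
    """
    position = layer_idx / (num_layers - 1)
    xs = np.linspace(0.0, 1.0, len(anchors))
    return float(np.interp(position, xs, anchors))


# Anchor lists copied from the YAML configuration above.
SELF_ATTN_T = [0, 0.5, 0.3, 0.7, 1]
MLP_T = [1, 0.5, 0.7, 0.3, 0]

if __name__ == "__main__":
    for idx in (0, 8, 16, 24, 31):
        print(idx, layer_t(SELF_ATTN_T, idx, 32), layer_t(MLP_T, idx, 32))
```

Under these assumptions, the self-attention weights lean toward the base model in early layers and toward the other model in late layers, the MLP weights follow the opposite schedule, and all remaining tensors use the fixed `t = 0.5`.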

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Inference
To run inference, install `transformers` (the example below also needs `torch`, and `accelerate` for `device_map="auto"`), then proceed as usual with Hugging Face:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
 
model_name = "avemio/German-RAG-PHI-3.5-MINI-4B-MERGED-HESSIAN-AI"
 
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
im_end_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')
im_start_token_id = tokenizer.convert_tokens_to_ids('<|im_start|>')
 
messages = [
    {"role": "system", "content": "Folge den Anweisungen des Benutzers. Bevor du deine finale Antwort gibst, schildere deine Überlegungen zur Lösung des Problems."},
    {"role": "user", "content": "Ferdinand steht vor der Herausforderung, eine faire Besuchsregelung für seine drei Kinder zu finden, die den Bedürfnissen jedes einzelnen Kindes gerecht wird. Jedes Kind hat unterschiedliche Vorlieben und Bedürfnisse, die in den Besuchsplan integriert werden müssen. Er muss sicherstellen, dass die Regelung sowohl den Interessen der Kinder als auch den rechtlichen Vorgaben entspricht. Ferdinand hat eine Woche Zeit, um einen Vorschlag zu erarbeiten, den er mit seinem Anwalt besprechen kann."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
 
# Greedy decoding: with do_sample=False the sampling settings
# (temperature, top_k, top_p) are ignored, so they are omitted here.
generated_ids = model.generate(
    **model_inputs,
    max_length=2024,  # upper bound on prompt + generated tokens
    do_sample=False,
    eos_token_id=im_end_token_id,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    num_return_sequences=1,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
 
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```


### Fine-tuning
We provide a Google Colab notebook that walks through fine-tuning the model, with detailed instructions, the required dependencies, and configurable settings: [Colab-Notebook](https://colab.research.google.com/drive/18SH_aYLCnw1K7cRGOTTZ80y98V5Kquxb?usp=sharing).

## German-RAG-LLM-EASY-BENCHMARK EVAL

<!-- This section describes the evaluation protocols and provides the results. -->
The evaluation was performed using seven subsets, focusing on extraction recall, question answering (QA) with multiple references, and time difference reasoning. Relevant context and summarization were treated as distinct subsets, each playing a crucial role in the evaluation process. For relevant context, the model's ability to identify and extract pertinent information from the source material was assessed. In contrast, the summarization subset evaluated the model's capability to generate concise and accurate summaries based on the relevant context.

Four evaluation metrics were employed across all subsets: language quality, overall correctness, instruction following, and an overall score.

-   **Language quality:** This metric focused on the overall linguistic quality of the outputs, considering factors such as grammar, fluency, and clarity.
-   **Overall correctness:** The accuracy and correctness of the content were evaluated under this metric.
-   **Instruction following:** This metric assessed the model's ability to follow specific instructions provided for each task.
-   **Overall score:** This metric combined the results from the previous three metrics, offering a comprehensive evaluation of the model's capabilities across all subsets.


| Metric                                    | [Vanilla-Phi-3.5-Mini-4B](https://huggingface.co/ThomasComics/Phi-3-mini-128k-instruct-LLaMAfied) | [German-RAG-PHI-SFT](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-SFT-HESSIAN-AI) | [German-RAG-PHI-ORPO](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-ORPO-HESSIAN-AI) | **German-RAG-PHI-MERGED** | *GPT-3.5-TURBO* | 
|------------------------------------------|---------------------------------------------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|-----------------------------|----------------|
| Average Language Quality             | 75.11                                                                          | 78.88                                                                          | 78.13                                                                                          |**85.41**                             |*91.86*                |
| **OVERALL SCORES (weighted):**       |                                                                            |                                                                            |                                                                                           |                             |               |
| extraction_recall       | 18.0                                                                           | 37.5                                                                           | 32.0                                                                                          |**61.8**                             |*87.2*                |
| qa_multiple_references | 65.8                                                                           | 70.6                                                                           | 74.8                                                                                          |**84.8**                             |*77.2*                |
| qa_without_time_difference | 71.2                                                                           | 88.0                                                                           | 87.3                                                                                          |**88.0**                             |*83.1*                |
| qa_with_time_difference | 64.6                                                                           | 89.3                                                                           | 86.9                                                                                          |**89.1**                             |*83.2*                |
| relevant_context       | 72.3                                                                           | 72.8                                                                           | 69.1                                                                                          |**84.4**                             |*89.5*                |
| summarizations         | 74.6                                                                           | 83.2                                                                           | 81.1                                                                                          |**84.9**                             |*86.9*                |
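To make the table easier to read at a glance, the short sketch below computes the per-subset gain of German-RAG-PHI-MERGED over the vanilla model. The numbers are copied from the table above; the variable names are illustrative only and not part of the benchmark code.

```python
# Weighted overall scores from the table above: (vanilla, merged).
scores = {
    "extraction_recall": (18.0, 61.8),
    "qa_multiple_references": (65.8, 84.8),
    "qa_without_time_difference": (71.2, 88.0),
    "qa_with_time_difference": (64.6, 89.1),
    "relevant_context": (72.3, 84.4),
    "summarizations": (74.6, 84.9),
}

# Absolute gain in points of the merged model over the vanilla model.
gains = {
    name: round(merged - vanilla, 1)
    for name, (vanilla, merged) in scores.items()
}

# Print subsets sorted from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {gain:+.1f}")
```

The largest gain is on `extraction_recall` and the smallest on `summarizations`, where the vanilla model already scored comparatively well.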

## German-RAG-LLM-HARD-BENCHMARK EVAL

<img src="https://avemio.digital/wp-content/uploads/2025/01/German-RAG-PHI-Merged.png" alt="German-RAG Logo" width="700" style="margin-left:auto; margin-right:auto; display:block"/>

| Metric                  | [Vanilla-PHI-4B-Instruct](https://huggingface.co/ThomasComics/Phi-3-mini-128k-instruct-LLaMAfied) | **[German-RAG-PHI-Merged](https://huggingface.co/avemio/German-RAG-PHI-3.5-MINI-4B-MERGED-HESSIAN-AI)**  | GPT-3.5-TURBO | GPT-4o | GPT-4o-mini |
|-------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|----------------|---------|-------------|
| **OVERALL SCORES (weighted):** |                                                                                 |                                                                                       |                |         |             |
| hard_reasoning_de       | 42.8                                                                                    | **41.8**                                                                                   | 37.9           | 62.9    | 58.4        |
| hard_reasoning_en       | 50.8                                                                                    | **55.9**                                                                                   | 48.3           | 61.7    | 62.9        |

### Architecture


| Parameter            | German-RAG-PHI-MERGED                                                                                  |
|-----------------------|-----------------------------------------------------------------------------------------------|
| **d_model**          | 3072                                                                                          |
| **num heads**        | 32                                                                                            |
| **num layers**       | 32                                                                                            |
| **MLP ratio**        | 2.66                                                                                          |
| **LayerNorm type**   | RMSNorm                                                                                       |
| **pos embeddings**   | RoPE                                                                                          |
| **attention variant**| Standard Multi-Head Self Attention with sliding-window of 2047                                |
| **biases**           | none                                                                                          |
| **block type**       | sequential                                                                                    |
| **activation**       | SiLU                                                                                          |
| **sequence length**  | 131072                                                                                        |
| **weight dtype**     | bfloat16                                                                                      |



## Bias, Risks, and Limitations

Like any base language model or fine-tuned model without safety filtering, it is relatively easy for a user to prompt these models to generate harmful and generally sensitive content.
Such content can also be produced unintentionally, especially in the case of bias, so we recommend users consider the risks of applications of this technology.

In addition, statements generated by German-RAG-PHI-MERGED, as with any LLM, are often inaccurate and should be verified.


## The German-RAG AI Team
- [Marcel Rosiak](https://de.linkedin.com/in/marcel-rosiak)
- [Soumya Paul](https://de.linkedin.com/in/soumya-paul-1636a68a)
- [Siavash Mollaebrahim](https://de.linkedin.com/in/siavash-mollaebrahim-4084b5153?trk=people-guest_people_search-card)
- [Zain ul Haq](https://de.linkedin.com/in/zain-ul-haq-31ba35196)