---
license: llama3
language:
- gsw
datasets:
- cis-lmu/Glot500
- cis-lmu/GlotCC-V1
pipeline_tag: text-generation
base_model: NousResearch/Hermes-2-Pro-Llama-3-8B
model_type: LlamaForCausalLM
tags:
- Llama-3
- instruct
- finetune
- qlora
- chatml
- synthetic data
- axolotl
---

# Alpesteibock-Llama-3-8B-Alpha

**Alpesteibock-Llama-3-8B-Alpha** is an experimental QLoRA fine-tune of [NousResearch/Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B), trained for two epochs on 34.7 million tokens of Swiss German text drawn from multiple sources.

## License

This model is released under the [Llama 3 Community License](https://llama.meta.com/llama3/license/).

## Usage

The model uses ChatML as its instruction template and was trained with the system message "You are Alpesteibock, a helpful assistant who speaks Swiss German.":
```
<|im_start|>system
You are Alpesteibock, a helpful assistant who speaks Swiss German.<|im_end|>
<|im_start|>user
Hoi. Wie heissisch du?<|im_end|>
<|im_start|>assistant
Ich bi de Alpesteibock und ich freu mi uf di.<|im_end|>
```
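(In the example above, the user asks "Hi. What's your name?" and the assistant replies "I'm Alpesteibock and I'm looking forward to you.")

Below is a minimal inference sketch using the `transformers` library. The repository id is a placeholder, and the generation settings are illustrative defaults rather than recommended values:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Alpesteibock-Llama-3-8B-Alpha"  # placeholder: substitute the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build the ChatML prompt with the system message used during training.
prompt = (
    "<|im_start|>system\n"
    "You are Alpesteibock, a helpful assistant who speaks Swiss German.<|im_end|>\n"
    "<|im_start|>user\n"
    "Hoi. Wie heissisch du?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    # Stop at the ChatML end-of-turn token.
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```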

## Dataset

The dataset used for training consists of the following sources:

| Dataset | File Size | Description | Phase |
|---------|-----------|-------------|-------|
| [Glot500 Corpus](https://huggingface.co/datasets/cis-lmu/Glot500) (gsw_Latn, Leipzig_web) | 21.7 MB | Text, usually sentences, crawled from the web | 1 |
| [Alemannic Wikipedia](https://dumps.wikimedia.org/alswiki/) (Subset) | 50.5 MB | Articles from the Alemannic Wikipedia, with most of the articles written in Alsatian filtered out | 2 |
| [Schweizerdeutscher Mundartkorpus](https://chmk.ch/) (Copyright-Free Subset) | 28.4 MB | Copyright-free books written in Swiss German | 2 |
| [GlotCC-V1.0](https://huggingface.co/datasets/cis-lmu/GlotCC-V1) (gsw-Latn) | 7.5 MB | Document-level, general-domain monolingual dataset derived from CommonCrawl | 2 |
| Synthetic Instruction Data | 1.7 MB | Various datasets of synthetically generated Swiss German text | 2 |
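
The two Hugging Face-hosted sources can be loaded with the `datasets` library; a short sketch, assuming the language configs are named as listed in the table above:
```python
from datasets import load_dataset

# Config names follow the table above; verify them on each dataset's page.
glot500 = load_dataset("cis-lmu/Glot500", "gsw_Latn", split="train")
glotcc = load_dataset("cis-lmu/GlotCC-V1", "gsw-Latn", split="train")

print(glot500)
print(glotcc)
```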

## Training Details

Hardware: 1x RTX 4090  
Duration: 40 hours in total (2 hours for the first phase and 38 hours for the second)  

### Hyperparameters

Adapter: QLoRA  
Precision: 4-bit  
Optimizer: adamw_bnb_8bit  
LoRA Rank: 256  
LoRA Alpha: 256  
Learning Rate: 1e-5  
Scheduler: Cosine  
Context Length: 4096  
Batch Size: 1  
Gradient Accumulation Steps: 1  
Sample Packing: On for the first phase, Off for the second  
Epochs: 2
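
For reference, here is a hedged sketch of an equivalent QLoRA setup with `peft`, `transformers`, and `bitsandbytes` reflecting the hyperparameters above. The actual run used axolotl, so this is illustrative rather than the original configuration; the NF4 quantization type and `all-linear` target modules are assumptions:
```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

base = "NousResearch/Hermes-2-Pro-Llama-3-8B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Precision: 4-bit
    bnb_4bit_quant_type="nf4",              # assumption: NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=256,                        # LoRA Rank
    lora_alpha=256,               # LoRA Alpha
    target_modules="all-linear",  # assumption: adapt all linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="alpesteibock-qlora",
    learning_rate=1e-5,                 # Learning Rate
    lr_scheduler_type="cosine",         # Scheduler
    optim="adamw_bnb_8bit",             # Optimizer
    per_device_train_batch_size=1,      # Batch Size
    gradient_accumulation_steps=1,      # Gradient Accumulation Steps
    num_train_epochs=2,                 # Epochs
    bf16=True,
)
```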