---
language:
- nl
license: llama3
---

<p align="center" style="margin:0;padding:0">
<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>

## Llama-3-ChocoLlama-8B-base: Getting Started

Here we present **Llama-3-ChocoLlama-8B-base**, a language-adapted version of Meta's Llama-3-8B, fine-tuned on 17B Dutch Llama-3 tokens (104GB) using LoRA.
Note that this is a base model and is not optimized for conversational behavior.
If that is what you need, we recommend fine-tuning this model on your own Dutch data or using the instruction-tuned version, [Llama-3-ChocoLlama-instruct](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-instruct).

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-base')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-base')
```
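
As a quick sanity check, you can then sample a short Dutch continuation with the standard Hugging Face `generate` API; the prompt and sampling settings below are only illustrative.

```python
import torch

prompt = "Gent is een stad in"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a short continuation; adjust max_new_tokens and the sampling
# parameters to your needs.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,
        temperature=0.7,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```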

## Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, advancing the state of the art for Dutch open LLMs in their weight class.

We provide six variants (three base models and three instruction-tuned models):
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper [here](some_url).

### Model Description

- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approximately 40K GPU hours (NVIDIA A100-80GB)
- **Language(s):** Dutch
- **License:** [Llama-3 Community License](https://www.llama.com/llama3/license/)
- **Finetuned from model:** [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B)

### Model Sources

- **Repository:** Will be released soon.
- **Paper:** Will be released soon.

## Uses

### Direct Use

Since this is a base model, we do not recommend using it directly for your use case. Instead, we recommend:
1. Fine-tuning this model for your specific use case, or
2. Using the instruction-tuned version of this model.

### Downstream Use

Since this model is a base model, it can easily be adapted to specific use cases that require Dutch language understanding and generation. 
We expect this model to be particularly useful in the domains explicitly covered by our dataset, e.g. the analysis and/or generation of Dutch job descriptions, corporate filings and legislation; a minimal adaptation sketch is shown below.
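
As a rough illustration of such adaptation (not the authors' exact recipe; the dataset path and hyperparameters below are placeholders), a parameter-efficient fine-tune on your own Dutch corpus could look like this:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "ChocoLlama/Llama-3-ChocoLlama-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder: a plain-text file containing your own Dutch domain data.
dataset = load_dataset("text", data_files={"train": "my_dutch_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="chocollama-domain-adapted",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```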

### Out-of-Scope Use

- Use-cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead.
- Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occurred for English, the language on which Llama-3 was predominantly trained.

## Bias, Risks, and Limitations

We have taken care to include only widely used, high-quality data in our dataset, some of which has been filtered by its original creators.
However, we did not conduct any additional filtering of this dataset for biased or otherwise harmful content.

### Recommendations

We recommend fine-tuning this model on your own curated data to minimize the risk of undesirable outputs.

## Training Details

### Training Data

We collected a diverse set of Dutch natural language data from the following sources:

1. **OSCAR**  
   The bulk of our data comes from the Dutch portion of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. This dataset includes **93 GB** of text (~28.6B tokens).

2. **Open Subtitles**  
   We collected Dutch text from movie subtitles, focusing on unique movies either in Dutch or with Dutch subtitles. This dataset contains **5 GB** of text (~1.54B tokens) from **214k samples**.

3. **Project Gutenberg**  
   We downloaded **970 full Dutch books** from [Project Gutenberg](https://www.gutenberg.org) using a public scraper. The dataset includes **0.3 GB** of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch); see the loading snippet after this list.

4. **Wikipedia**  
   Using the March 2023 [Wikipedia dump](https://dumps.wikimedia.org), we included **2.5 GB** of text (~769M tokens). Despite some duplication with OSCAR, Wikipedia's high quality justifies its inclusion.

5. **Job Descriptions (TechWolf)**  
   A sample of **750k Dutch job descriptions** collected over five years from public websites, provided by TechWolf. This dataset contains **1.5 GB** of text (~462M tokens).

6. **Staatsblad (Bizzy)**  
   A sample of **80k legal filings** from [Het Belgisch Staatsblad](https://www.ejustice.just.fgov.be/cgi/welcome.pl). Documents were OCR-processed, and personal data was excluded. This dataset includes **1.4 GB** of text (~431M tokens), collected with help from Bizzy.

7. **Legislation (ML6)**  
   **15k documents** from Flemish legislation accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset contains **0.2 GB** of text (~62M tokens), collected with support from ML6.
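
The Project Gutenberg subset referenced above is published on the Hugging Face Hub; a minimal way to inspect it (assuming a default `train` split) is:

```python
from datasets import load_dataset

# Dutch Gutenberg books released alongside ChocoLlama; the split name is an assumption.
gutenberg_nl = load_dataset("ChocoLlama/gutenberg-dutch", split="train")
print(gutenberg_nl[0])
```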

### Training Procedure

This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 1.07B trainable parameters.

#### Training Hyperparameters

- **Training regime:** bf16 non-mixed precision
- **Epochs:** 1
- **LoRa parameters:**
    - R: 8
    - Alpha: 32
    - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
    - LoRa dropout: 0.05
- **Learning Rate:**
    - Scheduler: StepLR
    - Step size: 6212
    - Learning rate: 0.0003
    - Gamma: 0.85
- **Other parameters:**
    - Minibatch size: 16
    - Gradient accumulation steps: 8
    - Parallelization factor: 8
    - Weight decay: 0
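
For reference, the LoRA and learning-rate settings above map roughly onto the following `peft`/PyTorch configuration. This is an illustrative reconstruction, not the exact training script; in particular, the optimizer choice and the use of `modules_to_save` for the embeddings are assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# LoRA settings as listed above; embed_tokens and lm_head are kept fully
# trainable via modules_to_save (an assumption on our part).
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Learning-rate schedule as listed above: StepLR with step size 6212 and gamma 0.85.
# The optimizer itself is not specified in the card; AdamW is assumed here.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6212, gamma=0.85)
```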

## Evaluation

### Quantitative evaluation

We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.

| Model                                        | ARC            | HellaSwag      | MMLU           | TruthfulQA     | Avg.           |
|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
| **Llama-3-ChocoLlama-instruct**        | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch                           | 0.44           | 0.64           | 0.46           | 0.48           | 0.51           |
| llama-3-8B-instruct                          | 0.47           | 0.59           | 0.47           | 0.52           | 0.51           |
| llama-3-8B                                   | 0.44           | 0.64           | 0.47           | 0.45           | 0.5            |
| Reynaerde-7B-Chat                            | 0.44           | 0.62           | 0.39           | 0.52           | 0.49           |
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta                               | 0.43           | 0.58           | 0.43           | 0.53           | 0.49           |
| geitje-7b-ultra                              | 0.40           | 0.66           | 0.36           | 0.49           | 0.48           |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1                              | 0.43           | 0.58           | 0.37           | 0.45           | 0.46           |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf                           | 0.36           | 0.49           | 0.33           | 0.44           | 0.41           |
| llama-2-7b-hf                                | 0.36           | 0.51           | 0.32           | 0.41           | 0.40           |

On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.

### Qualitative evaluation

In our paper, we also provide a qualitative evaluation of all models, which we empirically find more reliable. 
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench). 

### Compute Infrastructure

All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.