---
language:
- nl
license: llama2
---
<p align="center" style="margin:0;padding:0">
<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>
## Model Details
ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art among Dutch open LLMs in their weight class.
We provide 6 variants (3 base models and 3 instruction-tuned models):
- **ChocoLlama-2-7B-base**: A language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB (XXX tokens) using LoRA.
- **ChocoLlama-2-7B-instruct**: An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base**: A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct**: An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base**: A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-8B-instruct**: An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
As far as we are aware, Llama-3-ChocoLlama-8B-instruct sets a new state-of-the-art for Dutch open models in its weight class.
### Model Description
- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA H100-80GB)
- **Language(s):** Dutch
- **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
- **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
### Model Sources
- **Repository:** Will be released soon.
- **Paper:** Will be released soon.
## Uses
### Direct Use
Since this is a base model, we do not recommend using it for your use-cases directly. We instead recommend:
1. Fine-tuning this model to your specific use-case
2. Leveraging the instruction-tuned version of this model
### Downstream Use
Since this model is a base model, it can easily be adapted to specific use-cases that require Dutch language understanding and generation. We expect this model to be particularly useful for use-cases in the domains that were explicitly covered in our dataset, e.g. the analysis and/or generation of:
- Dutch job descriptions
- Dutch corporate filings
- Dutch legislation
### Out-of-Scope Use
- Use-cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead.
- Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, so we expect significant catastrophic forgetting to have occurred for English, the language Llama-2 was originally trained on.
## Bias, Risks, and Limitations
We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
### Recommendations
We recommend fine-tuning this model on your own curated data to avoid undesirable outputs as much as possible.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download (or load from the local cache) the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
```
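Because this is a base model, the most direct way to try it is plain text completion. The helper below is a minimal sketch, assuming `model` and `tokenizer` were loaded as shown above; the function name `complete_text`, the default of 50 new tokens, and greedy decoding are illustrative choices, not settings recommended by the authors.

```python
def complete_text(prompt, model, tokenizer, max_new_tokens=50):
    """Continue `prompt` with greedy decoding (hypothetical helper)."""
    # Tokenize the prompt and move the tensors to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding keeps the output deterministic; sampling is also possible.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # The returned sequence contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `complete_text("Brussel is de hoofdstad van", model, tokenizer)` would return the prompt followed by the model's continuation.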
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
### Training Procedure
This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 4% trainable parameters.
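The 4% figure can be sanity-checked with some arithmetic. The sketch below uses the public Llama-2-7B dimensions and the rank of 8 reported in the hyperparameters of this card. It assumes that `embed_tokens` and `lm_head` are trained in full rather than through adapters, which this card does not state explicitly; the helper name `lora_params` is illustrative.

```python
# Llama-2-7B architecture dimensions.
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000
RANK = 8  # LoRA rank reported in this card


def lora_params(d_in, d_out, r=RANK):
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Adapters on q_proj, k_proj, v_proj and o_proj (all HIDDEN x HIDDEN).
attention = 4 * lora_params(HIDDEN, HIDDEN)
# Adapters on gate_proj and up_proj (HIDDEN -> INTERMEDIATE) and down_proj.
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE) + lora_params(INTERMEDIATE, HIDDEN)
lora_total = LAYERS * (attention + mlp)

# Assumption: the input embeddings and the output head are fully trainable.
embeddings = 2 * VOCAB * HIDDEN

# Base model size: embeddings, per-layer weights plus two norms, final norm.
per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE + 2 * HIDDEN
base = embeddings + LAYERS * per_layer + HIDDEN

trainable = lora_total + embeddings
fraction = trainable / (base + lora_total)
print(f"LoRA adapter parameters: {lora_total:,}")
print(f"Trainable parameters:    {trainable:,}")
print(f"Trainable fraction:      {fraction:.1%}")  # roughly 4%
```

Under these assumptions the embedding and output matrices account for most of the trainable parameters, while the adapters themselves contribute only about 20 million.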
#### Training Hyperparameters
- **Training regime:** bf16 non-mixed precision
- **Epochs:** 1
- **LoRA parameters:**
  - R: 8
  - Alpha: 32
  - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
  - LoRA dropout: 0.05
- **Learning Rate:**
  - Scheduler: StepLR
  - Step size: 6212
  - Learning rate: 0.0003
  - Gamma: 0.85
- **Other parameters:**
  - Minibatch size: 16
  - Gradient accumulation steps: 8
  - Parallelization factor: 8
  - Weight decay: 0
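To make the schedule concrete, the sketch below reproduces the StepLR rule (multiply the learning rate by gamma every `step_size` scheduler steps) with the values listed above. It also derives an effective batch size, assuming the minibatch size is per device and the parallelization factor counts data-parallel workers; neither assumption is confirmed by this card, and the function name `step_lr` is illustrative.

```python
def step_lr(step, base_lr=3e-4, step_size=6212, gamma=0.85):
    """Learning rate after `step` scheduler steps under StepLR-style decay."""
    return base_lr * gamma ** (step // step_size)


# Assumption: 16 sequences per device, 8 accumulation steps, 8 parallel workers.
minibatch_size = 16
grad_accum_steps = 8
parallel_factor = 8
effective_batch = minibatch_size * grad_accum_steps * parallel_factor  # 1024

# The rate stays constant within each window of 6212 steps, then drops by 15%.
for step in (0, 6211, 6212, 12424):
    print(f"step {step:>6}: lr = {step_lr(step):.3e}")
print(f"effective batch size: {effective_batch} sequences per update")
```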
## Evaluation
### Quantitative evaluation
We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
| **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
| llama-3-8B                                   | 0.44           | 0.64           | 0.47           | 0.45           | 0.50           |
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
| **ChocoLlama-2-7B-instruct**                 | **0.36**       | **0.57**       | **0.33**       | **0.45**       | **0.43**       |
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
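The Avg. column appears to be the unweighted mean of the four benchmark scores. The sketch below recomputes it for the ChocoLlama rows of the table, as a quick consistency check; because the per-benchmark scores are already rounded to two decimals, the recomputed mean can differ from the reported value by up to half a point in the last digit.

```python
# (ARC, HellaSwag, MMLU, TruthfulQA) scores and the reported average,
# copied from the ChocoLlama rows of the table above.
rows = {
    "Llama-3-ChocoLlama-instruct": ([0.48, 0.66, 0.49, 0.49], 0.53),
    "Llama-3-ChocoLlama-base": ([0.45, 0.64, 0.44, 0.44], 0.49),
    "ChocoLlama-2-7B-tokentrans-instruct": ([0.45, 0.62, 0.34, 0.42], 0.46),
    "ChocoLlama-2-7B-tokentrans-base": ([0.42, 0.61, 0.32, 0.43], 0.45),
    "ChocoLlama-2-7B-instruct": ([0.36, 0.57, 0.33, 0.45], 0.43),
    "ChocoLlama-2-7B-base": ([0.35, 0.56, 0.31, 0.43], 0.41),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


for name, (scores, reported) in rows.items():
    print(f"{name}: mean = {mean(scores):.4f}, reported = {reported:.2f}")
```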
### Qualitative evaluation
### Compute Infrastructure
All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPUs with 80 GB of VRAM.