---
license: cc-by-nc-4.0
library_name: peft
tags:
- alignment-handbook
- generated_from_trainer
- trl
- sft
- geitje
- fingeitje
- dutch
- nl
- finance
base_model: BramVanroy/GEITje-7B-ultra
datasets:
- snoels/FinGEITje-sft
model-index:
- name: snoels/FinGEITje-7B-sft
results: []
language:
- nl
pipeline_tag: text-generation
inference: false
---
<p align="center" style="margin:0;padding:0">
<img src="https://huggingface.co/snoels/FinGEITje-7B-sft/resolve/main/fingeitje-banner.png" alt="FinGEITje Banner" width="1000"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0; font-size: 2em;">FinGEITje 7B</h1>
<em style="font-size: 1em;">A large open Dutch financial language model.</em>
</div>
This model is a fine-tuned version of [BramVanroy/GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset.
## Model Description
FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance.
## Training and Evaluation Data
### Training Data
FinGEITje 7B was fine-tuned on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset, which consists of translated and processed Dutch financial texts. This dataset includes a wide range of financial topics and instruction tuning data.
#### Data Processing Steps
1. **Translation**: Original instruction tuning datasets were translated into Dutch using a specialized translation service to maintain the integrity of financial terminology.
2. **Post-processing**: The translated data underwent post-processing to correct any translation inconsistencies and to format it according to the original dataset structure.
3. **Formatting**: The data was formatted to match the style and requirements of instruction tuning datasets, ensuring compatibility with the fine-tuning process.
4. **Filtering**: A Dutch language check and predefined validation checks were applied to filter out low-quality or irrelevant examples (a minimal sketch of such a language check follows this list).
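The exact filtering code lives in the project repository; purely as an illustration, a Dutch language check of this kind could look like the sketch below, using the `langdetect` package (the function name, thresholds, and field handling are assumptions, not the authors' implementation):
```python
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make language detection deterministic across runs

def keep_example(instruction: str, response: str, min_chars: int = 20) -> bool:
    """Illustrative filter: keep only non-trivial examples whose text is detected as Dutch."""
    for text in (instruction, response):
        if len(text.strip()) < min_chars:   # drop empty or near-empty fields
            return False
        try:
            if detect(text) != "nl":        # Dutch language check
                return False
        except LangDetectException:         # text too short or ambiguous to classify
            return False
    return True
```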
### Evaluation Data
The model was evaluated using:
- **[snoels/FinDutchBench](https://huggingface.co/datasets/snoels/FinDutchBench)**: A Dutch financial benchmark dataset designed to assess the model's performance on various financial tasks.
## Training Procedure
FinGEITje was trained following the methodology described in the [Alignment Handbook](https://github.com/huggingface/alignment-handbook).
### Training Configuration
- The training configuration is based on the recipe outlined in the alignment handbook and can be found in the [config_qlora.yaml](https://github.com/snoels/fingeit/blob/master/src/training/sft/config_qlora.yaml) file.
- The model was fine-tuned using **QLoRA** (quantized low-rank adaptation), which keeps the base model in 4-bit precision and trains only small adapter weights, reducing the computational resources required (an illustrative setup is sketched below).
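The authoritative settings are in the linked `config_qlora.yaml`; as a rough illustration only, a QLoRA setup combines a 4-bit quantization config for the base model with a small set of trainable LoRA adapters (the rank, alpha, and target modules below are assumptions, not the values used for FinGEITje):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit precision (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "BramVanroy/GEITje-7B-ultra", quantization_config=bnb_config, device_map="auto"
)

# Attach small trainable LoRA adapters; rank/alpha/target modules are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trained
```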
### Training Hyperparameters
The following hyperparameters were used during training (an illustrative mapping to `TrainingArguments` follows the list):
- **Learning Rate**: 0.0002
- **Train Batch Size**: 4
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Distributed Type**: Multi-GPU
- **Gradient Accumulation Steps**: 2
- **Total Train Batch Size**: 8
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Cosine
- **Warmup Ratio**: 0.1
- **Number of Epochs**: 1
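As a reference point, these hyperparameters map onto a Hugging Face `TrainingArguments` object roughly as follows (a sketch only; the actual run was driven by the alignment-handbook recipe, and the output directory name is hypothetical):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fingeitje-7b-sft",   # hypothetical output directory
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # 4 per device x 2 accumulation steps = total batch size 8
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)
```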
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 0.406 | 1.0 | 3922 | 0.3928 |
### Evaluation Package
The evaluation package defines a set of metrics per task, grouped per dataset, to assess the model's performance across different financial domains (a toy sketch of this structure follows the list of notebooks). The evaluation notebooks are available:
- **[Evaluation in Dutch](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_nl.ipynb)**: Assesses the model's performance on the Dutch financial benchmark dataset.
- **[Evaluation in English](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_en.ipynb)**: Evaluates the model's performance on English financial benchmarks for comparison purposes.
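Conceptually, the evaluation computes one metric per task and groups the scores per dataset; the toy sketch below illustrates only that structure (the metric, dataset names, and example data are made up; the notebooks define the actual tasks and metrics):
```python
from collections import defaultdict

def exact_match_accuracy(predictions, references):
    """Illustrative metric: fraction of predictions that exactly match the reference."""
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical structure: (dataset, task) -> (model predictions, gold answers)
results = {
    ("sentiment_nl", "classification"): (["positief", "negatief"], ["positief", "positief"]),
    ("headline_nl", "classification"): (["ja"], ["ja"]),
}

scores = defaultdict(dict)  # dataset -> task -> score
for (dataset, task), (preds, refs) in results.items():
    scores[dataset][task] = exact_match_accuracy(preds, refs)
print(dict(scores))
```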
### Framework Versions
- **PEFT**: 0.7.1
- **Transformers**: 4.39.0.dev0
- **PyTorch**: 2.1.2
- **Datasets**: 2.14.6
- **Tokenizers**: 0.15.2
## How to Use
FinGEITje 7B can be used with the Hugging Face Transformers library together with PEFT, which loads the LoRA adapters on top of the base model.
### Installation
Ensure you have the necessary libraries installed:
```bash
pip install torch transformers peft accelerate
```
### Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra", use_fast=False)
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained("BramVanroy/GEITje-7B-ultra", device_map='auto')
# Load the FinGEITje model with PEFT adapters
model = PeftModel.from_pretrained(base_model, "snoels/FinGEITje-7B-sft", device_map='auto')
```
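Optionally, the adapters can be merged into the base weights so the result behaves like a plain `transformers` model at inference time (a sketch; merging requires the base model to be loaded in full precision rather than 4-bit, and the save path below is hypothetical):
```python
# Fold the LoRA adapters into the base weights and drop the PEFT wrapper
merged_model = model.merge_and_unload()
merged_model.save_pretrained("fingeitje-7b-sft-merged")  # hypothetical local path
tokenizer.save_pretrained("fingeitje-7b-sft-merged")
```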
### Generating Text
```python
# Prepare the input (Dutch for: "What are the latest trends in the Dutch banking sector?")
input_text = "Wat zijn de laatste trends in de Nederlandse banksector?"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)
# Generate a response
outputs = model.generate(input_ids, max_length=200, num_return_sequences=1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
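Because the base model is a chat-tuned model, wrapping the prompt in the tokenizer's chat template (assuming one is defined for this tokenizer) will generally produce better answers than a raw prompt; a minimal sketch:
```python
# Chat-formatted prompt ("What are the latest trends in the Dutch banking sector?")
messages = [{"role": "user", "content": "Wat zijn de laatste trends in de Nederlandse banksector?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```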
## Limitations and Future Work
While FinGEITje 7B demonstrates significant improvements in understanding and generating Dutch financial content, certain limitations exist:
- **Data Cutoff**: The model's knowledge is limited to the data it was trained on and may not include the most recent developments in the financial sector.
- **Accuracy Concerns**: The model may generate incorrect or outdated information. Users should verify critical information with reliable sources.
- **Biases**: Potential biases in the training data may affect the neutrality and fairness of the model's responses.
- **Language Scope**: Primarily designed for Dutch; performance in other languages is not optimized.
- **Ethical Use**: Users should ensure that the model's outputs comply with ethical standards and do not promote misinformation or harmful content.
### Future Work
- **Data Updates**: Incorporate more recent and diverse financial datasets to keep the model up-to-date.
- **Bias Mitigation**: Implement techniques to identify and reduce biases in the model's outputs.
- **Performance Enhancement**: Fine-tune on more specialized financial topics and complex financial tasks.
- **Multilingual Expansion**: Extend support to other languages relevant to the financial sector in the Netherlands and Europe.
## Acknowledgements
We would like to thank:
- **Rijgersberg** ([GitHub](https://github.com/Rijgersberg)) for creating [GEITje](https://github.com/Rijgersberg/GEITje), one of the first Dutch foundation models, and for contributing significantly to the development of Dutch language models.
- **Bram Vanroy** ([GitHub](https://github.com/BramVanroy)) for creating [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra), an open-source Dutch chat model, and for sharing training, translation, and evaluation resources.
- **Contributors of the [Alignment Handbook](https://github.com/huggingface/alignment-handbook)** for providing valuable resources that guided the development and training process of FinGEITje.
- **Silverfin** for their collaboration in this research. Silverfin, a Belgian scale-up focused on building an accountancy cloud service, provided valuable insights and resources that were instrumental in the development of FinGEITje. More about their work can be found at [Silverfin](https://silverfin.com/).
## Citation
If you use FinGEITje in your work, please cite [the paper](https://arxiv.org/abs/2410.12835):
```bibtex
@article{FinGEITje2024,
title={A Dutch Financial Large Language Model},
author={Noels, Sander and De Blaere, Jorne and De Bie, Tijl},
journal={arXiv preprint arXiv:2410.12835},
year={2024},
url={https://arxiv.org/abs/2410.12835}
}
```
## License
This model is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license.
## Contact
For any inquiries or questions, please contact [Sander Noels](mailto:sander.noels@ugent.be). |