|
--- |
|
license: cc-by-nc-4.0 |
|
library_name: peft |
|
tags: |
|
- alignment-handbook |
|
- generated_from_trainer |
|
- trl |
|
- sft |
|
- geitje |
|
- fingeitje |
|
- dutch |
|
- nl |
|
- finance |
|
base_model: BramVanroy/GEITje-7B-ultra |
|
datasets: |
|
- snoels/FinGEITje-sft |
|
model-index: |
|
- name: snoels/FinGEITje-7B-sft |
|
results: [] |
|
language: |
|
- nl |
|
pipeline_tag: text-generation |
|
inference: false |
|
--- |
|
|
|
<p align="center" style="margin:0;padding:0"> |
|
<img src="https://huggingface.co/snoels/FinGEITje-7B-sft/resolve/main/fingeitje-banner.png" alt="FinGEITje Banner" width="1000"/> |
|
</p> |
|
|
|
<div style="margin:auto; text-align:center"> |
|
<h1 style="margin-bottom: 0; font-size: 2em;">FinGEITje 7B</h1>
|
<em style="font-size: 1em;">A large open Dutch financial language model.</em>
|
</div> |
|
|
|
This model is a fine-tuned version of [BramVanroy/GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset. |
|
|
|
## Model Description
|
|
|
FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, built on GEITje-7B-ultra, which is in turn derived from Mistral 7B. It has been further trained on Dutch financial texts, improving both its proficiency in Dutch and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the financial domain.
|
|
|
## Training and Evaluation Data
|
|
|
### Training Data |
|
|
|
FinGEITje 7B was fine-tuned on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset, which consists of translated and processed Dutch financial texts. This dataset includes a wide range of financial topics and instruction tuning data. |
|
|
|
#### Data Processing Steps |
|
|
|
1. **Translation**: Original instruction tuning datasets were translated into Dutch using a specialized translation service to maintain the integrity of financial terminology. |
|
2. **Post-processing**: The translated data underwent post-processing to correct any translation inconsistencies and to format it according to the original dataset structure. |
|
3. **Formatting**: The data was formatted to match the style and requirements of instruction tuning datasets, ensuring compatibility with the fine-tuning process. |
|
4. **Filtering**: A Dutch language check and predefined validation checks were applied to filter out any low-quality or irrelevant data. |
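
As an illustration of the language check in step 4, a filter along these lines can be applied with an off-the-shelf detector. This is a minimal sketch, not the project's actual validation code; the split name, the chat-style column layout, and the choice of `langdetect` are assumptions made for the example:

```python
from datasets import load_dataset
from langdetect import detect


def looks_dutch(text: str) -> bool:
    """Return True if the language detector classifies the text as Dutch."""
    try:
        return detect(text) == "nl"
    except Exception:  # detection can fail on very short or empty strings
        return False


# Split and column names below are illustrative assumptions.
dataset = load_dataset("snoels/FinGEITje-sft", split="train")
dataset = dataset.filter(lambda ex: looks_dutch(ex["messages"][-1]["content"]))
```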
|
|
|
### Evaluation Data |
|
|
|
The model was evaluated using: |
|
|
|
- **[snoels/FinDutchBench](https://huggingface.co/datasets/snoels/FinDutchBench)**: A Dutch financial benchmark dataset designed to assess the model's performance on various financial tasks. |
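
FinDutchBench can be pulled directly from the Hugging Face Hub. The sketch below lists the benchmark's task configurations instead of assuming their names:

```python
from datasets import get_dataset_config_names, load_dataset

# Discover the benchmark's task configurations instead of hard-coding them.
configs = get_dataset_config_names("snoels/FinDutchBench")
print(configs)

# Load one task subset; splits and columns vary per task, so inspect them first.
task = load_dataset("snoels/FinDutchBench", configs[0])
print(task)
```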
|
|
|
## Training Procedure
|
|
|
FinGEITje was trained following the methodology described in the [Alignment Handbook](https://github.com/huggingface/alignment-handbook). |
|
|
|
### Training Configuration |
|
|
|
- The training configuration is based on the recipe outlined in the alignment handbook and can be found in the [config_qlora.yaml](https://github.com/snoels/fingeit/blob/master/src/training/sft/config_qlora.yaml) file. |
|
- The model was further trained using **QLoRA** (Quantized LoRA) for efficient fine-tuning with reduced computational resources. |
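
For reference, a QLoRA setup of this kind can be expressed with `bitsandbytes` and PEFT. The sketch below is schematic: the rank, alpha, dropout, and target modules are placeholders rather than the values from `config_qlora.yaml`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "BramVanroy/GEITje-7B-ultra",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters trained on top of the quantized weights.
# r, lora_alpha, lora_dropout and target_modules are illustrative placeholders.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, peft_config)
```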
|
|
|
### Training Hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
|
|
- **Learning Rate**: 0.0002 |
|
- **Train Batch Size**: 4 |
|
- **Evaluation Batch Size**: 8 |
|
- **Seed**: 42 |
|
- **Distributed Type**: Multi-GPU |
|
- **Gradient Accumulation Steps**: 2 |
|
- **Total Train Batch Size**: 8 |
|
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08 |
|
- **LR Scheduler Type**: Cosine |
|
- **Warmup Ratio**: 0.1 |
|
- **Number of Epochs**: 1 |
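
Mapped onto `transformers.TrainingArguments`, these settings look roughly as follows. This is only a sketch: the authoritative recipe is the linked `config_qlora.yaml`, and the output path and precision flag are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fingeitje-7b-sft",   # placeholder output path
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # 4 x 2 = total train batch size of 8
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # assumption: bfloat16 mixed precision
)
```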
|
|
|
### Training Results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | |
|
|---------------|-------|------|-----------------| |
|
| 0.406 | 1.0 | 3922 | 0.3928 | |
|
|
|
### Evaluation Package |
|
|
|
The evaluation package defines a set of metrics per task, grouped per dataset, to evaluate the model's performance across different financial domains. The following evaluation notebooks are available:
|
|
|
- **[Evaluation in Dutch](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_nl.ipynb)**: Assesses the model's performance on the Dutch financial benchmark dataset. |
|
- **[Evaluation in English](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_en.ipynb)**: Evaluates the model's performance on English financial benchmarks for comparison purposes. |
|
|
|
### Framework Versions |
|
|
|
- **PEFT**: 0.7.1 |
|
- **Transformers**: 4.39.0.dev0 |
|
- **PyTorch**: 2.1.2 |
|
- **Datasets**: 2.14.6 |
|
- **Tokenizers**: 0.15.2 |
|
|
|
## How to Use
|
|
|
FinGEITje 7B can be used with the Hugging Face Transformers library together with PEFT, which loads the LoRA adapters on top of the base model.
|
|
|
### Installation |
|
|
|
Ensure you have the necessary libraries installed: |
|
|
|
```bash |
|
pip install torch transformers peft accelerate |
|
``` |
|
|
|
### Loading the Model |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
from peft import PeftModel |
|
|
|
# Load the tokenizer |
|
tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra", use_fast=False) |
|
|
|
# Load the base model |
|
base_model = AutoModelForCausalLM.from_pretrained("BramVanroy/GEITje-7B-ultra", device_map='auto') |
|
|
|
# Load the FinGEITje model with PEFT adapters |
|
model = PeftModel.from_pretrained(base_model, "snoels/FinGEITje-7B-sft", device_map='auto') |
|
``` |
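
### Merging the Adapters (Optional)

If a standalone model is preferred (for example, for faster inference or export), the adapters can optionally be merged into the base weights with PEFT's `merge_and_unload`. The local output path is only illustrative; if you merge, use `merged_model` in place of `model` below:

```python
# Merge the LoRA adapters into the base weights and drop the PEFT wrapper.
merged_model = model.merge_and_unload()

merged_model.save_pretrained("fingeitje-7b-sft-merged")
tokenizer.save_pretrained("fingeitje-7b-sft-merged")
```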
|
|
|
### Generating Text |
|
|
|
```python |
|
# Prepare the input |
|
input_text = "Wat zijn de laatste trends in de Nederlandse banksector?" |
|
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device) |
|
|
|
# Generate a response |
|
outputs = model.generate(input_ids, max_length=200, num_return_sequences=1) |
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
|
print(response) |
|
``` |
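
### Generating with the Chat Template

Because the underlying GEITje-7B-ultra is a chat-tuned model, wrapping the prompt in the tokenizer's chat template may yield better-formed answers. A minimal sketch, assuming the base tokenizer ships a chat template:

```python
messages = [
    {"role": "user", "content": "Wat zijn de laatste trends in de Nederlandse banksector?"}
]

# Build the prompt with the chat template and decode only the newly generated tokens.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```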
|
|
|
## Limitations and Future Work
|
|
|
While FinGEITje 7B demonstrates significant improvements in understanding and generating Dutch financial content, certain limitations exist: |
|
|
|
- **Data Cutoff**: The model's knowledge is limited to the data it was trained on and may not include the most recent developments in the financial sector. |
|
- **Accuracy Concerns**: The model may generate incorrect or outdated information. Users should verify critical information with reliable sources. |
|
- **Biases**: Potential biases in the training data may affect the neutrality and fairness of the model's responses. |
|
- **Language Scope**: Primarily designed for Dutch; performance in other languages is not optimized. |
|
- **Ethical Use**: Users should ensure that the model's outputs comply with ethical standards and do not promote misinformation or harmful content. |
|
|
|
### Future Work |
|
|
|
- **Data Updates**: Incorporate more recent and diverse financial datasets to keep the model up-to-date. |
|
- **Bias Mitigation**: Implement techniques to identify and reduce biases in the model's outputs. |
|
- **Performance Enhancement**: Fine-tune on more specialized financial topics and complex financial tasks. |
|
- **Multilingual Expansion**: Extend support to other languages relevant to the financial sector in the Netherlands and Europe. |
|
|
|
## Acknowledgements
|
|
|
We would like to thank: |
|
|
|
- **Rijgersberg** ([GitHub](https://github.com/Rijgersberg)) for creating [GEITje](https://github.com/Rijgersberg/GEITje), one of the first Dutch foundation models, and for contributing significantly to the development of Dutch language models. |
|
- **Bram Vanroy** ([GitHub](https://github.com/BramVanroy)) for creating [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra), an open-source Dutch chat model, and for sharing training, translation, and evaluation resources. |
|
- **Contributors of the [Alignment Handbook](https://github.com/huggingface/alignment-handbook)** for providing valuable resources that guided the development and training process of FinGEITje. |
|
- **Silverfin** for their collaboration in this research. Silverfin, a Belgian scale-up focused on building an accountancy cloud service, provided valuable insights and resources that were instrumental in the development of FinGEITje. More about their work can be found at [Silverfin](https://silverfin.com/). |
|
|
|
## Citation
|
[Link to the paper](https://arxiv.org/abs/2410.12835) |
|
|
|
If you use FinGEITje in your work, please cite: |
|
|
|
```bibtex |
|
@article{FinGEITje2024, |
|
title={A Dutch Financial Large Language Model}, |
|
author={Noels, Sander and De Blaere, Jorne and De Bie, Tijl}, |
|
journal={arXiv preprint arXiv:2410.12835}, |
|
year={2024}, |
|
url={https://arxiv.org/abs/2410.12835} |
|
} |
|
``` |
|
|
|
## License
|
|
|
This model is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license. |
|
|
|
## Contact
|
|
|
For any inquiries or questions, please contact [Sander Noels](mailto:sander.noels@ugent.be). |