---
license: cc-by-nc-4.0
library_name: peft
tags:
- alignment-handbook
- generated_from_trainer
- trl
- sft
- geitje
- fingeitje
- dutch
- nl
- finance
base_model: BramVanroy/GEITje-7B-ultra
datasets:
- snoels/FinGEITje-sft
model-index:
- name: snoels/FinGEITje-7B-sft
results: []
language:
- nl
pipeline_tag: text-generation
inference: false
---
<p align="center" style="margin:0;padding:0">
<img src="https://huggingface.co/snoels/FinGEITje-7B-sft/resolve/main/fingeitje-banner.png" alt="FinGEITje Banner" width="1000"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0; font-size: 2em;">🐐 FinGEITje 7B</h1>
  <em style="font-size: 1em;">A large open Dutch financial language model.</em>
</div>
This model is a fine-tuned version of [BramVanroy/GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset.
## πŸ“– Model Description
FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance.
## πŸ“Š Training and Evaluation Data
### Training Data
FinGEITje 7B was fine-tuned on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset, which consists of translated and processed Dutch financial texts. This dataset includes a wide range of financial topics and instruction tuning data.
#### Data Processing Steps
1. **Translation**: Original instruction tuning datasets were translated into Dutch using a specialized translation service to maintain the integrity of financial terminology.
2. **Post-processing**: The translated data underwent post-processing to correct any translation inconsistencies and to format it according to the original dataset structure.
3. **Formatting**: The data was formatted to match the style and requirements of instruction tuning datasets, ensuring compatibility with the fine-tuning process.
4. **Filtering**: A Dutch language check and predefined validation checks were applied to filter out any low-quality or irrelevant data (a sketch of this step follows below).
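For illustration, a minimal sketch of the Dutch language check in step 4 might look like the following. The `langdetect` dependency and the record fields are assumptions made for this example, not the project's actual implementation, which lives in the project repository:

```python
# Illustrative sketch of the Dutch language check (step 4); the
# `langdetect` dependency and the record fields are assumptions.
from langdetect import LangDetectException, detect

records = [
    {"instruction": "Wat is een obligatie?", "response": "Een obligatie is een verhandelbaar schuldbewijs."},
    {"instruction": "What is a bond?", "response": "A bond is a tradable debt security."},
]

def is_dutch(record: dict) -> bool:
    """Keep only records whose instruction and response are both detected as Dutch."""
    try:
        return all(detect(record[key]) == "nl" for key in ("instruction", "response"))
    except LangDetectException:
        return False  # empty or undetectable text fails the check

filtered = [r for r in records if is_dutch(r)]
print(f"kept {len(filtered)} of {len(records)} records")
```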
### Evaluation Data
The model was evaluated using:
- **[snoels/FinDutchBench](https://huggingface.co/datasets/snoels/FinDutchBench)**: A Dutch financial benchmark dataset designed to assess the model's performance on various financial tasks (a loading sketch follows below).
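The benchmark can be pulled from the Hugging Face Hub with the `datasets` library. Note that any configuration or split names are left to the defaults here; check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Load the Dutch financial benchmark from the Hub.
# Subset/split names may differ; see the dataset card for details.
bench = load_dataset("snoels/FinDutchBench")
print(bench)
```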
## βš™οΈ Training Procedure
FinGEITje was trained following the methodology described in the [Alignment Handbook](https://github.com/huggingface/alignment-handbook).
### Training Configuration
- The training configuration is based on the recipe outlined in the alignment handbook and can be found in the [config_qlora.yaml](https://github.com/snoels/fingeit/blob/master/src/training/sft/config_qlora.yaml) file.
- The model was further trained using **QLoRA** (Quantized LoRA) for efficient fine-tuning with reduced computational resources (see the sketch below).
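In outline, QLoRA quantizes the frozen base model to 4-bit and trains low-rank adapters on top of it. A minimal sketch of such a setup is shown below; the quantization and LoRA settings are illustrative defaults, not necessarily those used in `config_qlora.yaml`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 (illustrative settings).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "BramVanroy/GEITje-7B-ultra",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable low-rank adapters; rank and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```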
### Training Hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto `TrainingArguments` follows the list):
- **Learning Rate**: 0.0002
- **Train Batch Size**: 4
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Distributed Type**: Multi-GPU
- **Gradient Accumulation Steps**: 2
- **Total Train Batch Size**: 8
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type**: Cosine
- **Warmup Ratio**: 0.1
- **Number of Epochs**: 1
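For reference, these values map onto `transformers.TrainingArguments` roughly as follows. This is a sketch; the authoritative configuration is the linked `config_qlora.yaml`:

```python
from transformers import TrainingArguments

# Rough mapping of the hyperparameters above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="fingeitje-7b-sft",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # total train batch size: 4 x 2 = 8
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",             # Adam with betas=(0.9, 0.999), eps=1e-8
)
```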
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 0.406 | 1.0 | 3922 | 0.3928 |
### Evaluation Package
The evaluation package defines a set of metrics per task, grouped per dataset, to evaluate the model's performance across different financial domains. The following evaluation notebooks are available:
- **[Evaluation in Dutch](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_nl.ipynb)**: Assesses the model's performance on the Dutch financial benchmark dataset.
- **[Evaluation in English](https://github.com/snoels/fingeit/blob/master/notebooks/evaluation_en.ipynb)**: Evaluates the model's performance on English financial benchmarks for comparison purposes.
### Framework Versions
- **PEFT**: 0.7.1
- **Transformers**: 4.39.0.dev0
- **PyTorch**: 2.1.2
- **Datasets**: 2.14.6
- **Tokenizers**: 0.15.2
## πŸ› οΈ How to Use
FinGEITje 7B can be used with the Hugging Face Transformers library together with PEFT to load the LoRA adapters efficiently.
### Installation
Ensure you have the necessary libraries installed:
```bash
pip install torch transformers peft accelerate
```
### Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra", use_fast=False)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained("BramVanroy/GEITje-7B-ultra", device_map='auto')

# Load the FinGEITje model with PEFT adapters
model = PeftModel.from_pretrained(base_model, "snoels/FinGEITje-7B-sft", device_map='auto')
```
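Optionally, once the adapters are loaded they can be merged into the base weights with PEFT's `merge_and_unload`, which returns a plain `transformers` model and removes the adapter overhead at inference time:

```python
# Merge the LoRA adapters into the base weights for faster inference.
model = model.merge_and_unload()
```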
### Generating Text
```python
# Prepare the input
input_text = "Wat zijn de laatste trends in de Nederlandse banksector?"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)

# Generate a response (note: max_length counts the prompt tokens as well)
outputs = model.generate(input_ids, max_length=200, num_return_sequences=1)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
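Because the base model is a chat-tuned model, prompts will generally work better when wrapped in the tokenizer's chat template. A sketch, assuming the tokenizer ships such a template:

```python
# Format the prompt with the tokenizer's chat template
# (assumption: the tokenizer defines one, as chat-tuned models usually do).
messages = [{"role": "user", "content": "Wat zijn de laatste trends in de Nederlandse banksector?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```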
## 🚧 Limitations and Future Work
While FinGEITje 7B demonstrates significant improvements in understanding and generating Dutch financial content, certain limitations exist:
- **Data Cutoff**: The model's knowledge is limited to the data it was trained on and may not include the most recent developments in the financial sector.
- **Accuracy Concerns**: The model may generate incorrect or outdated information. Users should verify critical information with reliable sources.
- **Biases**: Potential biases in the training data may affect the neutrality and fairness of the model's responses.
- **Language Scope**: Primarily designed for Dutch; performance in other languages is not optimized.
- **Ethical Use**: Users should ensure that the model's outputs comply with ethical standards and do not promote misinformation or harmful content.
### Future Work
- **Data Updates**: Incorporate more recent and diverse financial datasets to keep the model up-to-date.
- **Bias Mitigation**: Implement techniques to identify and reduce biases in the model's outputs.
- **Performance Enhancement**: Fine-tune on more specialized financial topics and complex financial tasks.
- **Multilingual Expansion**: Extend support to other languages relevant to the financial sector in the Netherlands and Europe.
## πŸ™ Acknowledgements
We would like to thank:
- **Rijgersberg** ([GitHub](https://github.com/Rijgersberg)) for creating [GEITje](https://github.com/Rijgersberg/GEITje), one of the first Dutch foundation models, and for contributing significantly to the development of Dutch language models.
- **Bram Vanroy** ([GitHub](https://github.com/BramVanroy)) for creating [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra), an open-source Dutch chat model, and for sharing training, translation, and evaluation resources.
- **Contributors of the [Alignment Handbook](https://github.com/huggingface/alignment-handbook)** for providing valuable resources that guided the development and training process of FinGEITje.
- **Silverfin** for their collaboration in this research. Silverfin, a Belgian scale-up focused on building an accountancy cloud service, provided valuable insights and resources that were instrumental in the development of FinGEITje. More about their work can be found at [Silverfin](https://silverfin.com/).
## πŸ“ Citation
If you use FinGEITje in your work, please cite [the paper](https://arxiv.org/abs/2410.12835):
```bibtex
@article{FinGEITje2024,
  title={A Dutch Financial Large Language Model},
  author={Noels, Sander and De Blaere, Jorne and De Bie, Tijl},
  journal={arXiv preprint arXiv:2410.12835},
  year={2024},
  url={https://arxiv.org/abs/2410.12835}
}
```
## πŸ“œ License
This model is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license.
## πŸ“§ Contact
For any inquiries or questions, please contact [Sander Noels](mailto:sander.noels@ugent.be).