snoels committed
Commit d602d9f · verified · 1 Parent(s): b19ceaf

Update README.md

Files changed (1)
  1. README.md +220 -35
README.md CHANGED
@@ -6,65 +6,250 @@ tags:
  - generated_from_trainer
  - trl
  - sft
  base_model: BramVanroy/GEITje-7B-ultra
  datasets:
  - snoels/FinGEITje-sft
  model-index:
- - name: fingeitje
  results: []
  language:
  - nl
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # fingeitje

- This model is a fine-tuned version of [BramVanroy/GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) on the ice-hands/fingeit_training_3 dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.3928

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

  The following hyperparameters were used during training:
- - learning_rate: 0.0002
- - train_batch_size: 4
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 8
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1
-
- ### Training results

  | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:----:|:---------------:|
- | 0.406 | 1.0 | 3922 | 0.3928 |


- ### Framework versions

- - PEFT 0.7.1
- - Transformers 4.39.0.dev0
- - Pytorch 2.1.2
- - Datasets 2.14.6
- - Tokenizers 0.15.2

  - generated_from_trainer
  - trl
  - sft
+ - geitje
+ - fingeitje
+ - dutch
+ - nl
+ - finance
  base_model: BramVanroy/GEITje-7B-ultra
  datasets:
  - snoels/FinGEITje-sft
  model-index:
+ - name: snoels/FinGEITje-7B-sft
  results: []
  language:
  - nl
+ pipeline_tag: text-generation
+ inference: false
  ---

+ <p align="center" style="margin:0;padding:0">
+ <img src="https://huggingface.co/snoels/FinGEITje-7B/resolve/main/fingeitje-logo.jpeg" alt="FinGEITje logo" width="800"/>
+ </p>
+
+ <div style="margin:auto; text-align:center">
+ <h1 style="margin-bottom: 0">🐐 FinGEITje 7B</h1>
+ <em>A large open Dutch financial language model.</em>
+ </div>

+ This model is a fine-tuned version of [BramVanroy/GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra) on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset.

  It achieves the following results on the evaluation set:
+ - **Loss**: 0.3928
+
+ ## Model Description
+
+ FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, improving its proficiency in Dutch and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the financial domain, including banking, investment, insurance, and economic policy.
+
+ FinGEITje builds on the GEITje models, which are Dutch language models based on Mistral 7B that have been further pretrained on extensive Dutch corpora.
+
+ ## Intended Uses & Limitations
+
+ ### Intended Use
+
+ - **Educational Purposes**: Researching Dutch financial language patterns, generating financial texts for study, or aiding in financial education.
+ - **Content Generation**: Assisting in drafting Dutch financial reports, summaries, articles, or other finance-related content.
+ - **Financial Chatbots**: Integrating into chatbots to provide responses related to Dutch financial matters, customer service, or financial advice simulations.
+ - **Financial Analysis**: Supporting analysis by generating insights or summarizing financial data in Dutch.
+
+ ### Limitations
+
+ - **Not for Real-time Financial Decisions**: The model does not have access to real-time data and may not reflect the most current financial regulations or market conditions. It should not be used for making actual financial decisions.
+ - **Accuracy**: While trained on financial data, the model may still produce incorrect or nonsensical answers, especially for prompts outside its training scope.
+ - **Bias**: The model may reflect biases present in its training data, potentially affecting the neutrality of its responses.
+ - **Ethical Use**: Users should ensure that the model's outputs comply with ethical standards and do not promote misinformation or harmful content.
+
+ ### Ethical Considerations
+
+ - **License Restrictions**: FinGEITje is released under a non-commercial license (CC BY-NC 4.0). Commercial use is prohibited without prior permission.
+ - **Misinformation**: Users should verify the information generated by the model, particularly for critical applications.
+ - **Data Privacy**: The model should not be used to generate or infer personally identifiable information or confidential data.
+
+ ## Training and Evaluation Data
+
+ ### Training Data
+
+ FinGEITje 7B was fine-tuned on the [snoels/FinGEITje-sft](https://huggingface.co/datasets/snoels/FinGEITje-sft) dataset, which consists of translated and processed Dutch financial texts. The dataset covers a wide range of financial topics and instruction-tuning data, giving the model broad coverage of the financial domain in Dutch.

+ #### Data Processing Steps

+ 1. **Translation**: Original instruction tuning datasets were translated into Dutch using a specialized translation service to maintain the integrity of financial terminology.
+ 2. **Post-processing**: The translated data underwent post-processing to correct any translation inconsistencies and to format it according to the original dataset structure.
+ 3. **Formatting**: The data was formatted to match the style and requirements of instruction tuning datasets, ensuring compatibility with the fine-tuning process.
+ 4. **Filtering**: A Dutch language check and predefined validation checks were applied to filter out low-quality or irrelevant data.
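The card does not say how the Dutch language check or the validation checks in step 4 were implemented. As a rough illustration only, the sketch below filters instruction examples with a stopword-ratio heuristic. The field names `instruction` and `output`, the stopword list, and the 0.2 threshold are all assumptions made for this example, not part of the FinGEITje pipeline.

```python
# Hypothetical stopword list; a real pipeline would more likely use a language-ID model.
DUTCH_STOPWORDS = {
    "de", "het", "een", "en", "van", "is", "dat",
    "op", "te", "in", "niet", "voor", "met", "zijn",
}


def looks_dutch(text, min_ratio=0.2):
    """Heuristic language check: share of tokens that are common Dutch stopwords."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    hits = sum(t in DUTCH_STOPWORDS for t in tokens)
    return hits / len(tokens) >= min_ratio


def is_valid_example(example):
    """Validation check: both fields are non-empty and the text looks Dutch."""
    instruction = example.get("instruction", "").strip()
    output = example.get("output", "").strip()
    if not instruction or not output:
        return False
    return looks_dutch(instruction + " " + output)


def filter_examples(examples):
    """Keep only the examples that pass the validation check."""
    return [ex for ex in examples if is_valid_example(ex)]


samples = [
    {"instruction": "Wat is de rente op een hypotheek?",
     "output": "De rente is het bedrag dat je betaalt voor een lening."},
    {"instruction": "What is the interest rate on a mortgage?",
     "output": "The rate you pay for a loan."},
    {"instruction": "Wat is inflatie?", "output": ""},
]
kept = filter_examples(samples)
print(len(kept))
```

Only the first sample survives here: the second fails the language heuristic and the third has an empty output.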

+ ### Evaluation Data

+ The model was evaluated using:

+ - **[snoels/FinDutchBench](https://huggingface.co/datasets/snoels/FinDutchBench)**: A Dutch financial benchmark dataset designed to assess the model's performance on various financial tasks. It includes tasks that test the model's understanding of financial concepts, its ability to generate financial texts, and its comprehension of Dutch financial regulations.

+ ## Training Procedure

+ FinGEITje was trained following the methodology described in the [Alignment Handbook](https://github.com/huggingface/alignment-handbook).

+ ### Training Configuration
+
+ - The training configuration is based on the recipe outlined in the Alignment Handbook and can be found in the `config_qlora.yml` file.
+ - The model was further trained using **QLoRA** (Quantized LoRA) for efficient fine-tuning with reduced computational resources.
+ - Training was conducted on a multi-GPU setup to handle the large model size and extensive dataset.
+
+ ### Training Hyperparameters

  The following hyperparameters were used during training:
+
+ - **Learning Rate**: 0.0002
+ - **Train Batch Size**: 4
+ - **Evaluation Batch Size**: 8
+ - **Seed**: 42
+ - **Distributed Type**: Multi-GPU
+ - **Gradient Accumulation Steps**: 2
+ - **Total Train Batch Size**: 8
+ - **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
+ - **LR Scheduler Type**: Cosine
+ - **Warmup Ratio**: 0.1
+ - **Number of Epochs**: 1
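To make the scheduler settings concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule using the values listed above. It assumes that the 3922 steps reported in the training results are the total number of optimizer steps and that the warmup length is the total step count times the warmup ratio, rounded up. The function name `lr_at` is illustrative and is not taken from the training code.

```python
import math

PEAK_LR = 2e-4        # learning rate from the list above
WARMUP_RATIO = 0.1    # warmup ratio from the list above
TOTAL_STEPS = 3922    # assumption: total optimizer steps for the single epoch


def lr_at(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 200, 393, 2000, 3922):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate climbs for the first 393 steps, peaks at 0.0002, and then decays smoothly to zero by the final step.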
+
+ ### Training Results

  | Training Loss | Epoch | Step | Validation Loss |
+ |---------------|-------|------|-----------------|
+ | 0.406 | 1.0 | 3922 | 0.3928 |
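If the validation loss is read as the mean per-token cross-entropy in nats (an assumption; the card does not state the reduction), it can be converted to a perplexity for easier comparison with other language models. A minimal sketch:

```python
import math

validation_loss = 0.3928  # validation loss from the table above

# Assumption: the loss is the mean per-token cross-entropy measured in nats
perplexity = math.exp(validation_loss)
print(f"perplexity = {perplexity:.2f}")
```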
+
+ ### Evaluation Package
+
+ The evaluation package includes a set of metrics defined per task, grouped per dataset to evaluate the model's performance across different financial domains. The evaluation notebooks are available:
+
+ - **[Evaluation in Dutch](https://github.com/your-repo/evaluation_nl.ipynb)**: Assesses the model's performance on the Dutch financial benchmark dataset.
+ - **[Evaluation in English](https://github.com/your-repo/evaluation_en.ipynb)**: Evaluates the model's performance on English financial benchmarks for comparison purposes.
+
+ ### Framework Versions
+
+ - **PEFT**: 0.7.1
+ - **Transformers**: 4.39.0.dev0
+ - **PyTorch**: 2.1.2
+ - **Datasets**: 2.14.6
+ - **Tokenizers**: 0.15.2
+
+ ## How to Use
+
+ FinGEITje 7B can be used with the Hugging Face Transformers library, with PEFT loading the LoRA adapters.
+
+ ### Installation
+
+ Ensure you have the necessary libraries installed:
+
+ ```bash
+ pip install torch transformers peft accelerate
+ ```
+
+ ### Loading the Model
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra", use_fast=False)
+
+ # Load the base model
+ base_model = AutoModelForCausalLM.from_pretrained("BramVanroy/GEITje-7B-ultra", device_map='auto')
+
+ # Load the FinGEITje model with PEFT adapters
+ model = PeftModel.from_pretrained(base_model, "snoels/FinGEITje-7B", device_map='auto')
+ ```
+
+ ### Generating Text
+
+ ```python
+ # Prepare the input
+ input_text = "Wat zijn de laatste trends in de Nederlandse banksector?"
+ input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)
+
+ # Generate a response
+ outputs = model.generate(input_ids, max_length=200, num_return_sequences=1)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(response)
+ ```
+
+ ### Interactive Conversation
+
+ ```python
+ # Note: `Conversation` and the "conversational" pipeline are available in the
+ # Transformers version listed above (4.39) but were removed in later releases.
+ from transformers import pipeline, Conversation
+
+ # Initialize the conversational pipeline with the model and tokenizer loaded above
+ chatbot = pipeline(
+     "conversational",
+     model=model,
+     tokenizer=tokenizer,
+     device_map='auto'
+ )
+
+ # Start a conversation
+ start_messages = [
+     {"role": "system", "content": "Je bent een behulpzame financiële assistent."},
+     {"role": "user", "content": "Kun je me iets vertellen over de huidige rentevoeten voor hypotheekleningen in Nederland?"}
+ ]
+
+ # add_message expects a single {"role": ..., "content": ...} dict
+ conversation = Conversation()
+ for msg in start_messages:
+     conversation.add_message(msg)
+
+ # Get the assistant's response (the pipeline returns the updated Conversation)
+ conversation = chatbot(conversation)
+ print(conversation.messages[-1]["content"])
+ ```
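On Transformers versions that no longer ship the conversational pipeline, one alternative is to format the messages with the tokenizer's chat template and call `generate` directly. The helper below is a sketch, not part of the model card: `generate_reply` is a hypothetical name, and it assumes the tokenizer defines a chat template and that `model` and `tokenizer` were loaded as shown in the loading section.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=200):
    """Format chat messages with the tokenizer's chat template and generate one reply."""
    # Render the conversation to a prompt string that ends with the assistant turn marker
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # The template already contains any special tokens, so none are added here
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

For example, `print(generate_reply(model, tokenizer, start_messages))` would answer the same system and user messages used in the conversation example.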
+
+ ## Limitations and Future Work
+
+ While FinGEITje 7B demonstrates significant improvements in understanding and generating Dutch financial content, it has the following limitations:
+
+ - **Data Cutoff**: The model's knowledge is limited to the data it was trained on and may not include the most recent developments in the financial sector.
+ - **Accuracy Concerns**: The model may generate incorrect or outdated information. Users should verify critical information with reliable sources.
+ - **Biases**: Potential biases in the training data may affect the neutrality and fairness of the model's responses.
+ - **Language Scope**: The model is designed primarily for Dutch; performance in other languages is not optimized.
+
+ ### Future Work
+
+ - **Data Updates**: Incorporate more recent and diverse financial datasets to keep the model up to date.
+ - **Bias Mitigation**: Implement techniques to identify and reduce biases in the model's outputs.
+ - **Performance Enhancement**: Fine-tune on more specialized financial topics and complex financial tasks.
+ - **Multilingual Expansion**: Extend support to other languages relevant to the financial sector in the Netherlands and Europe.
+
+ ## Acknowledgements
+
+ We would like to thank:
+
+ - **Rijgersberg** ([GitHub](https://github.com/Rijgersberg)) for creating [GEITje](https://github.com/Rijgersberg/GEITje), one of the first Dutch foundation models, and for contributing significantly to the development of Dutch language models.
+ - **Bram Vanroy** ([GitHub](https://github.com/BramVanroy)) for creating [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra), an open-source Dutch chat model, and for sharing training, translation, and evaluation resources.
+ - **Contributors of the Alignment Handbook** for providing valuable resources that guided the development and training process of FinGEITje.
+
+ ## Citation
+
+ If you use FinGEITje in your work, please cite:
+
+ ```bibtex
+ @article{FinGEITje2024,
+   title={A Dutch Financial Large Language Model},
+   author={Noels, Sander and De Blaere, Jorne and De Bie, Tijl},
+   journal={arXiv preprint arXiv:xxxx.xxxxx},
+   year={2024},
+   url={https://arxiv.org/abs/xxxx.xxxxx}
+ }
+ ```
+
+ ## License
+
+ This model is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license.
+
+ ### Summary
+
+ - **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
+ - **NonCommercial**: You may not use the material for commercial purposes.

+ For the full license text, please refer to the [license document](https://creativecommons.org/licenses/by-nc/4.0/legalcode).

+ ## Contact

+ For any inquiries or questions, please contact [Sander Noels](mailto:sander.noels@ugent.be).