---
tags:
- generated_from_trainer
model-index:
- name: distilgpt2-finetuned-finance
  results: []
license: apache-2.0
datasets:
- causal-lm/finance
- gbharti/finance-alpaca
- PaulAdversarial/all_news_finance_sm_1h2023
- winddude/reddit_finance_43_250k
language:
- en
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# distilgpt2-finetuned-finance

This model is a fine-tuned version of [distilgpt2](https://huggingface.co/distilgpt2) on a combination of four finance datasets:

- [causal-lm/finance](https://huggingface.co/datasets/causal-lm/finance)
- [gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
- [PaulAdversarial/all_news_finance_sm_1h2023](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023)
- [winddude/reddit_finance_43_250k](https://huggingface.co/datasets/winddude/reddit_finance_43_250k)

## Training and evaluation data

The combined dataset can be reproduced with the following code:

```python
from datasets import load_dataset, concatenate_datasets

# Load the four source datasets
dataset_1 = load_dataset("gbharti/finance-alpaca")
dataset_2 = load_dataset("PaulAdversarial/all_news_finance_sm_1h2023")
dataset_3 = load_dataset("winddude/reddit_finance_43_250k")
dataset_4 = load_dataset("causal-lm/finance")

# Create a unified "text" column in each dataset, then drop the source columns
dataset_1 = dataset_1.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_1 = dataset_1.remove_columns(["input", "instruction", "output"])

dataset_2 = dataset_2.map(
    lambda example: {"text": example["title"] + " " + example["description"]},
    num_proc=4,
)
dataset_2 = dataset_2.remove_columns(
    ["_id", "main_domain", "title", "description", "created_at"]
)

dataset_3 = dataset_3.map(
    lambda example: {
        "text": example["title"] + " " + example["selftext"] + " " + example["body"]
    },
    num_proc=4,
)
dataset_3 = dataset_3.remove_columns(
    [
        "id",
        "title",
        "selftext",
        "z_score",
        "normalized_score",
        "subreddit",
        "body",
        "comment_normalized_score",
        "combined_score",
    ]
)

dataset_4 = dataset_4.map(
    lambda example: {"text": example["instruction"] + " " + example["output"]},
    num_proc=4,
)
dataset_4 = dataset_4.remove_columns(["input", "instruction", "output"])

# Combine all splits and carve out a held-out test set
combined_dataset = concatenate_datasets(
    [
        dataset_1["train"],
        dataset_2["train"],
        dataset_3["train"],
        dataset_4["train"],
        dataset_4["validation"],
    ]
)

datasets = combined_dataset.train_test_split(test_size=0.2)
```
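
As a quick sanity check (exact row counts depend on the dataset versions at download time), the result should be a `DatasetDict` with an 80/20 split:

```python
# Inspect the result: a DatasetDict with "train" and "test" splits,
# each exposing the unified "text" column
print(datasets)
```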

## Inference example

```python
from transformers import pipeline

generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")

generator("Tesla is",
    pad_token_id=generator.tokenizer.eos_token_id,
    max_new_tokens=200,
    num_return_sequences=2,
)

>>> [{'generated_text':
'Tesla is likely going to have a "market crash" over 20 years - I believe I\'m just not
sure how this is going to affect the world. \n\nHowever, I would like to see this play out
as a global financial crisis. With US interest rates already high, a crash in global real
estate prices means that people are likely to feel pressure on assets that are less well
served by the assets the US government gives them. \n\nWould these things help you in your
retirement? I\'m fairly new to Wall Street, and it makes me think that you should have a
bit more control over your assets (I’m not super involved in stock picking, but I’ve heard
many times that governments can help their citizens), right? As another commenter has put
it: there\'s something called a market crash that could occur in the second world country
for most markets (I don\'t know how that would fit under US laws if I had done all of the
above. \n\n'
},
{'generated_text':
"Tesla is on track to go from 1.46 to 1.79 per cent growth in Q3 (the fastest pace so far
in the US), which will push down the share price.\n\nWhile the dividend could benefit Amazon’s
growth, earnings also aren’t expected to be high at all, the company's annual earnings could
be an indication that investors have a strong plan to boost sales by the end of the year if
earnings season continues.\n\nThe latest financials showed earnings as of the end of July,
followed by the earnings guidance from analysts at the Canadian Real Estate Association, which
showed that Amazon’s revenues were up over $1.8 Trillion, which is a far cry from what was
expected in early Q1.\n\nAmazon has grown the share price by as much as 1.6 percent since June
2020. Analysts had predicted that earnings growth in the stock would drop to 0.36 per cent for
2020, which would lead to Amazon’"
}]
```
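
The same model can also be loaded without the pipeline wrapper for finer control over generation. A minimal sketch (the `do_sample=True` setting here is an illustrative choice to get varied continuations, not a documented setting of this card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lxyuan/distilgpt2-finetuned-finance")
model = AutoModelForCausalLM.from_pretrained("lxyuan/distilgpt2-finetuned-finance")

inputs = tokenizer("Tesla is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,                 # sample to get varied continuations
    num_return_sequences=2,
    pad_token_id=tokenizer.eos_token_id,
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```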

## Training procedure

Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/finetune_distilgpt2_language_model_on_finance_dataset.ipynb)
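
The full procedure lives in the notebook above. As a rough sketch, it follows the standard Hugging Face causal language modeling recipe; the `block_size` value and the bare `TrainingArguments` below are illustrative assumptions, not the notebook's exact settings:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# `datasets` is the DatasetDict built in the section above
tokenized = datasets.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
    remove_columns=["text"],
)

# Pack token ids into fixed-length blocks for causal LM training
block_size = 128  # illustrative; the notebook may use a different value

def group_texts(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [v[i : i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }

lm_datasets = tokenized.map(group_texts, batched=True)

# mlm=False: GPT-2 is a causal LM, so labels are shifted copies of the inputs
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilgpt2-finetuned-finance"),
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)
trainer.train()
```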

### Training hyperparameters

The following hyperparameters were used during training:

### Framework versions

- Transformers 4.30.2
- Pytorch 2.0.1+cu117
- Datasets 2.13.1
- Tokenizers 0.13.3