lxyuan commited on
Commit
e185be9
1 Parent(s): ef309d9

Update README.md

Files changed (1)
  1. README.md +119 -8
README.md CHANGED
@@ -4,6 +4,14 @@ tags:
 model-index:
 - name: distilgpt2-finetuned-finance
   results: []
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -11,22 +19,125 @@ should probably proofread and complete it, then remove this comment. -->

 # distilgpt2-finetuned-finance

-This model was trained from scratch on an unknown dataset.

-## Model description

-More information needed

-## Intended uses & limitations

-More information needed

-## Training and evaluation data

-More information needed

 ## Training procedure

 ### Training hyperparameters

 The following hyperparameters were used during training:
@@ -45,4 +156,4 @@ The following hyperparameters were used during training:
 - Transformers 4.30.2
 - Pytorch 2.0.1+cu117
 - Datasets 2.13.1
-- Tokenizers 0.13.3
 
 model-index:
 - name: distilgpt2-finetuned-finance
   results: []
+license: apache-2.0
+datasets:
+- causal-lm/finance
+- gbharti/finance-alpaca
+- PaulAdversarial/all_news_finance_sm_1h2023
+- winddude/reddit_finance_43_250k
+language:
+- en
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You

 # distilgpt2-finetuned-finance

+This model is a fine-tuned version of distilgpt2, trained on a combination of four finance datasets:
+- [causal-lm/finance](https://huggingface.co/datasets/causal-lm/finance)
+- [gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
+- [PaulAdversarial/all_news_finance_sm_1h2023](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023)
+- [winddude/reddit_finance_43_250k](https://huggingface.co/datasets/winddude/reddit_finance_43_250k)

+## Training and evaluation data

+The combined dataset can be reproduced with the following code:

+```python
+from datasets import load_dataset, concatenate_datasets
+
+# load the four finance datasets
+dataset_1 = load_dataset("gbharti/finance-alpaca")
+dataset_2 = load_dataset("PaulAdversarial/all_news_finance_sm_1h2023")
+dataset_3 = load_dataset("winddude/reddit_finance_43_250k")
+dataset_4 = load_dataset("causal-lm/finance")
+
+# create a column called text
+dataset_1 = dataset_1.map(
+    lambda example: {"text": example["instruction"] + " " + example["output"]},
+    num_proc=4,
+)
+dataset_1 = dataset_1.remove_columns(["input", "instruction", "output"])
+
+dataset_2 = dataset_2.map(
+    lambda example: {"text": example["title"] + " " + example["description"]},
+    num_proc=4,
+)
+dataset_2 = dataset_2.remove_columns(
+    ["_id", "main_domain", "title", "description", "created_at"]
+)
+
+dataset_3 = dataset_3.map(
+    lambda example: {
+        "text": example["title"] + " " + example["selftext"] + " " + example["body"]
+    },
+    num_proc=4,
+)
+dataset_3 = dataset_3.remove_columns(
+    [
+        "id",
+        "title",
+        "selftext",
+        "z_score",
+        "normalized_score",
+        "subreddit",
+        "body",
+        "comment_normalized_score",
+        "combined_score",
+    ]
+)
+
+dataset_4 = dataset_4.map(
+    lambda example: {"text": example["instruction"] + " " + example["output"]},
+    num_proc=4,
+)
+dataset_4 = dataset_4.remove_columns(["input", "instruction", "output"])
+
+# combine and split train/test sets
+combined_dataset = concatenate_datasets(
+    [
+        dataset_1["train"],
+        dataset_2["train"],
+        dataset_3["train"],
+        dataset_4["train"],
+        dataset_4["validation"],
+    ]
+)
+
+datasets = combined_dataset.train_test_split(test_size=0.2)
+```
+
+## Inference example
+
+```python
+from transformers import pipeline
+
+generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")
+
+generator(
+    "Tesla is",
+    pad_token_id=generator.tokenizer.eos_token_id,
+    max_new_tokens=200,
+    num_return_sequences=2,
+)
+
+>>>
+{'generated_text':
+ 'Tesla is likely going to have a "market crash" over 20 years - I believe I\'m just not
+ sure how this is going to affect the world. \n\nHowever, I would like to see this play out
+ as a global financial crisis. With US interest rates already high, a crash in global real
+ estate prices means that people are likely to feel pressure on assets that are less well
+ served by the assets the US government gives them. \n\nWould these things help you in your
+ retirement? I\'m fairly new to Wall Street, and it makes me think that you should have a
+ bit more control over your assets (I’m not super involved in stock picking, but I’ve heard
+ many times that governments can help their citizens), right? As another commenter has put
+ it: there\'s something called a market crash that could occur in the second world country
+ for most markets (I don\'t know how that would fit under US laws if I had done all of the
+ above. \n\n'
+},
+{'generated_text':
+ "Tesla is on track to go from 1.46 to 1.79 per cent growth in Q3 (the fastest pace so far
+ in the US), which will push down the share price.\n\nWhile the dividend could benefit Amazon’s
+ growth, earnings also aren’t expected to be high at all, the company's annual earnings could
+ be an indication that investors have a strong plan to boost sales by the end of the year if
+ earnings season continues.\n\nThe latest financials showed earnings as of the end of July,
+ followed by the earnings guidance from analysts at the Canadian Real Estate Association, which
+ showed that Amazon’s revenues were up over $1.8 Trillion, which is a far cry from what was
+ expected in early Q1.\n\nAmazon has grown the share price by as much as 1.6 percent since June
+ 2020. Analysts had predicted that earnings growth in the stock would drop to 0.36 per cent for
+ 2020, which would lead to Amazon’"
+}
+```
 
 ## Training procedure

+Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/finetune_distilgpt2_language_model_on_finance_dataset.ipynb)
+
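Between the raw `text` column and the Trainer run, a causal-LM fine-tune like this one typically tokenizes the corpus and regroups it into fixed-length blocks. Below is a minimal pure-Python sketch of that chunking step, with a toy batch standing in for tokenizer output; the `group_texts` helper and the block size of 128 are illustrative assumptions, not values confirmed by this card:

```python
# Sketch of the standard causal-LM preprocessing step: concatenate tokenized
# examples and split them into fixed-length blocks for training.
BLOCK_SIZE = 128  # illustrative; the actual notebook may use a different value

def group_texts(examples):
    # Concatenate every field (e.g. input_ids) across the batch,
    # then cut into BLOCK_SIZE chunks, dropping the ragged remainder.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    result = {
        k: [t[i : i + BLOCK_SIZE] for i in range(0, total_len, BLOCK_SIZE)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result

# toy batch: two "tokenized" examples of 100 and 60 token ids (160 total)
batch = {"input_ids": [list(range(100)), list(range(60))]}
chunks = group_texts(batch)
print(len(chunks["input_ids"]))     # 1 full block (160 // 128)
print(len(chunks["input_ids"][0]))  # 128
```

In practice this function is applied with `Dataset.map(batched=True)` after tokenization, which is why it receives batches of lists rather than single examples.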
 ### Training hyperparameters

 The following hyperparameters were used during training:
 
 - Transformers 4.30.2
 - Pytorch 2.0.1+cu117
 - Datasets 2.13.1
+- Tokenizers 0.13.3