lxyuan commited on
Commit
e185be9
1 Parent(s): ef309d9

Update README.md

Files changed (1)
  1. README.md +119 -8
README.md CHANGED
@@ -4,6 +4,14 @@ tags:
 model-index:
 - name: distilgpt2-finetuned-finance
   results: []
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -11,22 +19,125 @@ should probably proofread and complete it, then remove this comment. -->

 # distilgpt2-finetuned-finance

-This model was trained from scratch on an unknown dataset.

-## Model description

-More information needed

-## Intended uses & limitations

-More information needed

-## Training and evaluation data

-More information needed

 ## Training procedure

 ### Training hyperparameters

 The following hyperparameters were used during training:
@@ -45,4 +156,4 @@ The following hyperparameters were used during training:
 - Transformers 4.30.2
 - Pytorch 2.0.1+cu117
 - Datasets 2.13.1
-- Tokenizers 0.13.3
 
 model-index:
 - name: distilgpt2-finetuned-finance
   results: []
+license: apache-2.0
+datasets:
+- causal-lm/finance
+- gbharti/finance-alpaca
+- PaulAdversarial/all_news_finance_sm_1h2023
+- winddude/reddit_finance_43_250k
+language:
+- en
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You

 # distilgpt2-finetuned-finance

+This model is a fine-tuned version of distilgpt2, trained on a combination of four finance datasets:
+- [causal-lm/finance](https://huggingface.co/datasets/causal-lm/finance)
+- [gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)
+- [PaulAdversarial/all_news_finance_sm_1h2023](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023)
+- [winddude/reddit_finance_43_250k](https://huggingface.co/datasets/winddude/reddit_finance_43_250k)

+## Training and evaluation data

+The combined dataset can be reproduced with the following code:

+```python
+from datasets import load_dataset, concatenate_datasets
+
+# load the four finance datasets
+dataset_1 = load_dataset("gbharti/finance-alpaca")
+dataset_2 = load_dataset("PaulAdversarial/all_news_finance_sm_1h2023")
+dataset_3 = load_dataset("winddude/reddit_finance_43_250k")
+dataset_4 = load_dataset("causal-lm/finance")
+
+# create a column called text
+dataset_1 = dataset_1.map(
+    lambda example: {"text": example["instruction"] + " " + example["output"]},
+    num_proc=4,
+)
+dataset_1 = dataset_1.remove_columns(["input", "instruction", "output"])
+
+dataset_2 = dataset_2.map(
+    lambda example: {"text": example["title"] + " " + example["description"]},
+    num_proc=4,
+)
+dataset_2 = dataset_2.remove_columns(
+    ["_id", "main_domain", "title", "description", "created_at"]
+)
+
+dataset_3 = dataset_3.map(
+    lambda example: {
+        "text": example["title"] + " " + example["selftext"] + " " + example["body"]
+    },
+    num_proc=4,
+)
+dataset_3 = dataset_3.remove_columns(
+    [
+        "id",
+        "title",
+        "selftext",
+        "z_score",
+        "normalized_score",
+        "subreddit",
+        "body",
+        "comment_normalized_score",
+        "combined_score",
+    ]
+)
+
+dataset_4 = dataset_4.map(
+    lambda example: {"text": example["instruction"] + " " + example["output"]},
+    num_proc=4,
+)
+dataset_4 = dataset_4.remove_columns(["input", "instruction", "output"])
+
+# combine and split train/test sets
+combined_dataset = concatenate_datasets(
+    [
+        dataset_1["train"],
+        dataset_2["train"],
+        dataset_3["train"],
+        dataset_4["train"],
+        dataset_4["validation"],
+    ]
+)
+
+datasets = combined_dataset.train_test_split(test_size=0.2)
+```
+
+## Inference example
+
+```python
+from transformers import pipeline
+
+generator = pipeline(model="lxyuan/distilgpt2-finetuned-finance")
+
+generator(
+    "Tesla is",
+    pad_token_id=generator.tokenizer.eos_token_id,
+    max_new_tokens=200,
+    num_return_sequences=2,
+)
+
+>>>
+{'generated_text':
+ 'Tesla is likely going to have a "market crash" over 20 years - I believe I\'m just not
+ sure how this is going to affect the world. \n\nHowever, I would like to see this play out
+ as a global financial crisis. With US interest rates already high, a crash in global real
+ estate prices means that people are likely to feel pressure on assets that are less well
+ served by the assets the US government gives them. \n\nWould these things help you in your
+ retirement? I\'m fairly new to Wall Street, and it makes me think that you should have a
+ bit more control over your assets (I’m not super involved in stock picking, but I’ve heard
+ many times that governments can help their citizens), right? As another commenter has put
+ it: there\'s something called a market crash that could occur in the second world country
+ for most markets (I don\'t know how that would fit under US laws if I had done all of the
+ above. \n\n'
+},
+{'generated_text':
+ "Tesla is on track to go from 1.46 to 1.79 per cent growth in Q3 (the fastest pace so far
+ in the US), which will push down the share price.\n\nWhile the dividend could benefit Amazon’s
+ growth, earnings also aren’t expected to be high at all, the company's annual earnings could
+ be an indication that investors have a strong plan to boost sales by the end of the year if
+ earnings season continues.\n\nThe latest financials showed earnings as of the end of July,
+ followed by the earnings guidance from analysts at the Canadian Real Estate Association, which
+ showed that Amazon’s revenues were up over $1.8 Trillion, which is a far cry from what was
+ expected in early Q1.\n\nAmazon has grown the share price by as much as 1.6 percent since June
+ 2020. Analysts had predicted that earnings growth in the stock would drop to 0.36 per cent for
+ 2020, which would lead to Amazon’"
+}
+```
 
 ## Training procedure

+Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/finetune_distilgpt2_language_model_on_finance_dataset.ipynb)
+
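Between the raw `text` column and the Trainer run, a causal-LM fine-tune like this one typically tokenizes the corpus and regroups it into fixed-length blocks. Below is a minimal pure-Python sketch of that chunking step, with a toy batch standing in for tokenizer output; the `group_texts` helper and the block size of 128 are illustrative assumptions, not values confirmed by this card:

```python
# Sketch of the standard causal-LM preprocessing step: concatenate tokenized
# examples and split them into fixed-length blocks for training.
BLOCK_SIZE = 128  # illustrative; the actual notebook may use a different value

def group_texts(examples):
    # Concatenate every field (e.g. input_ids) across the batch,
    # then cut into BLOCK_SIZE chunks, dropping the ragged remainder.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    result = {
        k: [t[i : i + BLOCK_SIZE] for i in range(0, total_len, BLOCK_SIZE)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result

# toy batch: two "tokenized" examples of 100 and 60 token ids (160 total)
batch = {"input_ids": [list(range(100)), list(range(60))]}
chunks = group_texts(batch)
print(len(chunks["input_ids"]))     # 1 full block (160 // 128)
print(len(chunks["input_ids"][0]))  # 128
```

In practice this function is applied with `Dataset.map(batched=True)` after tokenization, which is why it receives batches of lists rather than single examples.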
 ### Training hyperparameters

 The following hyperparameters were used during training:
 
 - Transformers 4.30.2
 - Pytorch 2.0.1+cu117
 - Datasets 2.13.1
+- Tokenizers 0.13.3