euclaise committed
Commit e8c1857
1 Parent(s): e05eb85

Update README.md

Files changed (1): README.md (+9 -16)

README.md CHANGED
@@ -8,29 +8,20 @@ datasets:
 - euclaise/TinyCoT
 - euclaise/reddit-instruct
 - sablo/oasst2_curated
+- euclaise/SciCoT
 metrics:
 - accuracy
 ---

-**Note: A bug has been discovered in the Memphis-CoT training code, and the model is currently being re-trained. Please do not make quants or merges or anything else until I have retrained.**
+*Now with a training bug fixed!*

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)

-Memphis-CoT is a finetune of [StableLM 3b 4e1t](stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).
+Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT) and [SciCoT](https://huggingface.co/datasets/euclaise/SciCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).

 **Memphis was trained *only* on human data! No GPT generations here.**

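For readers who want to reproduce the data mix, a minimal sketch using the `datasets` library follows. The split names, the `post_title` column, and the shuffle seed are assumptions, not the exact Memphis preprocessing:

```python
# Sketch: assembling the training mix described above (assumptions noted).
from datasets import load_dataset

tinycot = load_dataset("euclaise/TinyCoT", split="train")
scicot = load_dataset("euclaise/SciCoT", split="train")
oasst2 = load_dataset("sablo/oasst2_curated", split="train")

# reddit-instruct: subset to 5000 examples, excluding posts with brackets
# in the title (per the README). The "post_title" column name is a guess.
reddit = load_dataset("euclaise/reddit-instruct", split="train")
reddit = reddit.filter(
    lambda ex: "[" not in ex["post_title"] and "]" not in ex["post_title"]
)
reddit = reddit.shuffle(seed=0).select(range(5000))
```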
@@ -42,12 +33,13 @@ I finetuned the model using an iterative rationale-bootstrapping procedure inspired by STaR and SPIN:

 First, I finetuned the model on all the datasets using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914), for 2 epochs.
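As an illustration of these two ingredients (not the actual training code, which lives in the author's supertrainer2000), here is a minimal PyTorch sketch. The `eta` and `alpha` defaults and the detached token-probability weight are assumptions based on the linked papers:

```python
import torch
import torch.nn.functional as F

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """NEFTune: add uniform noise to input embeddings during training,
    scaled by alpha / sqrt(seq_len * hidden_dim)."""
    seq_len, dim = embeds.shape[-2], embeds.shape[-1]
    scale = alpha / (seq_len * dim) ** 0.5
    return embeds + torch.empty_like(embeds).uniform_(-scale, scale)

def mixce_loss(logits: torch.Tensor, labels: torch.Tensor, eta: float = 0.5):
    """MixCE via its self-reinforced approximation: per-token forward CE,
    reweighted by eta + (1 - eta) * q(gold token)."""
    logprobs = F.log_softmax(logits, dim=-1)
    gold = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    weight = eta + (1.0 - eta) * gold.exp().detach()  # treated as a constant
    return -(weight * gold).mean()
```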

-I then performed the following steps 4 times:
+I then performed the following steps 3 times:
 1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra values are discarded, such that each correct and incorrect response is unique.
-2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs incorrect generated response. A standard CE loss over the ground-truth was included to prevent excessive drift.
+2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs. incorrect generated responses. Additionally, a standard CE loss over the chosen completion was included.

 This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).
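To make step 2 concrete, here is a sketch of one way the pairwise objective could look: length-normalized sequence log-probabilities, a PRO-style ranking term (which for a single pair reduces to a softplus margin), and a CE term on the chosen completion. The combination weight mirrors the "rank loss weight" listed under the training details below; the exact formulation in the real code may differ:

```python
import torch
import torch.nn.functional as F

def seq_logprob(logits, labels, completion_mask):
    """Length-normalized log-probability of a completion.
    `completion_mask` is 1 on response tokens, 0 on prompt/padding."""
    logprobs = F.log_softmax(logits, dim=-1)
    tok = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (tok * completion_mask).sum(-1) / completion_mask.sum(-1)

def rank_step_loss(score_correct, score_incorrect, ce_chosen,
                   rank_weight: float = 0.25):
    """Pairwise ranking plus CE on the chosen completion. With two
    candidates, -log softmax(correct) == softplus(incorrect - correct)."""
    rank = F.softplus(score_incorrect - score_correct).mean()
    return rank_weight * rank + ce_chosen
```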

+To prevent excessive drift, I kept the model weights as a moving average: after each generate+train cycle, I interpolated between the previous model weights and the updated weights using spherical linear interpolation (SLERP), with an interpolation factor of 0.99.

 ## Prompt formats

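A minimal sketch of that SLERP moving average, applied per parameter tensor; the direction of the 0.99 factor (toward the updated weights) is an assumption:

```python
import torch

def slerp(prev: torch.Tensor, new: torch.Tensor, t: float = 0.99) -> torch.Tensor:
    """Spherical interpolation between two weight tensors (flattened)."""
    a, b = prev.flatten().float(), new.flatten().float()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + 1e-12), -1.0, 1.0)
    omega = torch.arccos(cos)
    so = torch.sin(omega)
    if so.abs() < 1e-8:  # nearly parallel: plain lerp is numerically stable
        out = torch.lerp(a, b, t)
    else:
        out = (torch.sin((1 - t) * omega) / so) * a \
            + (torch.sin(t * omega) / so) * b
    return out.view_as(prev).to(prev.dtype)

# After each generate+train cycle:
# averaged = {k: slerp(prev_state[k], curr_state[k]) for k in curr_state}
```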
@@ -81,7 +73,8 @@ The format for TinyCoT was:
 | [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% | 11.01% |
 | [OpenLLaMA 7B v2 open-instruct](https://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
 | [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | possibly contaminated (45.72%) | **33.31%** | 0.91% |
-| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **13.8%** | *26.24%* | **38.24%** |
+| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **18.8%** | *27.22%* | **36.92%** |
+
 *5-shot, as performed automatically by LM Evaluation Harness bbh_cot_fewshot even with num_fewshot=0

 Memphis outperforms other primarily-human-data models that are over twice its size, along with SFT models of its size, and trades with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
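For reference, scoring a model on the harness task mentioned in the footnote might look like the following; the Python entry point is current lm-evaluation-harness convention and may not match the exact invocation used for this table:

```python
import lm_eval

# Evaluate Memphis-CoT on BBH (CoT prompting) with lm-evaluation-harness.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=euclaise/Memphis-CoT-3B,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],
    num_fewshot=0,  # the harness still applies its built-in CoT few-shot prompts
)
print(results["results"])
```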
@@ -121,7 +114,7 @@ For the rank finetuning:
 - Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
 - Lambda of 0.01
 - LR of 5e-7
-- Rank loss weight of 5
+- Rank loss weight of 0.25
 - Sequence length of 1024
 - Cosine schedule with 10% warmup
 - Frozen embeddings
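The same rank-finetuning settings as a plain dict, for readers skimming the list; this is illustrative only and not supertrainer2000's actual config schema:

```python
# Rank-finetuning settings from the list above (illustrative shape only).
rank_config = {
    "optimizer": "Adalite",    # supertrainer2000 defaults unless noted
    "lambda": 0.01,
    "lr": 5e-7,
    "rank_loss_weight": 0.25,  # lowered from 5 in this commit
    "max_seq_len": 1024,
    "schedule": "cosine",
    "warmup_ratio": 0.10,
    "freeze_embeddings": True,
}
```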