jukofyork
/

creative-writer-v0.1-alfa-35b

@@ -116,7 +116,7 @@ but whereas `lora_A` looks at the ***input*** to the transformation for "additiv
 - The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` and `dataset_type = 'textfile'` as tdrussell's [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
 - I used the same `sequence_len = 8192` and `batch_size_tokens = 8192` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3), but since I only target `down_proj` in a very specific way; I doubt this will affect the useable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
 - I used `pipeline_stages = 2` and `"gradient_accumulation_steps": 16` to roughly match the "tokens-per-step" as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
-- I used a much lower learning-rate of `5e-6`, as the `5e-5` value used by [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) dropped the evaluation loss *far* too quickly; meaning that 90%+ of the samples weren't going to be used properly.
 - I set `lora_dropout = 0.0` as it doesn't really make sense to use with `epochs = 1`.
 - I left `weight_decay = 0.01` but not convinced this is really doing anything useful, and may actually even be harming the adaption of the early `down_proj` matrices where the gradient signal is likely to be much weaker.
 - I found via experimentation that setting `lora_rank` and `lora_alpha` to a very low value (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)), can cause the training to get stuck at [saddle-points](https://en.wikipedia.org/wiki/Saddle_point); particularly if using stock SGD instead of Adam.

 - The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` and `dataset_type = 'textfile'` as tdrussell's [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
 - I used the same `sequence_len = 8192` and `batch_size_tokens = 8192` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3), but since I only target `down_proj` in a very specific way; I doubt this will affect the useable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
 - I used `pipeline_stages = 2` and `"gradient_accumulation_steps": 16` to roughly match the "tokens-per-step" as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
+- I used a much lower learning-rate of `5e-6`, as the `5e-5` value used by [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) dropped the evaluation loss *far* too quickly (likely due to adapting `down_proj` only being "almost convex").
 - I set `lora_dropout = 0.0` as it doesn't really make sense to use with `epochs = 1`.
 - I left `weight_decay = 0.01` but not convinced this is really doing anything useful, and may actually even be harming the adaption of the early `down_proj` matrices where the gradient signal is likely to be much weaker.
 - I found via experimentation that setting `lora_rank` and `lora_alpha` to a very low value (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)), can cause the training to get stuck at [saddle-points](https://en.wikipedia.org/wiki/Saddle_point); particularly if using stock SGD instead of Adam.