Triangle104 committed on
Commit b111a6c · verified · 1 Parent(s): b50d873

Update README.md

Files changed (1)
  1. README.md +133 -0
README.md CHANGED
@@ -15,6 +15,139 @@ tags:
  This model was converted to GGUF format from [`allura-org/G2-9B-Sugarquill-v0`](https://huggingface.co/allura-org/G2-9B-Sugarquill-v0) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
  Refer to the [original model card](https://huggingface.co/allura-org/G2-9B-Sugarquill-v0) for more details on the model.

+ ---
+ Model details:
+ -
+ An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web. I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even so it is already more creative and descriptive than its base model, without becoming too silly. It doesn't seem to have degraded much in terms of core abilities either. It should be usable for both RP and raw-completion storywriting. I originally planned to use this in a merge, but I feel this model is interesting enough to be released on its own as well.
+
+ The model was trained by Auri.
+
+ Dedicated to Cahvay, who has wanted a Gemma finetune from me for months now, and to La Rata, who loves storywriter models.
+
+ GGUFs by Prodeus: https://huggingface.co/allura-org/G2-9B-Sugarquill-v0-GGUF
+
+ Training notes
+
+ This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA. I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. This seems to be directly related to Gemma, as my training setup worked flawlessly for Qwen. Thanks to Kearm for helping set up LLaMA-Factory (LF) on that node, and to Featherless for providing it for EVA-Qwen2.5 (and, unknowingly, this model lol) training.
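As a rough sanity check on the run size described above, the numbers in this card can be combined into an optimizer-step estimate. This is a back-of-the-envelope sketch, not something taken from the training logs: it assumes the ~18.7M tokens refer to one pass over the 10k rows, that packing fills every sequence to `cutoff_len`, and it takes the batch size, gradient accumulation, GPU count and `val_size` from the training config in this card. The function name `estimate_steps` is made up for illustration.

```python
import math


def estimate_steps(total_tokens, epochs, num_gpus, per_device_batch,
                   grad_accum, cutoff_len, val_size):
    """Rough optimizer-step estimate for a packed pretraining run.

    Assumes every packed sequence is exactly `cutoff_len` tokens long and
    that `val_size` is the fraction of data held out for evaluation.
    """
    # Sequences consumed per optimizer step across all GPUs.
    sequences_per_step = per_device_batch * grad_accum * num_gpus
    tokens_per_step = sequences_per_step * cutoff_len
    # Tokens left for training after the eval split is removed.
    train_tokens = total_tokens * (1.0 - val_size)
    steps_per_epoch = math.ceil(train_tokens / tokens_per_step)
    return sequences_per_step, tokens_per_step, steps_per_epoch * epochs


# Values from this card: ~18.7M tokens, 2 epochs, 8 GPUs, batch size 1,
# 8 gradient-accumulation steps, cutoff_len 8192, val_size 0.1.
seqs, toks, steps = estimate_steps(18_700_000, 2, 8, 1, 8, 8192, 0.1)
print(seqs, toks, steps)
```

Under those assumptions the run works out to 64 packed sequences (about 524k tokens) per optimizer step and 66 steps in total across both epochs. The real count may differ, since packing is never perfectly tight.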
+
+ Format
+
+ The model responds to Gemma instruct formatting, exactly like its base model.
+
+ <bos><start_of_turn>user
+ {user message}<end_of_turn>
+ <start_of_turn>model
+ {response}<end_of_turn><eos>
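The template above can be assembled programmatically. The helper below is a minimal sketch and not part of the original card; `build_gemma_prompt` and its parameters are hypothetical names. It leaves the final model turn open so the model writes the response. Some runtimes prepend `<bos>` on their own, so `include_bos` can be switched off to avoid a doubled token.

```python
def build_gemma_prompt(turns, add_generation_prompt=True, include_bos=True):
    """Render (role, text) pairs in Gemma instruct format.

    `turns` is a list of (role, text) tuples, where role is "user" or "model".
    """
    parts = ["<bos>"] if include_bos else []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn and let the model complete it.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "Write a short story about a lighthouse.")])
print(prompt)
```

For raw-completion storywriting the template can be skipped entirely and the story opening passed as plain text.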
+
+ Training config
+
+ See LLaMA-Factory config
+ ### Model
+ model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
+ #ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
+ #ref_model_quantization_bit: 8 # 8 or 4
+
+ ### Method
+ stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
+ do_train: true
+ finetuning_type: lora # full, freeze or lora
+ lora_target: all
+ #pref_beta: 0.1
+ #pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge
+
+ ### Reward model
+ #reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
+ #reward_model_type: full # full, lora, api
+ #reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
+ #reward_model_quantization_bit: 8 # 4 or 8
+
+ ### Freeze
+ #freeze_trainable_layers: # The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive number means n last layers to train, negative - n first layers to train
+ #freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
+ #freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate
+
+ ### LoRA
+ #loraplus_lr_ratio: 8.0
+ #loraplus_lr_embedding:
+ use_dora: false
+ use_rslora: true
+ lora_rank: 64 # 64 is optimal for most trains on instruct, if training on base - use rslora or dora
+ lora_alpha: 32
+ lora_dropout: 0.05
+ #pissa_init: true
+ #pissa_iter: 16
+ #pissa_convert: true
+
+ ### QLoRA
+ quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
+ quantization_method: hqq # bitsandbytes or hqq
+
+ ### DeepSpeed
+ deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json which is required for HQQ on multigpu
+
+ ### Dataset
+ dataset: sugarquill-10k # define in data/dataset_info.json
+ cutoff_len: 8192
+ max_samples: 10000
+ overwrite_cache: true
+ preprocessing_num_workers: 16
+ #template: chatml
+
+ ### Output
+ output_dir: saves/gemma/lora/sugarquill-1
+ logging_steps: 3
+ save_steps: 50
+ plot_loss: true
+ compute_accuracy: true
+ overwrite_output_dir: true
+
+ ### Train
+ per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
+ gradient_accumulation_steps: 8
+ learning_rate: 3.0e-5
+ optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
+ num_train_epochs: 2.0
+ lr_scheduler_type: cosine # cosine, constant or linear
+ warmup_ratio: 0.05
+ bf16: true
+ ddp_timeout: 180000000
+ packing: true
+ max_grad_norm: 1.0
+
+ ### Opts
+ flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fall back to eager
+ enable_liger_kernel: true # Pretty much a must-have if it works
+ #use_unsloth: true # May not work with multigpu idk
+ #use_adam_mini: true # Comment out optim if using this
+
+ ### Eval
+ val_size: 0.1
+ per_device_eval_batch_size: 1
+ eval_strategy: steps
+ eval_steps: 0.05
+
+ ### Misc
+ include_num_input_tokens_seen: true
+ ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
+ upcast_layernorm: true
+
+ ### Inference for PPO
+ #max_new_tokens: 512
+ #temperature: 0.8
+ #top_k: 0
+ #top_p: 0.8
+
+ ### Tracking
+ report_to: wandb # or tensorboard or mlflow | LOGIN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
+ run_name: G2-9B-Sugarquill-1
+
+ ### Merge Adapter
+ #export_dir: models/G2-9B-Sugarquill
+ #export_size: 4
+ #export_device: gpu
+ #export_legacy_format: false
+
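The config above sets `use_rslora: true` with `lora_rank: 64` and `lora_alpha: 32`. For readers unfamiliar with rsLoRA, the practical difference is the factor applied to the adapter update: standard LoRA scales it by `alpha / rank`, while rank-stabilized LoRA scales it by `alpha / sqrt(rank)`. The snippet below is an illustrative sketch of that arithmetic only, not code from LLaMA-Factory; `lora_scaling` is a hypothetical helper name.

```python
import math


def lora_scaling(alpha, rank, use_rslora=False):
    """Return the multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        # Rank-stabilized LoRA: scale by alpha / sqrt(rank).
        return alpha / math.sqrt(rank)
    # Standard LoRA: scale by alpha / rank.
    return alpha / rank


# Values from the config in this card.
print(lora_scaling(32, 64))                   # standard LoRA
print(lora_scaling(32, 64, use_rslora=True))  # rsLoRA
```

With these values the rsLoRA factor is 4.0 versus 0.5 for standard LoRA, so the same `lora_alpha` yields an eight times larger effective update at rank 64.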
+ ---
  ## Use with llama.cpp
  Install llama.cpp through brew (works on Mac and Linux)