First iteration of the default generator LoRA for MiniHF. This model still functions as a base model while writing more coherent text.
Training procedure
This model was trained starting from the MiniHF Mistral SFT evaluator. It was created using the MiniHF Reinforcement Learning From AI Feedback pipeline:
accelerate launch rlaif_generator.py --resume minihf_evaluator_mistral_7b_v0.1 --output-path mistral_h_eval --kl-weight 1.0 --constitution hermes/hermes_constitution.txt --prompts hermes/hermes_prompts.txt --length 256 --batch-size 4 --grad-accum-steps 8
The tuning script was modified to use the AdamW optimizer with weight decay:
opt = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2, betas=(0.9, 0.98))
This weight decay is based on the observation that RL tuning mode collapse can be undone by interpolating the weights of the base model with those of the RL-tuned model. Here the specific recipe was to start from the MiniHF SFT evaluator, then apply weight decay and the KL penalty towards the base model weights to inject entropy back into the policy.
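The interpolation idea mentioned above can be illustrated with a minimal sketch. This is not code from the MiniHF pipeline: the function name `interpolate_weights`, the mixing coefficient `alpha`, and the toy scalar "weights" are all assumptions made for illustration. The sketch only shows what a linear blend between a base checkpoint and an RL-tuned checkpoint looks like.

```python
from typing import Any, Mapping


def interpolate_weights(base: Mapping[str, Any], tuned: Mapping[str, Any], alpha: float) -> dict:
    """Return (1 - alpha) * base + alpha * tuned for every parameter name.

    alpha = 0.0 gives back the base weights, alpha = 1.0 the RL-tuned weights.
    (Hypothetical helper, not part of MiniHF.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    # The arithmetic works for floats, NumPy arrays, or floating-point tensors alike.
    return {name: (1.0 - alpha) * base[name] + alpha * tuned[name] for name in base}


# Toy example with scalar "weights" standing in for real parameter tensors.
base_weights = {"layer.weight": 1.0, "layer.bias": 0.0}
tuned_weights = {"layer.weight": 3.0, "layer.bias": 2.0}
mixed = interpolate_weights(base_weights, tuned_weights, alpha=0.25)
```

Because the values only need to support multiplication and addition, the same function could in principle be applied to two state dicts of floating-point tensors. Smaller values of `alpha` keep the result closer to the base model.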
Prompt Bank and Constitution
The prompt bank used during tuning is in the hermes_prompts.txt file found in this repo, and the constitution is in hermes_constitution.txt.
Configuration
The following bitsandbytes quantization config was used during training:
- quant_method: bitsandbytes
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: float16
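For readers who want to reproduce these settings when loading the model, the values listed above map onto the keyword arguments of `BitsAndBytesConfig` in the transformers library. The following is a sketch under that assumption. The names `BNB_KWARGS` and `make_bnb_config` are hypothetical, and the `quant_method` entry is left out because it is not passed as a constructor argument here.

```python
# Keyword arguments mirroring the values listed above.
BNB_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "float16",  # the string form is resolved to a torch dtype
}


def make_bnb_config():
    """Build a BitsAndBytesConfig (hypothetical helper).

    The import is deferred so this module can be read without transformers
    installed; calling the function requires transformers and torch.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**BNB_KWARGS)
```

The resulting object would typically be passed as `quantization_config` when loading the base model with `from_pretrained`, before attaching the LoRA weights.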
Framework versions
PEFT 0.5.0