license: cc-by-nc-4.0
tags:
  - not-for-all-audiences

Caution: This model is much more inclined to output adult content than its predecessor and was tuned on data geared at both SFW and NSFW roleplaying. For mature audiences only. Always exercise caution when using experimental finetunes and merges.

Llama-3.05-NT-Storybreaker-Ministral-70B

This is my first experiment in reverse-distillation: transferring the capabilities of a smaller model onto a larger model.

mistralai/Ministral-8B-Instruct-2410 has some very novel RP behaviors that make it an interesting choice as an RP model, but at the end of the day it's still just an 8B model. This model is an early attempt at instilling its positive qualities into a larger and more capable model.

Starting Model:

This model began as Llama-3.05-Nemotron-Tenyxchat-Storybreaker-70B

The Dataset:

I created a custom single-turn RP dataset for the model.

I started out with the infamous 'leaked undislop' dataset.

I used a script to format the conversations into single-turn SillyTavern style roleplaying prompts.

I used another script to run those prompts through Ministral.

Finally, using pattern matching, I removed a lot of the formatting from the original prompts in order to aid with generalization.
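The formatting and cleanup steps can be sketched roughly as follows. This is a minimal illustration, not the scripts actually used: the prompt layout, the function names `build_single_turn_prompt` and `strip_scaffold`, and the regex patterns are all assumptions made for the example, since the real patterns are not documented here.

```python
import re

# Hypothetical scaffold markers; the actual patterns used for the dataset are not documented.
SCAFFOLD_PATTERNS = [
    re.compile(r"^\[[^\]\n]*\][ \t]*$", re.MULTILINE),  # whole-line bracketed notes
    re.compile(r"^#{2,}[ \t].*$", re.MULTILINE),        # header lines such as "### Scenario:"
]


def build_single_turn_prompt(char_name, user_name, scenario, first_message):
    """Wrap the first human message in a simple roleplay frame (illustrative layout only)."""
    return (
        f"### Scenario:\n{scenario}\n"
        f"[System note: stay in character as {char_name}.]\n"
        f"{user_name}: {first_message}\n"
        f"{char_name}:"
    )


def strip_scaffold(prompt):
    """Remove scaffold lines, then collapse the blank lines they leave behind."""
    for pattern in SCAFFOLD_PATTERNS:
        prompt = pattern.sub("", prompt)
    return re.sub(r"\n{2,}", "\n", prompt).strip()


full_prompt = build_single_turn_prompt("Aria", "User", "A tavern at night.", "Hello there.")
training_prompt = strip_scaffold(full_prompt)
print(training_prompt)
```

In this sketch the fully formatted prompt would be the one sent to Ministral, while the stripped version would be paired with the response for training.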

The Training:

Using qlora-pipe, I ran a QLoRA on Nemotron-Tenyxchat-Storybreaker with the following notable parameters:

  • rank: 16
  • alpha: 32
  • dropout rate: 0.6
  • learning rate: 2e-6
  • epochs: 2
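For reference, under the standard LoRA convention the adapter update is scaled by alpha / rank, so the values above give a scale of 2. The sketch below only records the listed parameters and computes that factor. The class and field names are illustrative and are not qlora-pipe configuration keys, and it assumes plain LoRA scaling rather than a variant such as rsLoRA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    # Values copied from the list above; field names are illustrative only.
    rank: int = 16
    alpha: int = 32
    dropout: float = 0.6
    learning_rate: float = 2e-6
    epochs: int = 2

    @property
    def scale(self) -> float:
        # Standard LoRA scaling factor applied to the low-rank update.
        return self.alpha / self.rank


params = LoraHyperparams()
print(params.scale)  # 2.0
```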

High Dropout Rate Training:

The atypically high dropout rate was chosen after some unreleased experimentation inspired by the arXiv paper "Fine-tuning with Very Large Dropout" (Jianyu Zhang, Léon Bottou), which prescribes the use of a very high dropout rate (0.9 in their case) as a method of improving out-of-distribution performance.

Further discussion in various internet spaces regarding high-dropout training led to a recommendation of 0.6 as the ideal dropout rate for optimal fitting during finetuning.
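To make the effect of a 0.6 dropout rate concrete, here is a minimal sketch of standard inverted dropout in plain Python. It is a generic illustration and not the qlora-pipe implementation: during training each activation is zeroed with probability p and the survivors are rescaled by 1 / (1 - p), which is 2.5 at p = 0.6, so the expected value is unchanged.

```python
import random


def dropout(values, p=0.6, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Dropout is a no-op at inference time.
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # rng.random() is uniform on [0, 1), so a value is kept with probability 1 - p.
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]


activations = [1.0] * 10
print(dropout(activations, p=0.6, rng=random.Random(0)))
```

At this rate roughly 60% of the values reaching the adapter are dropped on any given step, which is why the setting is described as atypically high.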

Merging:

The LoRA adapter was merged into the original model, and the adapted model was then SLERP-merged back onto the original model at a 40/60 ratio in order to blend the new behavior with the old.
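As a rough illustration of the SLERP step, the sketch below interpolates between two flattened weight vectors along the arc between them rather than along a straight line. It is a generic formulation, not the exact merge tooling used here. Merges of this kind are typically applied tensor by tensor, and which model the 40 and the 60 refer to is not specified above, so the value of `t` in the example is an assumption.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation from v0 (t=0) to v1 (t=1)."""
    def lerp():
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]

    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 < eps or norm1 < eps:
        return lerp()  # angle is undefined for a zero vector

    # Angle between the two weight vectors, clamped for numerical safety.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return lerp()  # vectors are (anti)parallel; fall back to linear interpolation

    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Toy vectors standing in for one tensor from each model (made-up values).
original = [1.0, 0.0]
adapted = [0.0, 1.0]
# Assumption: t=0.4 moves 40% of the way from the original toward the adapted weights.
print(slerp(0.4, original, adapted))
```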

Results:

The resulting model can be very 'sloppy' at higher temperatures due to the mating of the different 'slop' between Llama-3 and Ministral.

As a result, the following comparison on a single-turn SillyTavern roleplay test is presented for subjective judgment.

CAUTION: Mature Language/Themes

The comparison utilizes deterministic sampling to better illustrate the model differences.

I tried to keep the prompt templates for Ministral and Llama-3 (used for both versions of Storybreaker) as close as possible, but an exact match is not possible due to structural differences between Mistral prompt formatting and Llama-3 prompt formatting.

While the results are entirely based on subjective preference, I find the flow of action in the Ministral-infused model to be less of a short loop, as in the original model, and more of a continuously advancing flow of actions, as in the Ministral model.

Surprisingly, I feel the Ministral-infused model also improves in both characterization and following the flow of the original scenario. It's much less baited into NSFW output by the jailbreak that is built into the prompt template.

Overall, the model can be rather stingy with EOT tokens when used at higher temperatures and rather rigid at lower temperatures. Lowering the temperature definitely reduces the 'slop' overall.

Possible Avenues for Improvement:

What I was able to do with the training was greatly limited by the VRAM limitations of my home setup. I feel the results could probably be improved with both a higher LoRA rank and higher sequence length.

The original dataset had over 8000 entries, but about 25% of those had to be dropped during preprocessing on account of not fitting within the allotted sequence length.
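A length filter of this kind might look like the sketch below. The sequence limit, the toy entries, and the whitespace-based `count_tokens` stand-in are all assumptions for illustration. The real preprocessing would count tokens with the model's tokenizer, and the actual sequence length is not stated here.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def filter_by_length(entries, max_tokens, token_counter=count_tokens):
    """Keep entries whose prompt plus response fit within max_tokens; report how many were dropped."""
    kept = [
        entry for entry in entries
        if token_counter(entry["prompt"]) + token_counter(entry["response"]) <= max_tokens
    ]
    return kept, len(entries) - len(kept)


# Toy data with a hypothetical limit of 8 "tokens".
entries = [
    {"prompt": "User: Hello there.", "response": "Aria: Well met, traveler."},
    {"prompt": "User: " + "word " * 20, "response": "Aria: That is a lot of words."},
]
kept, dropped = filter_by_length(entries, max_tokens=8)
print(len(kept), dropped)
```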

If I had unlimited VRAM, I would probably choose to do a full finetune on a much broader variety of context lengths, as the dataset made for this experiment primarily simulated the model's response to the first human message in the conversation.

More training epochs could also potentially improve results. 2 epochs was chosen because it was what I could finish in a single day.

So far I've found it to be a fun model to roleplay with and definitely worth sharing, but I can't guarantee satisfactory results outside of the scope of training.