|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Gryphe/Sonnet3.5-Charcard-Roleplay |
|
- Doctor-Shotgun/no-robots-sharegpt |
|
language: |
|
- en |
|
base_model: |
|
- HuggingFaceTB/SmolLM2-135M |
|
base_model_relation: finetune |
|
pipeline_tag: text-generation |
|
--- |
|
*Just a bit of experimentation and practice to see whether a 135-million-parameter model can be finetuned for roleplay.*
|
|
|
# SmolRP-135M-v0.9! |
|
|
|
A finetune of [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M), |
|
first on the [Doctor-Shotgun/no-robots-sharegpt](https://huggingface.co/datasets/Doctor-Shotgun/no-robots-sharegpt), |
|
then on *some* of the [Gryphe/Sonnet3.5-Charcard-Roleplay](https://huggingface.co/datasets/Gryphe/Sonnet3.5-Charcard-Roleplay). |
|
|
|
## Why it's made |
|
|
|
I have always been fascinated by small models with fewer than 1 billion parameters.
Ever since the [KoboldAI/OPT-350M-Nerys-v2](https://huggingface.co/KoboldAI/OPT-350M-Nerys-v2) model, I had not seen any other roleplay model that small.
Since then, I have discovered other small models such as [Lite-Oute-1-65M](https://huggingface.co/OuteAI/Lite-Oute-1-65M) and [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M).
However, at the time, I didn't know how to finetune.
|
|
|
Thankfully, a friend of mine in real life shared his way of finetuning with me, and I decided to follow it.
At first, I wanted to finetune [AMD-Llama-135m](https://huggingface.co/amd/AMD-Llama-135m), since no instruct finetune of it existed yet.
Then, when I saw [HuggingFaceTB](https://huggingface.co/HuggingFaceTB) release [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M), I thought it would be a better candidate, given its claimed 2-trillion-token pretraining.
And so here it is: a finetune of that model.
|
|
|
## How it's made |
|
|
|
First, I take the SmolLM2-135M base model.
The base model already has the ChatML tokens added in, so ChatML is the instruct template used for this finetune.
The model is finetuned on the first dataset for 3 epochs.
That dataset is filtered to only include rows that fit within a 2k context, leaving 9973 out of 10000 rows.
|
|
|
For the second dataset, I found that almost all of the rows would not fit within a 2k context.
So the filter is set to 4k for that dataset, leaving only 3730 out of 9736 rows.
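For anyone curious, the length filter is conceptually just a token count over the ChatML-rendered conversation. Below is a rough sketch of the idea, not the exact script I used; the ShareGPT-style `conversations` column layout (`from`/`value` keys) and the threshold constant are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
dataset = load_dataset("Doctor-Shotgun/no-robots-sharegpt", split="train")

MAX_TOKENS = 2048  # 4096 for the Charcard dataset


def fits_in_context(row):
    # Render the ShareGPT-style turns into ChatML and count tokens.
    # (For real training the "from" roles would be mapped to ChatML roles;
    # for a length filter this approximation is close enough.)
    text = "".join(
        f"<|im_start|>{turn['from']}\n{turn['value']}<|im_end|>\n"
        for turn in row["conversations"]
    )
    return len(tokenizer(text).input_ids) <= MAX_TOKENS


filtered = dataset.filter(fits_in_context)
print(f"{len(filtered)} of {len(dataset)} rows kept")
```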
|
I also wrote a custom collator class for completion-only training, to make sure the model learns to generate conversations properly.
|
|
|
Consider this conversation example: |
|
```
...<|im_end|>\n<|im_start|>assistant\nMatilda The Koala: bla bla bla
                                                         ^ <- the model should only be trained to generate from this point.
                                      ^ <- but the built-in DataCollatorForCompletionOnlyLM class only allows us to train from this point.
```
|
In this case, the model shouldn't learn to generate the character name, since most front-ends like SillyTavern already add the character names themselves.
The custom collator class tries to fix that by extending the mask up to and including the `:` character.
However, this method is not completely bullet-proof.
Suppose the dataset contains a chat message that doesn't start with the character name but has a `:` somewhere unrelated in the same message: the collator will still see that `:` and mask all the text before it.
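The collator itself isn't included in this repo, but the gist can be sketched as a thin subclass of trl's `DataCollatorForCompletionOnlyLM`: let the parent mask everything up to the response template, then extend each response's mask up to and including the first `:`. The class name and the exact masking loop below are illustrative, not my actual code.

```python
from trl import DataCollatorForCompletionOnlyLM


class CharNameMaskingCollator(DataCollatorForCompletionOnlyLM):
    """Illustrative sketch: also mask the 'Character Name:' prefix of each response."""

    def torch_call(self, examples):
        # The parent already sets labels to ignore_index for everything
        # except the assistant responses.
        batch = super().torch_call(examples)
        # Assumes ":" shows up as its own token, which depends on the tokenizer's merges.
        colon_ids = set(self.tokenizer.encode(":", add_special_tokens=False))

        for labels in batch["labels"]:
            i = 0
            while i < len(labels):
                if labels[i].item() == self.ignore_index:
                    i += 1
                    continue
                # Start of an unmasked (assistant) span: mask up to and including the first ":".
                while i < len(labels) and labels[i].item() != self.ignore_index:
                    token_id = labels[i].item()
                    labels[i] = self.ignore_index
                    i += 1
                    if token_id in colon_ids:
                        break  # an unrelated earlier ":" would also stop here (the caveat above)
                # Leave the rest of this span unmasked.
                while i < len(labels) and labels[i].item() != self.ignore_index:
                    i += 1
        return batch
```

It would be plugged in the same way as the built-in class, e.g. `CharNameMaskingCollator(response_template="<|im_start|>assistant\n", tokenizer=tokenizer)` passed to the trainer as its `data_collator`.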
|
|
|
With this method, the model is finetuned on the second dataset for only 2 epochs. |
|
|
|
All training is done on my own personal computer, on a single Nvidia RTX 3080 with 10 GB of VRAM. *\*mutters\** should have gotten myself a 3090 instead...
|
|
|
## Things to note |
|
|
|
- This model contains a mere 135 million parameters, so it's a terrible roleplay model.
- One of the datasets contains NSFW data, but I doubt the model is smart enough to generate such things, so I don't think it would be a big problem in SFW roleplays.
- Feedback is welcome!