|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Gryphe/Sonnet3.5-Charcard-Roleplay |
|
- Doctor-Shotgun/no-robots-sharegpt |
|
language: |
|
- en |
|
base_model: |
|
- HuggingFaceTB/SmolLM2-135M |
|
base_model_relation: finetune |
|
pipeline_tag: text-generation |
|
--- |
|
*Just a bit of experimentation and practice to see whether a 135-million-parameter model can be finetuned for roleplay.*
|
|
|
# SmolRP-135M-v0.9! |
|
|
|
A finetune of [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M), |
|
first on the [Doctor-Shotgun/no-robots-sharegpt](https://huggingface.co/datasets/Doctor-Shotgun/no-robots-sharegpt), |
|
then on *some* of the [Gryphe/Sonnet3.5-Charcard-Roleplay](https://huggingface.co/datasets/Gryphe/Sonnet3.5-Charcard-Roleplay). |
|
|
|
## Why it's made |
|
|
|
I have always been fascinated by small models with fewer than 1 billion parameters.
Ever since the [KoboldAI/OPT-350M-Nerys-v2](https://huggingface.co/KoboldAI/OPT-350M-Nerys-v2) model, I had not seen any other roleplay model that small.
Since then, I have discovered other small models such as [Lite-Oute-1-65M](https://huggingface.co/OuteAI/Lite-Oute-1-65M) and [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M).
However, at the time, I didn't know how to finetune.
|
|
|
Thankfully, a friend of mine in real life shared his way of finetuning with me, and I decided to follow it.
At first, I wanted to finetune [AMD-Llama-135m](https://huggingface.co/amd/AMD-Llama-135m), since no instruct finetune of it existed yet.
Then, when I saw [HuggingFaceTB](https://huggingface.co/HuggingFaceTB) release [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M), I thought it would be a better candidate, given its claimed 2-trillion-token pretraining.
And so here it is: a finetune of that model.
|
|
|
## How it's made |
|
|
|
First, I take the SmolLM2-135M base model.
The base model already has the ChatML tokens added in, so ChatML is the instruct template used for this finetune.
The model is finetuned on the first dataset for 3 epochs.
That dataset is filtered to only include rows that fit within a 2k context, leaving 9973 out of 10000 rows.
|
|
|
For the second dataset, I found that almost all of the rows would not fit within a 2k context.
So the filter is set to 4k for that dataset, leaving only 3730 out of 9736 rows.
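For anyone curious, the length filter is conceptually just a token count over the ChatML-rendered conversation. Below is a rough sketch of the idea, not the exact script I used; the ShareGPT-style `conversations` column layout (`from`/`value` keys) and the threshold constant are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
dataset = load_dataset("Doctor-Shotgun/no-robots-sharegpt", split="train")

MAX_TOKENS = 2048  # 4096 for the Charcard dataset


def fits_in_context(row):
    # Render the ShareGPT-style turns into ChatML and count tokens.
    # (For real training the "from" roles would be mapped to ChatML roles;
    # for a length filter this approximation is close enough.)
    text = "".join(
        f"<|im_start|>{turn['from']}\n{turn['value']}<|im_end|>\n"
        for turn in row["conversations"]
    )
    return len(tokenizer(text).input_ids) <= MAX_TOKENS


filtered = dataset.filter(fits_in_context)
print(f"{len(filtered)} of {len(dataset)} rows kept")
```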
|
I also wrote a custom collator class for completion-only training, to make sure the model learns to generate conversations properly.
|
|
|
Consider this conversation example: |
|
```
...<|im_end|>\n<|im_start|>assistant\nMatilda The Koala: bla bla bla
                                                         ^ <- the model should only be trained to generate from this point.
                                      ^ <- but the built-in DataCollatorForCompletionOnlyLM class only allows us to train from this point.
```
|
In this case, the model shouldn't learn to generate the character name, since most front-ends like SillyTavern already add the character names themselves.
The custom collator class tries to fix that by extending the mask up to and including the `:` character.
However, this method is not completely bullet-proof.
Suppose the dataset contains a chat message that doesn't start with the character name but has a `:` somewhere unrelated in the same message: the collator will still see that `:` and mask all the text before it.
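The collator itself isn't included in this repo, but the gist can be sketched as a thin subclass of trl's `DataCollatorForCompletionOnlyLM`: let the parent mask everything up to the response template, then extend each response's mask up to and including the first `:`. The class name and the exact masking loop below are illustrative, not my actual code.

```python
from trl import DataCollatorForCompletionOnlyLM


class CharNameMaskingCollator(DataCollatorForCompletionOnlyLM):
    """Illustrative sketch: also mask the 'Character Name:' prefix of each response."""

    def torch_call(self, examples):
        # The parent already sets labels to ignore_index for everything
        # except the assistant responses.
        batch = super().torch_call(examples)
        # Assumes ":" shows up as its own token, which depends on the tokenizer's merges.
        colon_ids = set(self.tokenizer.encode(":", add_special_tokens=False))

        for labels in batch["labels"]:
            i = 0
            while i < len(labels):
                if labels[i].item() == self.ignore_index:
                    i += 1
                    continue
                # Start of an unmasked (assistant) span: mask up to and including the first ":".
                while i < len(labels) and labels[i].item() != self.ignore_index:
                    token_id = labels[i].item()
                    labels[i] = self.ignore_index
                    i += 1
                    if token_id in colon_ids:
                        break  # an unrelated earlier ":" would also stop here (the caveat above)
                # Leave the rest of this span unmasked.
                while i < len(labels) and labels[i].item() != self.ignore_index:
                    i += 1
        return batch
```

It would be plugged in the same way as the built-in class, e.g. `CharNameMaskingCollator(response_template="<|im_start|>assistant\n", tokenizer=tokenizer)` passed to the trainer as its `data_collator`.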
|
|
|
With this method, the model is finetuned on the second dataset for only 2 epochs. |
|
|
|
All training is done on my own personal computer, on a single Nvidia RTX 3080 with 10 GB of VRAM. *\*mutters\** should have gotten myself a 3090 instead...
|
|
|
## Things to note |
|
|
|
- This model contains a mere 135 million parameters, so it's a terrible roleplay model.
- One of the datasets contains NSFW data, but I doubt the model is smart enough to generate such things, so I don't think it would be a big problem in SFW roleplays.
- Feedback is welcome!