---
language:
- en
tags:
- testing
- llm
- rp
- discussion
---
|
|
|
# Why? What? TL;DR?
|
|
|
Simply put, I'm making my methodology for evaluating RP models public. While none of this is very scientific, it is consistent. I'm focusing on things I'm *personally* looking for in a model, like its ability to obey a character card and a system prompt accurately. Still, I think most of my tests are universal enough that other people might be interested in the results, or might want to run them on their own.
|
|
|
|
|
# Testing Environment
|
|
|
- Frontend is the staging version of Silly Tavern.
- Backend is the latest version of KoboldCPP for Windows using CUDA 12 (see the example launch command after this list).
- Using **CuBLAS** but **not using QuantMatMul (mmq)**.
- Fixed Seed for all tests: **123**
- **7-10B Models:**
  - All models are loaded in Q8_0 (GGUF).
  - **Flash Attention** and **ContextShift** enabled.
  - All models are extended to **16K context length** (auto RoPE from KCPP).
  - Response size set to 1024 tokens max.
- **11-15B Models:**
  - All models are loaded in Q4_K_M, or whatever is the highest/closest quant available (GGUF).
  - **Flash Attention** and **8-bit cache compression** are enabled.
  - All models are extended to **12K context length** (auto RoPE from KCPP).
  - Response size set to 512 tokens max.
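
For reference, a KoboldCPP launch roughly matching these settings looks like the sketch below. Flag names are taken from recent KoboldCPP builds and may differ in your version, so double-check them against `koboldcpp.exe --help` rather than copying blindly.

```
# Approximate 7-10B setup (a sketch, not my exact command line):
# CuBLAS on, mmq simply not passed, ContextShift is on by default.
koboldcpp.exe --model model-Q8_0.gguf --usecublas --contextsize 16384 --flashattention

# Approximate 11-15B setup: 12K context plus 8-bit KV cache compression.
koboldcpp.exe --model model-Q4_K_M.gguf --usecublas --contextsize 12288 --flashattention --quantkv 1

# The fixed seed (123) and max response length are set per request from SillyTavern, not here.
```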
|
|
|
|
|
# System Prompt and Instruct Format
|
|
|
- The exact system prompt and instruct format files can be found in the [file repository](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main).
- All models are tested in whichever instruct format they are supposed to be comfortable with (as long as it's ChatML or L3 Instruct); see the reminder below.
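
As a quick reminder, a ChatML-formatted exchange looks roughly like this (the actual system prompt and stop strings are in the files linked above); L3 Instruct follows the same pattern with Llama 3's own header and end-of-turn tokens:

```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{model reply}<|im_end|>
```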
|
|
|
|
|
# Available Tests
|
|
|
### DoggoEval
|
|
|
The goal of this test, featuring a dog (Rex) and his owner (EsKa), is to determine if a model is good at obeying a system prompt and character card. The trick is that dogs can't talk, but LLMs love to.
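
To illustrate the kind of constraint being tested (this is **not** the actual card, which lives in the repository linked below, just a made-up sketch of the idea), the card boils down to instructions along these lines:

```
# Hypothetical illustration only -- see the DoggoEval folder for the real card.
Rex is EsKa's dog. Rex is an ordinary dog: he cannot speak, write, or understand
complex sentences. Rex only communicates through barks, body language and actions,
described in third person.
```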
|
|
|
- [Results and discussions are hosted in this thread](https://huggingface.co/SerialKicked/ModelTestingBed/discussions/1) ([old thread here](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13))
- [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
- TODO: Charts and screenshots
|
|
|
### MinotaurEval
|
|
|
TODO: The goal of this test is to check whether a model can follow a very specific prompting method and maintain situational awareness in the smallest labyrinth in the world.
|
|
|
- Discussions will be hosted here.
- Files and cards will be available soon (tm).
|
|
|
### TimeEval
|
|
|
TODO: The goal of this test is to see whether the bot can behave properly at 16K context, and recall and summarise "old" information accurately.
|
|
|
- Discussions will be hosted here.
- Files and cards will be available soon (tm).
|
|
|
|
|
# Limitations
|
|
|
I'm testing for things I'm interested in. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or the smallest change in settings, are bound to produce very different results.
|
|
|
I usually give the models I'm testing a fair shake in a more casual setting. I regenerate tons of outputs with random seeds, and while there are (large) variations, things tend to even out to the results shown in testing. If they don't, I'll make a note of it.