TokenMonster

The documentation and code are available on GitHub at alasdairforsythe/tokenmonster.

The pretrained vocabularies are all available for download from this repository.

July 11: TokenMonster v1.1.1 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day.

A vocabulary name is composed of the parts below, joined with hyphens.

Choose a dataset from:

  • code
  • english
  • englishcode
  • fiction

Choose a vocab size from:

  • 1024
  • 2048
  • 4096
  • 8000
  • 16000
  • 24000
  • 32000
  • 40000
  • 50256
  • 65536
  • 100256

Choose an optimization mode from:

  • unfiltered
  • clean
  • balanced
  • consistent
  • strict

For a capcode-disabled vocabulary, add:

  • nocapcode

And finally add the version number:

  • v1

Examples:

  • fiction-24000-consistent-v1
  • code-4096-clean-nocapcode-v1
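The naming scheme above can be sketched as a small helper. This is an illustration only; `vocab_name` is a hypothetical function, not part of the TokenMonster API.

```python
def vocab_name(dataset, size, mode, capcode=True, version="v1"):
    """Compose a vocabulary name from the parts listed above.

    dataset: one of "code", "english", "englishcode", "fiction"
    size:    vocabulary size, e.g. 24000
    mode:    optimization mode, e.g. "consistent"
    capcode: if False, insert "nocapcode" before the version
    """
    parts = [dataset, str(size), mode]
    if not capcode:
        parts.append("nocapcode")
    parts.append(version)
    return "-".join(parts)

print(vocab_name("fiction", 24000, "consistent"))        # fiction-24000-consistent-v1
print(vocab_name("code", 4096, "clean", capcode=False))  # code-4096-clean-nocapcode-v1
```

The resulting string matches the file names used for the downloadable vocabularies in this repository.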

There are also two additional vocabularies that do not follow this naming scheme:

  • gpt2
  • llama