---
license: mit
language:
- en
tags:
- text generation
datasets:
- fhswf/TinyStoriesV2_cleaned
---

BPE Tokenizer for TinyStoriesV2
---

Based on the GPT-Neo BPE tokenizer, but with a smaller vocabulary. Trained on TinyStoriesV2.

- Vocab size: 4096
  - 256 base chars
  - 1 extra token: `<|endoftext|>`
  - 3839 merges
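
The vocabulary breakdown above can be checked with a short calculation. This is only a sketch: it assumes each learned merge adds exactly one token to the vocabulary and that the 256 base entries and the single extra token are fixed. `merges_for_vocab` is a hypothetical helper written for illustration, not part of this tokenizer's code.

```python
# Sketch: how a BPE vocabulary of a given size breaks down.
# Assumption: every merge rule contributes exactly one new token.
BASE_TOKENS = 256                    # base chars listed in the card
SPECIAL_TOKENS = ["<|endoftext|>"]   # the single extra token listed in the card


def merges_for_vocab(vocab_size: int,
                     base_tokens: int = BASE_TOKENS,
                     n_special: int = len(SPECIAL_TOKENS)) -> int:
    """Return how many merge rules fit into `vocab_size` (hypothetical helper)."""
    merges = vocab_size - base_tokens - n_special
    if merges < 0:
        raise ValueError("vocab_size is too small for the base and special tokens")
    return merges


n_merges = merges_for_vocab(4096)
print(n_merges)  # 3839
```

The same helper shows how the merge budget would change for a different target size, for example `merges_for_vocab(8192)`.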