HuggingFaceGECLM/mix_tok_v1



---
language:
- en
---

V1 of an English/code tokenizer, trained on an equal mix of the sources below.

On the NL side:
- Books
- C4
- v1 of our CC (helen quality classifier)
- enwiki
- Gutenberg
- Reddit

On the code side:
- Jupyter notebooks (0.5 weight, because the dataset was small)
- GH issues
- Stackexchange
- The cleaned Python Stack

In total, about 1/3 of the data is code (although Stackexchange and GH issues contain a lot of English).
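
As a rough check of the "1/3 code" figure, the mix can be sketched as per-source weights. This assumes "equal mix" means a weight of 1.0 for every source except Jupyter notebooks (0.5, as listed above). The card does not say whether weights apply to tokens, documents, or bytes, so the variable and function names below are purely illustrative.

```python
# Assumed weights: 1.0 per source, except Jupyter notebooks (0.5, per the card).
nl_weights = {
    "Books": 1.0,
    "C4": 1.0,
    "CC v1": 1.0,
    "enwiki": 1.0,
    "Gutenberg": 1.0,
    "Reddit": 1.0,
}
code_weights = {
    "Jupyter notebooks": 0.5,
    "GH issues": 1.0,
    "Stackexchange": 1.0,
    "Cleaned Python Stack": 1.0,
}


def code_fraction(nl, code):
    """Return the share of the total mix weight that comes from code sources."""
    code_total = sum(code.values())
    return code_total / (sum(nl.values()) + code_total)


fraction = code_fraction(nl_weights, code_weights)
print(f"code share: {fraction:.3f}")  # 3.5 / 9.5
```

Under these assumptions the code side accounts for 3.5 of 9.5 weight units, slightly above one third, which is consistent with the approximate figure quoted above.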