README.md · HuggingFaceGECLM/mix_tok_v1 at 0e15c528cc0a46c74f73131db997299b38e2c15d

language:
  - en

V1 of an English/code tokenizer. Equal mix between:

On the NL side:

On the code side:

- Jupyter notebooks (0.5 weight, it was small)
- GH issues
- Stackexchange
- The cleaned Python Stack

For a total of 1/3 code data (although there is a lot of English in Stackexchange and GH).
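
To make the weighting concrete, here is a minimal sketch of how relative source weights could be turned into sampling fractions within the code side. Only the 0.5 weight for Jupyter notebooks comes from the description above; the weight of 1 for the other three sources is an assumption based on the "equal mix" wording, and the dictionary keys and the `normalize` helper are hypothetical names used for illustration, not part of the actual training setup.

```python
from fractions import Fraction

# Relative weights for the code-side sources.
# Only the 0.5 for Jupyter notebooks is stated in the README;
# the weight of 1 for the other sources is an ASSUMPTION ("equal mix").
CODE_SOURCE_WEIGHTS = {
    "jupyter_notebooks": Fraction(1, 2),
    "gh_issues": Fraction(1),
    "stackexchange": Fraction(1),
    "python_stack": Fraction(1),
}


def normalize(weights):
    """Turn relative weights into sampling fractions that sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Fractions are relative to the code side only, not to the whole corpus.
code_fractions = normalize(CODE_SOURCE_WEIGHTS)

for name, frac in code_fractions.items():
    print(f"{name}: {frac} (~{float(frac):.3f})")
```

Under these assumed weights, Jupyter notebooks would make up 1/7 of the code-side samples and each of the other sources 2/7. These figures describe the split within the code side only; they say nothing about the NL sources, which are not listed here.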