---
license: mit
datasets:
- silicone
---
The "dataset" (and/or "datasets") in this repo refers to the first 16384 rows of the `train` split of the `dyda_da` configuration of the `silicone` dataset.
Retrained from the GPT-2 tokenizer on this dataset, this tokenizer matches the base tokenizer's average number of tokens per datapoint while using a vocab_size of only 8192 (down from the base's 50257):
```python
import transformers

# load the tokenizer from the Hugging Face Hub
tokenizer = transformers.GPT2TokenizerFast.from_pretrained(
    "umarzein/silicone-dyda-16k-8k-tokenizer"
)
```
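
The idea behind shrinking the vocabulary can be sketched with the `tokenizers` library: train a byte-level BPE model with a small `vocab_size` cap on an in-domain corpus. This is a minimal illustration on toy data, not the exact procedure used for this repo; the corpus, pre-tokenization settings, and training parameters below are assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# toy in-domain corpus standing in for the silicone dyda_da utterances
corpus = [
    "hello , how are you doing today ?",
    "i am doing fine , thanks for asking .",
    "what did you do over the weekend ?",
]

# byte-level BPE, like GPT-2's tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# cap the vocabulary; the real tokenizer uses 8192, a toy cap is used here
trainer = trainers.BpeTrainer(vocab_size=300, show_progress=False)
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("how are you ?").ids
print(tokenizer.get_vocab_size(), len(ids))
```

A smaller vocabulary trained on in-domain text keeps common dialogue words as single tokens, which is why the average tokens-per-datapoint can stay close to the 50257-token base tokenizer's.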