---
license: mit
datasets:
- silicone
---
In this repo, "dataset" and "datasets" refer to the first 16384 rows of the `silicone`:`dyda_da`:`train` dataset.
This tokenizer was trained from the gpt2 tokenizer on that dataset. It matches the base tokenizer's average number of tokens per datapoint while using a vocab_size of only 8192 (down from the base's 50257).
```python
import transformers

# Load the 8192-entry tokenizer from the Hugging Face Hub
tokenizer = transformers.GPT2TokenizerFast.from_pretrained("umarzein/silicone-dyda-16k-8k-tokenizer")
```
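
The tokens-per-datapoint comparison can be checked with a small helper like the one below. This is a minimal sketch, not the script used for this repo: `avg_tokens_per_datapoint` and `sample_texts` are hypothetical names, the sample utterances are made up rather than taken from the dataset, and a whitespace split stands in for a real tokenizer so the snippet runs offline.

```python
from statistics import mean
from typing import Callable, Iterable, List


def avg_tokens_per_datapoint(tokenize: Callable[[str], List], texts: Iterable[str]) -> float:
    """Return the mean number of tokens that `tokenize` produces per text."""
    return mean(len(tokenize(text)) for text in texts)


# Hypothetical sample utterances, not taken from the dataset.
sample_texts = ["how are you ?", "fine , thanks .", "see you tomorrow"]

# Stand-in tokenizer (whitespace split) so the sketch runs without downloads.
print(avg_tokens_per_datapoint(str.split, sample_texts))
```

To compare real tokenizers, pass `tokenizer.tokenize` from the snippet above and `transformers.GPT2TokenizerFast.from_pretrained("gpt2").tokenize` to the helper with the same list of texts.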