umarzein/silicone-dyda-16k-8k-tokenizer

	---
	license: mit
	datasets:
	- silicone
	---
	In this repo, "dataset" and "datasets" refer to the first 16384 rows of the `train` split of the `dyda_da` configuration of `silicone`.
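
The subset described above could be selected roughly as follows. This is a minimal sketch, not the author's script: `split_spec` and `load_subset` are hypothetical helper names, and the dataset id `"silicone"` is taken from this card as-is (depending on your `datasets` version, a different repo id or extra loading options may be needed).

```python
def split_spec(n_rows=16384, split="train"):
    """Build a split-slicing string such as "train[:16384]"."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return f"{split}[:{n_rows}]"


def load_subset(n_rows=16384):
    """Load the first `n_rows` rows of silicone / dyda_da / train.

    Assumption: the dataset is reachable under the id "silicone".
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    return load_dataset("silicone", "dyda_da", split=split_spec(n_rows))


# Usage (downloads data, so it is left commented out):
# subset = load_subset()
# print(len(subset))
```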

	Trained from the gpt2 tokenizer, this tokenizer matches the gpt2 tokenizer's average number of tokens per datapoint on this dataset while using a vocab_size of only 8192 (down from the base's 50257).
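
One way to reproduce this kind of comparison is sketched below. The card does not state the exact training procedure, so this assumes the `train_new_from_iterator` method of fast tokenizers in `transformers`; the helper names and the `"Utterance"` column in the usage comments are assumptions for illustration.

```python
def avg_tokens_per_datapoint(encode, texts):
    """Mean number of tokens that `encode` produces per text."""
    texts = list(texts)
    if not texts:
        raise ValueError("texts must be non-empty")
    return sum(len(encode(t)) for t in texts) / len(texts)


def train_small_tokenizer(texts, vocab_size=8192):
    """Train a new tokenizer with gpt2's pipeline but a smaller vocabulary.

    Assumption: this mirrors how the tokenizer in this repo was built;
    the card itself only says it was trained from the gpt2 tokenizer.
    """
    from transformers import AutoTokenizer  # imported lazily

    base = AutoTokenizer.from_pretrained("gpt2")
    return base.train_new_from_iterator(texts, vocab_size=vocab_size)


# Usage (needs network access, so it is left commented out):
# texts = subset["Utterance"]  # column name is an assumption
# small = train_small_tokenizer(texts)
# base = AutoTokenizer.from_pretrained("gpt2")
# print(avg_tokens_per_datapoint(base.encode, texts))
# print(avg_tokens_per_datapoint(small.encode, texts))
```

Passing `encode` as a plain callable keeps the averaging helper independent of any particular tokenizer class.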


	```python
	import transformers
	tokenizer = transformers.GPT2TokenizerFast.from_pretrained("umarzein/silicone-dyda-16k-8k-tokenizer")
	```
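
Once loaded, the tokenizer can be sanity-checked with a small helper like the one below. This is an illustrative sketch: `token_report` is a hypothetical name, and it only relies on the tokenizer exposing `encode` and `decode`.

```python
def token_report(tokenizer, text):
    """Encode `text`, decode it back, and summarize the result."""
    ids = tokenizer.encode(text)
    return {
        "n_tokens": len(ids),
        # May be False if decoding normalizes spacing around punctuation.
        "roundtrip_ok": tokenizer.decode(ids) == text,
    }


# Usage with the tokenizer loaded above (left commented out):
# print(token_report(tokenizer, "hello, how are you?"))
```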