	## openwebtext dataset

	after running `prepare.py` (preprocess) we get:

	- train.bin is ~17GB, val.bin ~8.5MB
	- train has ~9B tokens (9,035,582,198)
	- val has ~4M tokens (4,434,897)

	this came from 8,013,769 documents in total.
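
as a rough sanity check, the file sizes above can be reproduced from the token counts. this is only a sketch: it assumes each token id is stored as a 2-byte unsigned integer (uint16), which this readme does not state, and it reads "GB"/"MB" as binary units (GiB/MiB). the constant and function names are made up for illustration.

```python
# Token and document counts copied from the figures above.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
NUM_DOCUMENTS = 8_013_769

# Assumption (not stated in this readme): one token id = 2 bytes (uint16).
BYTES_PER_TOKEN = 2


def size_in_bytes(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """Return the expected size of a flat binary file of token ids."""
    return num_tokens * bytes_per_token


train_bytes = size_in_bytes(TRAIN_TOKENS)
val_bytes = size_in_bytes(VAL_TOKENS)
avg_tokens_per_doc = (TRAIN_TOKENS + VAL_TOKENS) / NUM_DOCUMENTS

print(f"train.bin: {train_bytes / 2**30:.1f} GiB")
print(f"val.bin:   {val_bytes / 2**20:.1f} MiB")
print(f"avg tokens per document: {avg_tokens_per_doc:.0f}")
```

under that assumption the numbers come out to about 16.8 GiB and 8.5 MiB, in line with the sizes listed, and to roughly 1,128 tokens per document on average.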

	references:

	- OpenAI's WebText dataset is discussed in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
	- [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset