
openwebtext dataset

after running prepare.py (preprocessing) we get:

  • train.bin is ~17GB, val.bin ~8.5MB
  • train has ~9B tokens (9,035,582,198)
  • val has ~4M tokens (4,434,897)

this came from 8,013,769 documents in total.
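the numbers above line up if each token is stored as a raw 2-byte id: ~9B tokens × 2 bytes ≈ 17GB for train.bin. a minimal sketch of reading the files back, assuming (as in nanoGPT's prepare.py) the .bin files are headerless arrays of uint16 GPT-2 BPE token ids:

```python
import numpy as np

def load_tokens(path):
    # memmap keeps the ~17GB train.bin on disk instead of loading it into RAM;
    # uint16 works because the GPT-2 vocab (50257) fits in 16 bits
    return np.memmap(path, dtype=np.uint16, mode="r")

# tiny demo with fake token ids (real usage: load_tokens("train.bin"))
demo = np.array([50256, 1, 2, 3], dtype=np.uint16)  # 50256 = GPT-2 <|endoftext|>
demo.tofile("demo.bin")
tokens = load_tokens("demo.bin")
```

len(tokens) on the real files should match the counts above (9,035,582,198 for train, 4,434,897 for val).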

references: