metadata

license: apache-2.0

fasttext cbow model trained on dclm-pool-400m-1x (dclm400)

the dataset was downloaded with git-lfs

the dataset commit was: f20ae752116ce7b4ab15d31e1e40b094229bf911
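To confirm that a local checkout matches the commit recorded above, something along these lines may help. This is a minimal sketch: `check_commit` is a hypothetical helper, and the directory is the one that appears in the decompression command in this README.

```shell
# Hypothetical helper: succeeds only if the checkout in $1 is at commit $2.
check_commit() {
    [ "$(git -C "$1" rev-parse HEAD)" = "$2" ]
}

DATASET_DIR=/root/lfs/dclm-pool-400m-1x   # path taken from the decompression command
EXPECTED_COMMIT=f20ae752116ce7b4ab15d31e1e40b094229bf911

# Only run the check when the checkout is actually present on this machine.
if [ -d "$DATASET_DIR" ]; then
    if check_commit "$DATASET_DIR" "$EXPECTED_COMMIT"; then
        echo "dataset commit matches"
    else
        echo "dataset commit differs from the one recorded here" >&2
    fi
fi
```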

the files were decompressed with:

parallel "zstd --keep --stdout -d {} | jq .text > {/}.txt" ::: /root/lfs/dclm-pool-400m-1x/*.jsonl.zst
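Because `jq .text` is called without `-r`, each document comes out as a single JSON-quoted string on one line, with quotes and escape sequences kept literally. The sketch below illustrates this on a made-up record; the sample JSON is an assumption for illustration and is not taken from the dataset.

```shell
# Hypothetical one-record sample; real records come from the .jsonl.zst shards.
sample='{"text":"first line\nsecond line","url":"http://example.com"}'

if command -v jq >/dev/null 2>&1; then
    # Same filter as in the decompression command: the output stays JSON-encoded,
    # so the embedded newline remains the two characters backslash and n.
    extracted=$(printf '%s\n' "$sample" | jq .text)
    printf '%s\n' "$extracted"
else
    echo "jq not installed; skipping illustration" >&2
fi
```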

and concatenated with:

cat *.txt > CC_SHARD_ALL.jsonl.txt

the sha256sum of CC_SHARD_ALL.jsonl.txt is

576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640 CC_SHARD_ALL.jsonl.txt
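A rebuilt corpus file can be compared against this digest roughly as follows. This is a sketch, with `verify_sha256` as a hypothetical helper name; it assumes GNU coreutils `sha256sum` is available.

```shell
# Hypothetical helper: succeeds only if file $1 has the hex digest $2.
verify_sha256() {
    actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

CORPUS=CC_SHARD_ALL.jsonl.txt
EXPECTED_SHA256=576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640

# Hashing the full corpus takes a while; skip quietly if it is not present.
if [ -f "$CORPUS" ]; then
    if verify_sha256 "$CORPUS" "$EXPECTED_SHA256"; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH" >&2
    fi
fi
```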

then the fasttext model was trained with default settings, using a fasttext binary built from

fasttext-repo (main branch) with the commit hash 1142dc4c4ecbc19cc16eee5cdd28472e689267e6

compiled with gcc 13.3.1

training command:

prlimit -m 3200000000 fasttext cbow -input CC_SHARD_ALL.jsonl.txt -output fasttext_models/model
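With `-output fasttext_models/model`, fasttext writes `fasttext_models/model.bin` and `fasttext_models/model.vec`. One way to query the trained vectors is the `print-word-vectors` subcommand, sketched below. `print_vectors` is a hypothetical wrapper, the query words are arbitrary examples, and the paths assume the layout produced by the training command rather than any particular layout of this repo.

```shell
# Hypothetical wrapper: feed the given words to `fasttext print-word-vectors`,
# which reads one word per line from stdin and prints each word with its vector.
print_vectors() {
    bin=$1
    model=$2
    shift 2
    printf '%s\n' "$@" | "$bin" print-word-vectors "$model"
}

FASTTEXT=./fasttext                 # binary included in this repo
MODEL=fasttext_models/model.bin     # assumed output path of the training command

# Only query when both the binary and the model are present.
if [ -x "$FASTTEXT" ] && [ -f "$MODEL" ]; then
    print_vectors "$FASTTEXT" "$MODEL" king queen
fi
```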

the exact fasttext binary is included in this repo as fasttext

the decompression and concatenation took a few hours.

the model training took 100 hours on 8 cores, plus a few hours for fasttext to read in the words