---
license: apache-2.0
---
# fasttext cbow on dclm400
A continuous-bag-of-words model trained on https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x
the cbow model was trained with https://github.com/facebookresearch/fastText/
the dataset was downloaded with git-lfs
the dataset commit was: f20ae752116ce7b4ab15d31e1e40b094229bf911
the files were decompressed with:
`parallel "zstd --keep --stdout -d {} | jq .text > {/}.txt" ::: /root/lfs/dclm-pool-400m-1x/*.jsonl.zst`
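
Note that `jq .text` without `-r` prints each `text` field as a JSON-encoded string, so every document ends up on a single line, wrapped in double quotes, with newlines kept as the two characters `\n`. A minimal Python sketch of roughly what happens to one record is below. The function name `extract_text_line` and the sample record are made up for illustration, and the escaping rules of `json.dumps` only approximate those of jq.

```python
import json


def extract_text_line(jsonl_line: str) -> str:
    """Approximate `jq .text` for one JSONL record (hypothetical helper)."""
    record = json.loads(jsonl_line)
    # Re-encode the text as a JSON string: quotes stay, newlines become "\n".
    # ensure_ascii=False keeps non-ASCII characters as UTF-8, like jq does.
    return json.dumps(record["text"], ensure_ascii=False)


# Made-up sample record, not taken from the dataset.
sample = '{"text": "Hello\\nworld", "url": "http://example.com"}'
print(extract_text_line(sample))
```

The practical consequence is that the training file has one document per line, and the surrounding quotes and escape sequences are part of the text the model saw.
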
then concatenated with:
`cat *.txt > CC_SHARD_ALL.jsonl.txt`
the `sha256sum CC_SHARD_ALL.jsonl.txt` output is:
`576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640 CC_SHARD_ALL.jsonl.txt`
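
To check a rebuilt training file against this checksum without loading it into memory, something like the following sketch could be used. The helper name `sha256_of_file` and the chunk size are assumptions, and the file is expected in the current directory.

```python
import hashlib
import os

# Checksum of the concatenated training file, as stated in this README.
EXPECTED = "576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    path = "CC_SHARD_ALL.jsonl.txt"
    if os.path.exists(path):
        print("match" if sha256_of_file(path) == EXPECTED else "MISMATCH")
    else:
        print(f"{path} not found, nothing to verify")
```
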
then the fasttext model was trained with default settings, using a binary built from the
fasttext-repo (main branch) at the commit hash `1142dc4c4ecbc19cc16eee5cdd28472e689267e6`,
compiled with gcc 13.3.1
training command:
`prlimit -m 3200000000 fasttext cbow -input CC_SHARD_ALL.jsonl.txt -output fasttext_models/model`
the exact fasttext binary is included in this repo as `fasttext`
the decompression and concatenation took a few hours.
the model training took 100 hours on 8 cores, plus a few hours for fasttext to read in the words