flowpoint
/

fasttext_cbow_dclm400


	---
	license: apache-2.0
	---

	# fasttext cbow on dclm400

	A continuous-bag-of-words model trained on https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x

	the cbow model was trained with https://github.com/facebookresearch/fastText/

	the dataset was downloaded with git-lfs

	the dataset commit was: f20ae752116ce7b4ab15d31e1e40b094229bf911

	the files were decompressed and the `text` field extracted with:

	`parallel "zstd --keep --stdout -d {} | jq .text > {/}.txt" ::: /root/lfs/dclm-pool-400m-1x/*.jsonl.zst`
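To illustrate what the `jq .text` step writes for each record, here is a minimal sketch on a made-up one-record sample. The sample record and the helper names `sample_record` and `extract_text` are hypothetical and not part of this repo; the sketch also assumes `jq` is installed.

```shell
# Hypothetical one-record sample standing in for one line of a decompressed shard.
sample_record() {
  printf '{"text":"first line\\nsecond line","url":"https://example.invalid"}\n'
}

# Same filter as in the command above. Without -r, jq prints the JSON-encoded
# string: the surrounding quotes stay, and an embedded newline remains the two
# characters backslash + n, so each document occupies exactly one output line.
extract_text() {
  jq .text
}

# Only run the demo when jq is available (assumption: it may not be installed).
if command -v jq >/dev/null 2>&1; then
  sample_record | extract_text
fi
```

On the sample this prints `"first line\nsecond line"` on a single line, quotes included, which is the form the text has in the `.txt` files.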

	the resulting text files were concatenated with:

	`cat *.txt > CC_SHARD_ALL.jsonl.txt`

	the `sha256sum CC_SHARD_ALL.jsonl.txt` is

	`576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640 CC_SHARD_ALL.jsonl.txt`
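If you rebuild the corpus yourself, the digest above can be checked with a small helper. This is a sketch, assuming GNU coreutils `sha256sum` is available; the function name `verify_sha256` is made up for illustration.

```shell
# verify_sha256 <file> <expected-hex-digest>
# Returns 0 when the file's SHA-256 digest matches, non-zero otherwise.
verify_sha256() {
  actual="$(sha256sum "$1" | cut -d ' ' -f 1)"
  [ "$actual" = "$2" ]
}

# Digest and file name as stated above; the check only runs if the file exists.
corpus="CC_SHARD_ALL.jsonl.txt"
expected="576e4e79e76b9ca24dc77a8da0df17ad5efc9c5ca16c9a86f62e7b7b4ae8c640"
if [ -f "$corpus" ]; then
  if verify_sha256 "$corpus" "$expected"; then
    echo "checksum OK"
  else
    echo "checksum MISMATCH"
  fi
fi
```

A mismatch most likely means a different dataset commit or a different shard order in the `cat` step, since both change the bytes of the concatenated file.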

	then the fasttext model was trained with default settings

	the fasttext binary was built from the fasttext repo (main branch) at commit `1142dc4c4ecbc19cc16eee5cdd28472e689267e6`

	it was compiled with gcc 13.3.1

	training command:

	`prlimit -m 3200000000 fasttext cbow -input CC_SHARD_ALL.jsonl.txt -output fasttext_models/model`
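fastText writes `<output prefix>.bin` and `<output prefix>.vec`, so the command above produces `fasttext_models/model.bin`. A minimal sketch for printing word vectors from that model follows. The paths are assumptions: the files in this repo may be named differently. The helper name `print_vectors` and the query words are made up for illustration.

```shell
# Assumed locations (not confirmed above): the fastText CLI as ./fasttext and
# the trained model at the -output prefix plus ".bin".
FASTTEXT_BIN="${FASTTEXT_BIN:-./fasttext}"
MODEL_BIN="${MODEL_BIN:-fasttext_models/model.bin}"

# Reads one word per line on stdin and prints each word followed by its vector.
print_vectors() {
  if [ ! -x "$FASTTEXT_BIN" ] || [ ! -f "$MODEL_BIN" ]; then
    echo "missing $FASTTEXT_BIN or $MODEL_BIN" >&2
    return 1
  fi
  "$FASTTEXT_BIN" print-word-vectors "$MODEL_BIN"
}

# Example query; "|| true" keeps the script going when the files are absent.
printf 'language\nmodel\n' | print_vectors || true
```

The same binary also offers `nn` for interactive nearest-neighbour queries against the `.bin` file.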

	the exact fasttext binary is included in this repo as `fasttext`

	decompression and concatenation took a few hours.

	model training took 100 hours on 8 cores, plus a few hours for fasttext to read in the words.