🧪 FineWeb v1 data experiments
Ablation models trained for our data experiments.
Note: Ablation trained to compare WARC + trafilatura text extraction with the default WET extraction from CommonCrawl [28BT]
HuggingFaceFW/ablation-exp-textext-wet-28BT
Note: Ablation trained to compare WARC + trafilatura text extraction with the default WET extraction from CommonCrawl [28BT]
HuggingFaceFW/ablation-exp-fw-base_filtering-350BT
Note: Trained on all CommonCrawl dumps after text extraction and our base filtering [350BT]
HuggingFaceFW/ablation-exp-dedup-global_minhash-350BT
Note: Trained on all CommonCrawl dumps after global MinHash deduplication [350BT]
HuggingFaceFW/ablation-exp-dedup-independent_minhash-350BT
Note: Trained on all CommonCrawl dumps after independent MinHash deduplication [350BT]
HuggingFaceFW/ablation-exp-dedup-ind_mh-global_line-350BT
Note: Trained on all CommonCrawl dumps after independent MinHash deduplication followed by line dedup [350BT]
HuggingFaceFW/ablation-exp-dedup-ind_mh-global_line_minwords-350BT
Note: Trained on all CommonCrawl dumps after independent MinHash deduplication followed by line dedup with a minimum word count [350BT]
HuggingFaceFW/ablation-exp-dedup-ind_mh-global_3line-350BT
Note: Trained on all CommonCrawl dumps after independent MinHash deduplication followed by line dedup on ranges of 3 lines [350BT]
HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT
Note: Filtering baseline CC: trained on the 2019-18 dump after base filtering and independent MinHash [28BT]
HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT
Note: Filtering baseline C4: trained on 28BT of C4 (C4 is also based on the 2019-18 dump) [28BT]
HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT
Note: Filtering baseline CC + C4 word lengths filter [28BT]
HuggingFaceFW/ablation-exp-filter-c4-tpunct-28BT
Note: Filtering baseline CC + C4 terminal punctuation filter [28BT]
HuggingFaceFW/ablation-exp-filter-c4-curly_bracket-28BT
Note: Filtering baseline CC + C4 curly bracket filter [28BT]
HuggingFaceFW/ablation-exp-filter-c4-all-28BT
Note: Filtering baseline CC + all C4 filters [28BT]
HuggingFaceFW/ablation-exp-filter-c4-all_except_tpunct-28BT
Note: Filtering baseline CC + all C4 filters except terminal punctuation [28BT]
HuggingFaceFW/ablation-exp-fw-base_filt-ind_mh-c4_filters-350BT
Note: Larger validation run: all dumps with base filtering, independent MinHash, and C4 filters except terminal punctuation [350BT]
HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT
Note: Filtering baseline CC + custom lines punctuation filter [28BT]
HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT
Note: Filtering baseline CC + custom lines duplicated characters filter [28BT]
HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT
Note: Filtering baseline CC + custom short lines filter [28BT]
HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT
Note: Filtering baseline CC + all 3 custom filters [28BT]
HuggingFaceFW/ablation-model-fineweb-v1
Note: Larger validation run with the FINAL DATA: all dumps with base filtering, independent MinHash, C4 filters except terminal punctuation, and custom filters [350BT]
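To make the three custom filters named in the list more concrete, here is a minimal sketch of document-level heuristics of that kind. The three thresholds (0.12, 0.01, 0.67) are read off the model names; everything else is an assumption made for illustration, including how each ratio is defined, the direction of each comparison, the 30-character cutoff for a "short" line, and the set of terminal punctuation marks. It should not be read as the pipeline used to train these models.

```python
from collections import Counter

# Thresholds copied from the model names in the list; their exact meaning
# in the original experiments is assumed here, not confirmed.
LINES_PUNCT_MIN = 0.12   # from "lines_punct_0.12"
CHAR_DUP_MAX = 0.01      # from "line_char_duplicated_0.01"
SHORT_LINE_MAX = 0.67    # from "line_ratio_0.67"

# Hypothetical helper constants, chosen only for this sketch.
SHORT_LINE_CHARS = 30
TERMINAL_PUNCT = (".", "!", "?", '"')


def line_stats(text: str) -> dict:
    """Compute three per-document ratios over the non-empty lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        # An empty document fails every check below.
        return {"punct_ratio": 0.0, "dup_char_ratio": 0.0, "short_ratio": 1.0}
    counts = Counter(lines)
    total_chars = sum(len(ln) for ln in lines)
    # Characters in repeated copies of a line (every copy after the first).
    dup_chars = sum(len(ln) * (n - 1) for ln, n in counts.items())
    return {
        "punct_ratio": sum(ln.endswith(TERMINAL_PUNCT) for ln in lines) / len(lines),
        "dup_char_ratio": dup_chars / total_chars,
        "short_ratio": sum(len(ln) <= SHORT_LINE_CHARS for ln in lines) / len(lines),
    }


def keep_document(text: str) -> bool:
    """Return True when the document passes all three assumed checks."""
    s = line_stats(text)
    return (
        s["punct_ratio"] > LINES_PUNCT_MIN
        and s["dup_char_ratio"] < CHAR_DUP_MAX
        and s["short_ratio"] < SHORT_LINE_MAX
    )
```

Under these assumptions, a page made of full sentences passes, while a navigation menu (short lines, no terminal punctuation) or a page that repeats the same line many times is dropped. Running each check on its own, as the separate ablations in the list do, would amount to calling `line_stats` and applying a single comparison.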