guipenedo HF staff committed on
Commit
c85811e
·
1 Parent(s): efb8c33

remove commoncrawl section

Files changed (1)
  1. index.html +1 -105
index.html CHANGED
@@ -700,111 +700,7 @@
700
  </ul>
701
  <p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
702
  <p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
703
- <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
704
- <h2>Just like fine wine, not all crawls are created
705
- equal</h2>
706
- <p>During our ablation runs, we observed that certain crawls
707
- outperformed others by a significant margin. To investigate this, we conducted 27B-token runs for
708
- each dump (using the version with base filtering + ind dedup), with 2 trainings per dump, each on
709
- a different data subset. We trained 190 such models, totaling over 60k H100 GPU-hours. We then took
710
- the last 3 checkpoints of both runs and plotted the average of these 6 data points per dump.</p>
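- <p>For reference, here is a minimal sketch of how such a per-dump score can be computed; the input format and
- names are hypothetical and not our actual evaluation code:</p>
- <pre><code class="language-python">
- from statistics import mean
- def dump_score(scores_by_seed, last_n=3):
-     """Mean of the last `last_n` checkpoint scores of every seed.
-     With 2 seeds and last_n=3 this averages 6 data points per dump."""
-     points = []
-     for checkpoint_scores in scores_by_seed.values():  # one score list per training seed
-         points.extend(checkpoint_scores[-last_n:])     # keep only the final checkpoints
-     return mean(points)
- </code></pre>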
711
- <p>The plot below clearly shows that some dumps perform far
712
- worse than others. Each year is shown in a different color, and the number of crawls per year varies.</p>
713
-
714
- <div class="main-plot-container l-page">
715
- <figure><img src="plots/score_by_dump.png"/></figure>
716
- <div id="plot-score_by_dump"></div>
717
- </div>
718
-
719
- <p>We identified 5 main time intervals:</p>
720
- <ul>
721
- <li>2013 to 2016: relatively stable, average quality</li>
722
- </ul>
723
- <ul>
724
- <li>2017 to 2018: high quality, with a drop by the end of 2018</li>
725
- </ul>
726
- <ul>
727
- <li>2019 to 2021: high quality, steadily increasing</li>
728
- </ul>
729
- <ul>
730
- <li>2021-49 and 2022: very large drop in performance, followed by worse quality
731
- dumps
732
- </li>
733
- </ul>
734
- <ul>
735
- <li>2023 and 2024-10: almost exponential improvement. In particular, 2023-50
736
- and 2024-10 are by far the best dumps
737
- </li>
738
- </ul>
739
- <p>One possibility to improve performance when training
740
- models on &lt; 15T tokens would be to train on FineWeb while excluding the lowest quality CommonCrawl dumps.</p>
741
- <p>We conducted further analysis to investigate the factors
742
- causing these differences from dump to dump. In particular, we considered 3 potential causes: </p>
743
- <ul>
744
- <li>large sudden changes in the list of crawled URLs;</li>
745
- </ul>
746
- <ul>
747
- <li>synthetic (LLM generated) data;</li>
748
- </ul>
749
- <ul>
750
- <li>benchmark contamination;</li>
751
- </ul>
752
- <p>We go over each one in the following sections.</p>
753
- <h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
754
- <p>For each crawl from 2021-10 onwards, we gathered a list of
755
- the 60k most frequent <strong>FQDNs</strong> (fully qualified domain names). We then calculated the <a
756
- href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a><d-cite bibtex-key="jaccard1912distribution"></d-cite> between consecutive
757
- crawls. A high value means that a crawl shares many of its top FQDNs with the dump immediately preceding
758
- it, while a small value means that a considerable number of the top 60k FQDNs were downsampled or removed, or
759
- that many new FQDNs entered the top 60k.</p>
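- <p>As an illustration, a minimal sketch of this computation, assuming the per-crawl FQDN counts have already
- been extracted (the input format here is hypothetical):</p>
- <pre><code class="language-python">
- from collections import Counter
- def top_fqdns(counts: Counter, k: int = 60_000) -> set:
-     """Set of the k most frequent FQDNs of one crawl."""
-     return {fqdn for fqdn, _ in counts.most_common(k)}
- def jaccard(a: set, b: set) -> float:
-     """Size of the intersection divided by the size of the union."""
-     return len(a.intersection(b)) / len(a.union(b))
- def consecutive_similarities(fqdn_counts: dict) -> dict:
-     """Jaccard similarity of each crawl's top 60k FQDNs with the preceding crawl's."""
-     crawls = sorted(fqdn_counts)  # e.g. ["2021-10", "2021-17", ...]
-     tops = {crawl: top_fqdns(fqdn_counts[crawl]) for crawl in crawls}
-     return {f"{prev}/{curr}": jaccard(tops[prev], tops[curr])
-             for prev, curr in zip(crawls, crawls[1:])}
- </code></pre>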
760
- <figure><img src="plots/Untitled%204.png"/></figure>
761
- <p>The data indicates three significant changes:
762
- 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
763
- <p>The explanation for the changes between 2022-33/2022-40
764
- and 2023-40/2023-50 is straightforward: CommonCrawl accidentally did not index several popular domain suffixes,
765
- such as .co.uk, as documented in <a href="https://commoncrawl.org/errata/co-uk-cctld-not-included">this
766
- erratum</a>. This particular change does not seem to be particularly correlated with the overall dump quality.
767
- </p>
768
- <p>As to the shift from 2021-43 to 2021-49, which coincides
769
- with a sharp performance drop, roughly half (~30k) of the former’s top 60k FQDNs are not present in the
770
- latter’s list of top 60k FQDNs, and the dump size itself also decreased (19% reduction in WARC size, and a
771
- 28% token reduction after deduplication). </p>
772
- <p>We were unable to find a clear reason for this drastic
773
- change, but upon reaching out to CommonCrawl, we were informed that these differences likely stem from a
774
- major update in adult content and malicious site blocking. It is therefore possible that this stricter
775
- blocking, while aimed at adult and malicious sites, also caught a large number of unrelated
776
- high quality domains as false positives, which could explain the
777
- poorer performance of this crawl.</p>
778
- <h3>Synthetic data contamination [HAVE TO RECHECK]</h3>
779
- <p>Secondly, we wondered if part of the changes in
780
- performance on recent dumps could be attributed to the presence of a larger quantity of synthetic data (data
781
- generated by LLMs). Such a change would not be surprising due to the recent increase in popularity of LLMs,
782
- notably of ChatGPT.</p>
783
- <p>Since, to the best of our knowledge, there is no
784
- foolproof method to detect synthetic data, we opted to use a proxy metric: we measured the frequency of the
785
- following expressions: <code>delve, as a large language model, it&#x27;s important to note, rich tapestry,
786
- intertwined, certainly!, dive into</code>, which are commonly used by ChatGPT.</p>
787
- <p>It is important to note that not all samples containing
788
- one of these phrases were necessarily generated by ChatGPT (and that many ChatGPT-generated samples do
789
- not contain any of these phrases), but if the amount of synthetic data were constant across
790
- dumps, one would expect these frequencies to remain approximately constant over time.</p>
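- <p>A minimal sketch of such a proxy measurement is shown below; the exact counting method and the document
- iterator are simplifying assumptions rather than our actual pipeline:</p>
- <pre><code class="language-python">
- PHRASES = ["delve", "as a large language model", "it's important to note",
-            "rich tapestry", "intertwined", "certainly!", "dive into"]
- def proxy_frequency(documents):
-     """Fraction of documents in a dump containing at least one proxy phrase."""
-     total = flagged = 0
-     for text in documents:  # `documents` yields the raw text of each document
-         total += 1
-         lowered = text.lower()
-         if any(phrase in lowered for phrase in PHRASES):
-             flagged += 1
-     return flagged / total if total else 0.0
- </code></pre>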
791
- <p>The results are shown in the following graph:</p>
792
- <figure><img src="plots/Untitled%205.png"/></figure>
793
- <p>While the frequency remained approximately constant until
794
- 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
795
- in recent crawls, but the proxy metric also correlates well with the aggregate score, with a Pearson correlation of
796
- <strong>0.590</strong>. It is therefore possible that synthetic data has positively impacted performance on
797
- our selected tasks for these most recent dumps (with all the caveats of interpreting a single
798
- correlation measurement, without any randomized intervention or other causal analysis). In
799
- particular, it could explain why the 2023-50 and 2024-10 dumps perform so strongly.</p>
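- <p>For completeness, a small sketch of how such a correlation can be computed; the values below are made up
- for illustration, and only the 0.590 figure above comes from our measurements:</p>
- <pre><code class="language-python">
- from scipy.stats import pearsonr
- # Hypothetical per-dump values: proxy phrase frequency and aggregate benchmark score.
- proxy_metric = [0.0010, 0.0012, 0.0031, 0.0058]
- agg_score = [0.385, 0.388, 0.401, 0.416]
- r, p_value = pearsonr(proxy_metric, agg_score)
- print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
- </code></pre>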
800
- <h3>Benchmarks contamination [HAVE TO RECHECK]</h3>
801
- <p>Also, most of the benchmarks we use were introduced around
802
- <strong>2019</strong>. It is thus possible that the performance increase from 2019-XX to 2021-43 is partly caused by
803
- higher benchmark contamination in those crawls. Similarly, the recent increase in LLM popularity and
804
- evaluations might have increased benchmark contamination in recent crawls, which could explain the score
805
- improvements of the two most recent dumps, although the plot below does not clearly support this hypothesis.</p>
806
-
807
- <figure><img src="plots/Untitled%206.png"/></figure>
808
  <h2>Next steps</h2>
809
  <p>We want to continue improving FineWeb and will also
810
  release a technical report with more details soon.</p>
 
700
  </ul>
701
  <p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
702
  <p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
703
+ <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
704
  <h2>Next steps</h2>
705
  <p>We want to continue improving FineWeb and will also
706
  release a technical report with more details soon.</p>