remove commoncrawl section
index.html CHANGED (+1 -105)
@@ -700,111 +700,7 @@
</ul>
<p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
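<p>A minimal sketch of what this score-threshold filtering looks like (the document structure and the <code>score</code> field are illustrative assumptions, not the actual pipeline code):</p>
<pre><code class="language-python"># Keep a document if the educational classifier scored it at or above the threshold.
def keep(doc: dict, threshold: int = 3) -> bool:
    return doc["score"] >= threshold

docs = [
    {"text": "A lecture on photosynthesis...", "score": 4},
    {"text": "Buy cheap widgets online!", "score": 1},
]
fineweb_edu = [d for d in docs if keep(d, threshold=3)]        # stricter filter
fineweb_edu_large = [d for d in docs if keep(d, threshold=2)]  # keeps more tokens (4.5T per the text)
</code></pre>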
<p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
-<p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
-<h2>Just like fine wine, not all crawls are created equal</h2>
-<p>During our ablation runs, we observed that certain crawls outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B-token runs for each dump (we used the version with base filtering + ind dedup), with 2 trainings per dump, each using a different data subset. We trained 190 such models, totaling over 60k H100 GPU-hours. We subsequently took the last 3 checkpoints of both seeds and plotted the average of these 6 data points per dump.</p>
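<p>A small sketch of how these per-dump averages could be computed (the scores below are made-up illustrative values):</p>
<pre><code class="language-python">import pandas as pd

# One row per evaluated checkpoint: 2 seeds x last 3 checkpoints = 6 rows per dump.
scores = pd.DataFrame({
    "dump": ["2021-43"] * 6 + ["2021-49"] * 6,
    "seed": [0, 0, 0, 1, 1, 1] * 2,
    "agg_score": [0.42, 0.43, 0.44, 0.41, 0.42, 0.43,   # illustrative values
                  0.38, 0.39, 0.38, 0.37, 0.39, 0.38],
})

# One number per dump: the average of its 6 (seed, checkpoint) data points.
per_dump = scores.groupby("dump")["agg_score"].mean()
print(per_dump.sort_values(ascending=False))
</code></pre>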
-<p>The plot below clearly shows that some dumps perform far worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
-<div class="main-plot-container l-page">
-    <figure><img src="plots/score_by_dump.png"/></figure>
-    <div id="plot-score_by_dump"></div>
-</div>
-<p>We identified 5 main relevant time intervals:</p>
-<ul>
-    <li>2013 to 2016: relatively stable, average quality</li>
-    <li>2017 to 2018: high quality, with a drop by the end of 2018</li>
-    <li>2019 to 2021: high quality, steadily increasing</li>
-    <li>2021-49 and 2022: very large drop in performance, followed by worse quality dumps</li>
-    <li>2023 and 2024-10: almost exponential improvement. In particular, 2023-50 and 2024-10 are by far the best dumps</li>
-</ul>
-<p>One possibility to improve performance when training models on less than 15T tokens would be to train on FineWeb while excluding the worst quality CommonCrawl dumps.</p>
-<p>We conducted further analysis to investigate the factors causing these differences from dump to dump. In particular, we considered 3 potential causes:</p>
-<ul>
-    <li>large sudden changes in the list of crawled URLs;</li>
-    <li>synthetic (LLM generated) data;</li>
-    <li>benchmark contamination.</li>
-</ul>
-<p>We go over each one in the following sections.</p>
-<h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
-<p>For each crawl from 2021-10 onwards, we gathered a list of the 60k most frequent <strong>FQDNs</strong> (fully qualified domain names). We then calculated the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a><d-cite bibtex-key="jaccard1912distribution"></d-cite> between consecutive crawls. A high value means that a crawl/dump shares many of its top FQDNs with the dump immediately preceding it, while a small value means that a considerable number of the top 60k FQDNs were downsampled or removed, or, alternatively, that new FQDNs entered the top 60k.</p>
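<p>A sketch of this comparison, assuming we already have each crawl's URL list (the URL data shown is illustrative):</p>
<pre><code class="language-python">from collections import Counter
from urllib.parse import urlsplit

def top_fqdns(urls, k=60_000):
    """Count FQDNs (the host part of each URL) and return the k most frequent ones."""
    counts = Counter(urlsplit(u).hostname for u in urls)
    return {fqdn for fqdn, _ in counts.most_common(k)}

def jaccard(a: set, b: set) -> float:
    """len(A & B) / len(A | B): 1.0 means identical top-FQDN sets, 0.0 means disjoint."""
    return len(a & b) / len(a | b)

# Compare each crawl to the one immediately preceding it.
crawl_urls = {
    "2021-43": ["https://en.wikipedia.org/wiki/Wine", "https://news.example.com/a"],
    "2021-49": ["https://news.example.com/b", "https://blog.example.org/post"],
}
dumps = sorted(crawl_urls)
tops = {d: top_fqdns(crawl_urls[d]) for d in dumps}
for prev, curr in zip(dumps, dumps[1:]):
    print(f"{prev} -> {curr}: {jaccard(tops[prev], tops[curr]):.3f}")
</code></pre>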
-<figure><img src="plots/Untitled%204.png"/></figure>
-<p>The data indicates three significant changes: 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
-<p>The explanation for the changes between 2022-33/2022-40 and 2023-40/2023-50 is straightforward: CommonCrawl accidentally did not index several popular suffixes, such as .co.uk, as documented in <a href="https://commoncrawl.org/errata/co-uk-cctld-not-included">this erratum</a>. This particular change does not seem to be strongly correlated with overall dump quality.</p>
-<p>As to the shift from 2021-43 to 2021-49, which coincides with a sharp performance drop: roughly half (~30k) of the former’s top 60k FQDNs are not present in the latter’s top 60k list, and the dump size itself also decreased (a 19% reduction in WARC size, and a 28% token reduction after deduplication).</p>
-<p>We were unable to find a clear reason for this drastic change, but upon reaching out to CommonCrawl, we were informed that these differences likely stem from a major update to adult content and malicious site blocking. It is therefore possible that the updated adult site filter also removed a large number of high quality domains, resulting in the poor performance of this crawl. <strong>[TODO: change this framing a bit, it seems to suggest adult content is high quality for LLMs]</strong></p>
-<h3>Synthetic data contamination [HAVE TO RECHECK]</h3>
-<p>Secondly, we wondered whether part of the change in performance on recent dumps could be attributed to the presence of a larger quantity of synthetic data (data generated by LLMs). Such a change would not be surprising given the recent surge in popularity of LLMs, notably ChatGPT.</p>
-<p>Since, to the best of our knowledge, there is no foolproof method to detect synthetic data, we opted for a proxy metric: we measured the frequency of the following words and phrases: <code>delve, as a large language model, it's important to note, rich tapestry, intertwined, certainly!, dive into</code>, which are commonly used by ChatGPT.</p>
-<p>It is important to note that not all samples containing one of these phrases were necessarily generated by ChatGPT (and also that many ChatGPT generated samples do not contain any of these phrases), but assuming the amount of synthetic data did not change across dumps, one would expect these frequencies to remain approximately constant over time.</p>
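<p>One possible implementation of this proxy measurement (a simplification; the exact matching used may differ):</p>
<pre><code class="language-python">import re

# Phrases commonly associated with ChatGPT-generated text (from the list above).
PHRASES = ["delve", "as a large language model", "it's important to note",
           "rich tapestry", "intertwined", "certainly!", "dive into"]
PATTERN = re.compile("|".join(re.escape(p) for p in PHRASES), re.IGNORECASE)

def phrase_frequency(documents: list[str]) -> float:
    """Fraction of documents containing at least one of the proxy phrases."""
    hits = sum(1 for doc in documents if PATTERN.search(doc))
    return hits / len(documents)

# Computed per dump; a constant value over time would suggest a constant
# share of synthetic data (the documents here are illustrative).
docs = ["Let's delve into the rich tapestry of...", "Weather today: sunny."]
print(phrase_frequency(docs))
</code></pre>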
-<p>The results are shown in the following graph:</p>
-<figure><img src="plots/Untitled%205.png"/></figure>
-<p>While the frequency remained approximately constant until 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric in recent crawls, but the proxy metric also correlates well with the agg score, with a Pearson correlation of <strong>0.590</strong>. It is therefore possible that synthetic data has positively impacted performance on our selected tasks for these most recent dumps (with all the usual caveats about interpreting a single correlation measurement, without randomization or any causal analysis). In particular, it could explain why the 2023-50 and 2024-10 dumps perform so strongly.</p>
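<p>The correlation itself is a one-line computation (the values below are illustrative placeholders, not the measured data):</p>
<pre><code class="language-python">from scipy.stats import pearsonr

# Per-dump proxy phrase frequency and aggregated benchmark score (illustrative values).
proxy_freq = [0.0010, 0.0011, 0.0010, 0.0018, 0.0025, 0.0031]
agg_score  = [0.40,   0.41,   0.40,   0.43,   0.45,   0.47]

r, p_value = pearsonr(proxy_freq, agg_score)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
</code></pre>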
-<h3>Benchmarks contamination [HAVE TO RECHECK]</h3>
-<p>Additionally, most of the benchmarks we used were introduced around <strong>2019</strong>. It is thus possible that the performance increase from 2019 to 2021-43 was caused by higher benchmark contamination in those crawls. Similarly, the recent increase in LLM popularity and evaluations might have increased benchmark contamination in recent crawls, explaining the score improvements of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
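<p>One standard way to probe this hypothesis, not reported above, is an n-gram overlap check between benchmark samples and crawl documents; a toy sketch:</p>
<pre><code class="language-python">def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of word n-grams; 13-grams are a common choice for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_samples: list[str], n: int = 13) -> bool:
    """Flag a document that shares any n-gram with a benchmark sample."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(sample, n) for sample in benchmark_samples)

bench = ["What is the boiling point of water at sea level in degrees Celsius?"]
doc = "Quiz answers: what is the boiling point of water at sea level in degrees celsius? 100."
print(is_contaminated(doc, bench))  # True
</code></pre>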
-<figure><img src="plots/Untitled%206.png"/></figure>
+<p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
<h2>Next steps</h2>
<p>We want to continue improving FineWeb and will also release a technical report with more details soon.</p>