remove commoncrawl section
index.html CHANGED (+1 -105)
@@ -700,111 +700,7 @@
</ul>
<p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
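<p>A minimal sketch of what this score-threshold filtering looks like (the document structure and the <code>score</code> field are illustrative assumptions, not the actual pipeline code):</p>
<pre><code class="language-python"># Keep a document if the educational classifier scored it at or above the threshold.
def keep(doc: dict, threshold: int = 3) -> bool:
    return doc["score"] >= threshold

docs = [
    {"text": "A lecture on photosynthesis...", "score": 4},
    {"text": "Buy cheap widgets online!", "score": 1},
]
fineweb_edu = [d for d in docs if keep(d, threshold=3)]        # stricter filter
fineweb_edu_large = [d for d in docs if keep(d, threshold=2)]  # keeps more tokens (4.5T per the text)
</code></pre>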
<p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
-<p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
-<h2>Just like fine wine, not all crawls are created equal</h2>
-<p>During our ablation runs, we observed that certain crawls outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B-token runs for each dump (we used the version with base filtering + ind dedup), with 2 trainings per dump, each using a different data subset. We trained 190 such models, totaling over 60k H100 GPU-hours. We subsequently took the last 3 checkpoints of both seeds and plotted the average of these 6 data points per dump.</p>
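<p>A small sketch of how these per-dump averages could be computed (the scores below are made-up illustrative values):</p>
<pre><code class="language-python">import pandas as pd

# One row per evaluated checkpoint: 2 seeds x last 3 checkpoints = 6 rows per dump.
scores = pd.DataFrame({
    "dump": ["2021-43"] * 6 + ["2021-49"] * 6,
    "seed": [0, 0, 0, 1, 1, 1] * 2,
    "agg_score": [0.42, 0.43, 0.44, 0.41, 0.42, 0.43,   # illustrative values
                  0.38, 0.39, 0.38, 0.37, 0.39, 0.38],
})

# One number per dump: the average of its 6 (seed, checkpoint) data points.
per_dump = scores.groupby("dump")["agg_score"].mean()
print(per_dump.sort_values(ascending=False))
</code></pre>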
-<p>The plot below clearly shows that some dumps perform far worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
-<div class="main-plot-container l-page">
-    <figure><img src="plots/score_by_dump.png"/></figure>
-    <div id="plot-score_by_dump"></div>
-</div>
-<p>We identified 5 main relevant time intervals:</p>
-<ul>
-    <li>2013 to 2016: relatively stable, average quality</li>
-    <li>2017 to 2018: high quality, with a drop by the end of 2018</li>
-    <li>2019 to 2021: high quality, steadily increasing</li>
-    <li>2021-49 and 2022: very large drop in performance, followed by worse quality dumps</li>
-    <li>2023 and 2024-10: almost exponential improvement. In particular, 2023-50 and 2024-10 are by far the best dumps</li>
-</ul>
-<p>One possibility to improve performance when training models on less than 15T tokens would be to train on FineWeb while excluding the worst quality CommonCrawl dumps.</p>
-<p>We conducted further analysis to investigate the factors causing these differences from dump to dump. In particular, we considered 3 potential causes:</p>
-<ul>
-    <li>large sudden changes in the list of crawled URLs;</li>
-    <li>synthetic (LLM generated) data;</li>
-    <li>benchmark contamination.</li>
-</ul>
-<p>We go over each one in the following sections.</p>
-<h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
-<p>For each crawl from 2021-10 onwards, we gathered a list of the 60k most frequent <strong>FQDNs</strong> (fully qualified domain names). We then calculated the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a><d-cite bibtex-key="jaccard1912distribution"></d-cite> between consecutive crawls. A high value means that a crawl/dump shares many of its top FQDNs with the dump immediately preceding it, while a small value means that a considerable number of the top 60k FQDNs were downsampled or removed, or, alternatively, that new FQDNs entered the top 60k.</p>
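<p>A sketch of this comparison, assuming we already have each crawl's URL list (the URL data shown is illustrative):</p>
<pre><code class="language-python">from collections import Counter
from urllib.parse import urlsplit

def top_fqdns(urls, k=60_000):
    """Count FQDNs (the host part of each URL) and return the k most frequent ones."""
    counts = Counter(urlsplit(u).hostname for u in urls)
    return {fqdn for fqdn, _ in counts.most_common(k)}

def jaccard(a: set, b: set) -> float:
    """len(A & B) / len(A | B): 1.0 means identical top-FQDN sets, 0.0 means disjoint."""
    return len(a & b) / len(a | b)

# Compare each crawl to the one immediately preceding it.
crawl_urls = {
    "2021-43": ["https://en.wikipedia.org/wiki/Wine", "https://news.example.com/a"],
    "2021-49": ["https://news.example.com/b", "https://blog.example.org/post"],
}
dumps = sorted(crawl_urls)
tops = {d: top_fqdns(crawl_urls[d]) for d in dumps}
for prev, curr in zip(dumps, dumps[1:]):
    print(f"{prev} -> {curr}: {jaccard(tops[prev], tops[curr]):.3f}")
</code></pre>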
-<figure><img src="plots/Untitled%204.png"/></figure>
-<p>The data indicates three significant changes: 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
-<p>The explanation for the changes between 2022-33/2022-40 and 2023-40/2023-50 is straightforward: CommonCrawl accidentally did not index several popular suffixes, such as .co.uk, as documented in <a href="https://commoncrawl.org/errata/co-uk-cctld-not-included">this erratum</a>. This particular change does not seem to be strongly correlated with overall dump quality.</p>
-<p>As to the shift from 2021-43 to 2021-49, which coincides with a sharp performance drop: roughly half (~30k) of the former’s top 60k FQDNs are not present in the latter’s top 60k list, and the dump size itself also decreased (a 19% reduction in WARC size, and a 28% token reduction after deduplication).</p>
-<p>We were unable to find a clear reason for this drastic change, but upon reaching out to CommonCrawl, we were informed that these differences likely stem from a major update to adult content and malicious site blocking. It is therefore possible that the updated adult site filter also removed a large number of high quality domains, resulting in the poor performance of this crawl. <strong>[TODO: change this framing a bit, it seems to suggest adult content is high quality for LLMs]</strong></p>
-<h3>Synthetic data contamination [HAVE TO RECHECK]</h3>
-<p>Secondly, we wondered whether part of the change in performance on recent dumps could be attributed to the presence of a larger quantity of synthetic data (data generated by LLMs). Such a change would not be surprising given the recent surge in popularity of LLMs, notably ChatGPT.</p>
-<p>Since, to the best of our knowledge, there is no foolproof method to detect synthetic data, we opted for a proxy metric: we measured the frequency of the following words and phrases: <code>delve, as a large language model, it's important to note, rich tapestry, intertwined, certainly!, dive into</code>, which are commonly used by ChatGPT.</p>
-<p>It is important to note that not all samples containing one of these phrases were necessarily generated by ChatGPT (and also that many ChatGPT generated samples do not contain any of these phrases), but assuming the amount of synthetic data did not change across dumps, one would expect these frequencies to remain approximately constant over time.</p>
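<p>One possible implementation of this proxy measurement (a simplification; the exact matching used may differ):</p>
<pre><code class="language-python">import re

# Phrases commonly associated with ChatGPT-generated text (from the list above).
PHRASES = ["delve", "as a large language model", "it's important to note",
           "rich tapestry", "intertwined", "certainly!", "dive into"]
PATTERN = re.compile("|".join(re.escape(p) for p in PHRASES), re.IGNORECASE)

def phrase_frequency(documents: list[str]) -> float:
    """Fraction of documents containing at least one of the proxy phrases."""
    hits = sum(1 for doc in documents if PATTERN.search(doc))
    return hits / len(documents)

# Computed per dump; a constant value over time would suggest a constant
# share of synthetic data (the documents here are illustrative).
docs = ["Let's delve into the rich tapestry of...", "Weather today: sunny."]
print(phrase_frequency(docs))
</code></pre>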
-<p>The results are shown in the following graph:</p>
-<figure><img src="plots/Untitled%205.png"/></figure>
-<p>While the frequency remained approximately constant until 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric in recent crawls, but the proxy metric also correlates well with the agg score, with a Pearson correlation of <strong>0.590</strong>. It is therefore possible that synthetic data has positively impacted performance on our selected tasks for these most recent dumps (with all the usual caveats about interpreting a single correlation measurement, without randomization or any causal analysis). In particular, it could explain why the 2023-50 and 2024-10 dumps perform so strongly.</p>
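<p>The correlation itself is a one-line computation (the values below are illustrative placeholders, not the measured data):</p>
<pre><code class="language-python">from scipy.stats import pearsonr

# Per-dump proxy phrase frequency and aggregated benchmark score (illustrative values).
proxy_freq = [0.0010, 0.0011, 0.0010, 0.0018, 0.0025, 0.0031]
agg_score  = [0.40,   0.41,   0.40,   0.43,   0.45,   0.47]

r, p_value = pearsonr(proxy_freq, agg_score)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
</code></pre>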
-<h3>Benchmarks contamination [HAVE TO RECHECK]</h3>
-<p>Additionally, most of the benchmarks we used were introduced around <strong>2019</strong>. It is thus possible that the performance increase from 2019 to 2021-43 was caused by higher benchmark contamination in those crawls. Similarly, the recent increase in LLM popularity and evaluations might have increased benchmark contamination in recent crawls, explaining the score improvements of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
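<p>One standard way to probe this hypothesis, not reported above, is an n-gram overlap check between benchmark samples and crawl documents; a toy sketch:</p>
<pre><code class="language-python">def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of word n-grams; 13-grams are a common choice for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_samples: list[str], n: int = 13) -> bool:
    """Flag a document that shares any n-gram with a benchmark sample."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(sample, n) for sample in benchmark_samples)

bench = ["What is the boiling point of water at sea level in degrees Celsius?"]
doc = "Quiz answers: what is the boiling point of water at sea level in degrees celsius? 100."
print(is_contaminated(doc, bench))  # True
</code></pre>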
-<figure><img src="plots/Untitled%206.png"/></figure>
+<p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
<h2>Next steps</h2>
<p>We want to continue improving FineWeb and will also release a technical report with more details soon.</p>