victormiller committed
Commit 0cf4aa2 · verified · 1 Parent(s): 5b6aa1d

Update common.py

Files changed (1)
  1. common.py +50 -1
common.py CHANGED
@@ -1,7 +1,56 @@
  from fasthtml.common import *
  from fasthtml.components import *

+ dedup_intro = P("Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset.")
+ dedup_why_text = P("Deduplication is beneficial for LLM pretraining in several ways, the most obvious being a reduction in the amount of training data. With less training data, the model requires shorter training times to achieve the same or even better accuracy. Deduplication also helps avoid train-test overlap, thereby improving evaluation metrics. Additionally, it reduces the risk of memorization [1]. Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training [2]. By implementing deduplication and selective upsampling, we gain control over the pretraining data distribution, rather than relying on the inherent distribution of the source, which is often the internet.")
+ table1_explan = P("To illustrate this, below is the distribution of near-duplicate clusters found in the Common Crawl dataset, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100. Some clusters even reach sizes of up to a million documents.")
+ placeholder = P("THIS IS A PLACEHOLDER FOR A GRAPH")
+ example1_explan = P("The example below is from one such cluster. Here, most of the text is repeated, with only the specifics changed.")
+ near_dup_img = Img(src="images/100k.png")
+ near_dedup_p1 = P("We started deduplication with 61.8 TB of high-quality, filtered, and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the document count from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication.")
+ near_dedup_p2 = P("For the global near-deduplication, we employed a methodology used by prior works like SlimPajama [3] but scaled it to the entire dataset, which includes 87 Common Crawl dumps (also called “crawls”) and the curated data. This near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion. When choosing the one document to keep from each matching cluster, we prefer a curated document over a Common Crawl document, and a document from a later dump over one from an earlier dump. Additionally, we maintained statistics about these matching clusters as they were formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages, and we have included the on-disk size of each stage's results to give an idea of the scale:")
+ minhash_p1 = P("We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures, along with globally unique document ids, are then saved to disk. We designed a document id encoding scheme that converts file names and line numbers (there is one document per line) into unique document ids; this also saved a significant amount of disk space and memory in this stage. This step produced 20 TB of hashes.")
+ matching_pairs_p1 = P("We use a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk and then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands.")
+ matching_pairs_p2 = P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is then simply a group-by operation on this bag. This approach allows us to group matches efficiently and to distribute the operation across multiple machines. The group-by also produces full components (documents that share the same signature) within a band, which simplifies the later stages. The algorithm can be expressed using the Dask expression below:")
+ dask_algo = P("""dask.bag.from_sequence(doc_file_paths)
+ .map_partitions(stream_docs)
+ .groupby(lambda doc: doc["hash"])
+ .map_partitions(make_doc_pairs)
+ .compute()""")
+ matching_pairs_p3 = P("This step produced 9.2 TB of matching pairs from all bands.")
+ dup_pairs_p1 = P("Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands.")
+ dup_pairs_p2 = P("To address this, we use a Bloom filter with a capacity of 64 billion and a false positive rate of 0.001 to remove duplicates. We parallelize the Bloom filter execution by partitioning the pairs horizontally and running one filter per partition, as shown in the table below. There is a high chance that duplicate pairs produced by different bands will land in the same horizontal partition, where the partition's filter removes them. This step reduces the number of pairs by nearly ninefold.")
+ dup_pairs_p3 = P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches. This step produced 1.9 TB of unique pairs.")
+ mapreduce_img = Img(src="images/cc.png")

  def common_steps():
-     return Div(Section(H2(P("Common Steps")), id="inner-text"))
+     return Div(
+         Section(
+             H2(P("Global Steps")),
+             dedup_intro,
+             H2("Deduplication"),
+             H3("Why do we need deduplication?"),
+             dedup_why_text,
+             table1_explan,
+             placeholder,
+             near_dup_img,
+             near_dedup_p1,
+             near_dedup_p2,
+             H3("MinHash Generation"),
+             minhash_p1,
+             H3("Matching Pairs Generation"),
+             matching_pairs_p1,
+             matching_pairs_p2,
+             dask_algo,
+             matching_pairs_p3,
+             H3("Finding Duplicate Pairs"),
+             dup_pairs_p1,
+             dup_pairs_p2,
+             placeholder,
+             dup_pairs_p3,
+             H3("Finding Connected Components using MapReduce"),
+             mapreduce_img,
+             id="inner-text"
+         )
+     )