victormiller commited on
Commit
c790d40
·
verified ·
1 Parent(s): f5cf42d

Update common.py

Files changed (1)
  1. common.py +41 -23
common.py CHANGED
@@ -244,41 +244,55 @@ table_div_pii = Div(NotStr(table_html_pii), style="margin: 40px;")
244
 
245
  global_div = Div(
246
  Section(
247
- H2("Global Steps"),
248
- P("Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset."),
249
  ),
250
  Section(
251
- H2("Deduplication"),
252
- H3("Why do we need deduplication?"),
253
- P("Deduplication is beneficial for LM pretraining in several ways, the most obvious being the reduction of training data. With less training data, the model requires shorter training times to achieve the same or even better accuracy. Deduplication also helps avoid train-test overlap, thereby improving evaluation metrics. Additionally, it reduces the risk of memorization [1]. Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training [2]. By implementing deduplication and selective upsampling, we gain control over the pretraining data distribution, rather than relying on the inherent distribution of the source, which is often the internet."),
254
- P("To illustrate this, below is the distribution of near-duplicate clusters, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100, as found in the Common Crawl dataset. Some clusters even reach up to a million documents."),
255
  plotly2fasthtml(dup_cluster_graph),
256
  Img(src="images/100k.png", height = "300", width = "600" ),
257
- P("We started deduplication with 61.8 TB of high-quality, filtered, and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."),
258
- P("For the global near-deduplication, we employed a methodology used by prior works like SlimPajama [3] but scaled it to the entire dataset which includes 87 Common Crawl dumps (also called “crawls”) and the curated data. This near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion. We choose a curated document over a Common Crawl document and later in time dump than an earlier dump when we choose the one document we keep between the matching cluster. Additionally, we maintained statistics about these matching clusters as they were formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the size of results of each stage on disk to give an idea about the scale:"),
259
  ),
260
  Section(
261
- H3("MinHash Generation"),
262
- P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage."),
263
  P(B("This step produced 20 TB of hashes.")),
264
  ),
265
  Section(
266
- H3("Matching Pairs Generation"),
267
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
268
- P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
269
  D_code(dask_algo, block="block", language="python"),
270
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
271
  ),
272
  Section(
273
- H3("Finding Duplicate Pairs"),
274
  P("Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands."),
275
- P("To address this, we use a Bloom filter with a capacity of 64 billion and a false positive rate of 0.001 to remove duplicates. One way we parallelize the Bloom filter execution is by partitioning pairs horizontally and running one filter per partition, as shown in the table below. There is a high chance that duplicates from different bands will have the same pairs in the same horizontal partition. This step reduces the number of pairs by nearly ninefold."),
 
276
  table_div_bloom_examples,
277
  P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."),
278
  P(B("This step produced 1.9 TB of unique pairs.")),
279
  ),
280
  Section(
281
- H3("Finding Connected Components using MapReduce"),
282
  Img(src="images/cc.png", height = "300", width = "600" ),
283
  P("The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."),
284
  P("We needed to partition the duplicate pairs generated in the third stage into three groups to reduce memory pressure on the final stage. We observed that the second stage itself generates partial components which have some overlap. These overlapping clusters cause some documents to appear in the delete set multiple times. However, our deletion code handled this overlap."),
@@ -287,17 +301,17 @@ global_div = Div(
287
  ),
288
  Section(
289
  H3("Analysis of Near-Duplicate Clusters"),
290
- P("Smaller components tend to have more overlap in their MinHash bands. Smallest components which are basically pairs have exact duplicate documents. The ones that the local exact deduplication missed."),
291
  Img(src="images/image3.png", height = "300", width = "600" ),
292
- P("From 3 or more documents per cluster you see changes in the text that are incremental. For example below there is a growing list of personnel over the years."),
293
  Img(src="images/image7.png", height = "300", width = "600" ),
294
  P("In sizable clusters comprising 1000 or more documents, we observe a trend towards templatization. This involves the recurrent use of standardized language to convey general topics such as terms and conditions, warnings, and disclaimers. Such language is prevalent on commercial websites, offering a consistent and efficient way to communicate commonly encountered information."),
295
  Img(src="images/image9.png", height = "300", width = "600" ),
296
  ),
297
  Section(
298
- H2("PII Removal"),
299
- H3("Motivation Behind PII Removal"),
300
- P("PII refers to any information that can be used to identify an individual, such as names, addresses, phone numbers, email addresses, and social security numbers. PII removal is essential for data privacy and security, as well as for compliance with global regulations. By removing PII from the training data, we can reduce the risk of data breaches and unauthorized access to sensitive information. Additionally, models can also generate PII during inference time."),
301
  table_div_pii,
302
  ),
303
  Section(
@@ -307,9 +321,9 @@ global_div = Div(
307
  Ul(Li("Email:"), Li(D_code(email_code, block="block", language="python"), style="list-style-type: none"), Li("IP Address:"), Li(D_code(ip_address_code, block="block", language="python"), style="list-style-type: none")),
308
  ),
309
  Section(
310
- H2("Normalization Form C (NFC)"),
311
- H3("NFC Defined"),
312
- P("NFC (Normalization Form C) is a Unicode normalization form that combines characters with diacritics into a single code point. This is important for text processing tasks as it ensures that the text is consistently represented across different languages and scripts. By normalizing the text to NFC, we can avoid issues related to character encoding, such as duplicate tokens and incorrect tokenization."),
313
  ),
314
  Section(
315
  H3("NFC Implementation"),
@@ -320,6 +334,10 @@ global_div = Div(
320
  H3("NFC Examples"),
321
  table_div_nfc_examples,
322
  ),
323
  )
324
 
325
 
 
244
 
245
  global_div = Div(
246
  Section(
247
+ H2("Overview"),
248
+ H3("What This Section Contains"),
249
+ P("This section discusses all details related to the deduplication and filtering steps that were uniformly applied to all data. The section is split into the following topic areas: "),
250
+ Ul(
251
+ Li("Motivation Behind Global Deduplication", style = "margin-bottom: 5px"),
252
+ Li("TxT360 Deduplication Process and Implementation", style = "margin-bottom: 5px"),
253
+ Li("Personally Identifiable Information Removal", style = "margin-bottom: 5px"),
254
+ Li("Normalization Form C Discussion", style = "margin-bottom: 5px"),
255
+ ),
256
  ),
257
  Section(
258
+ H2("Motivation Behind Global Deduplication"),
259
+ P("Deduplication is beneficial for LM pretraining in several ways, with the most important being controllable upsampling. With unique data, teams gain fine-grained control over the training data. Other benefits of deduplication include avoiding train-test overlap, which prevents evaluation contamination."),
260
+ P("Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training [2]. Deduplication also reduces the risk of memorization [1]. By implementing deduplication and selective upsampling, we gain control over the pretraining data distribution, rather than relying on the inherent distribution of the source."),
261
+ P("To illustrate the need for deduplication, below is the distribution of near-duplicate clusters, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100, as found in the Common Crawl dataset. Some clusters even reach up to a million documents."),
262
  plotly2fasthtml(dup_cluster_graph),
263
  Img(src="images/100k.png", height = "300", width = "600" ),
264
+ P("We started deduplication with 61.8 TB of filtered and compressed documents. The initial dataset had roughly 48.83 billion documents. First, we performed exact deduplication using a Bloom filter with a capacity of 1 billion and a false positive rate of 0.001. This reduced the documents from 48.83 billion to 40.21 billion, removing about 17% as exact duplicates. This step used constant memory for the Bloom filter and lessened the workload for subsequent near-deduplication."),
265
+ P("For the global near-deduplication, we employed a methodology used by prior works like SlimPajama [3] but scaled it to the entire dataset which includes 99 Common Crawl dumps (also called “crawls”) and the curated data. The near-deduplication process involved generating signatures for every document, matching these signatures to identify near-duplicates, and then clustering the near-duplicate documents to select all but one for deletion."),
266
+ P("We applied the following inclusion criteria for all documents:"),
267
+ Ul(
268
+ Li("Curated Document > Common Crawl Document", style = "margin-bottom: 5px"),
269
+ Li("Most Recent > Less Recent", style = "margin-bottom: 5px"),
270
+ ),
271
+ P("Additionally, we maintained statistics about each matching cluster as it was formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the on-disk size of each stage's results to give an idea of the scale:"),
272
  ),
273
  Section(
274
+ H3("Stage 1: MinHash Generation"),
275
+ P("We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each document's signature is represented as a MinHash object. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating the document signature. The globally unique document IDs and signatures are then saved to disk. Each document ID is produced by an encoding scheme that converts file names and line numbers (there is one document per line) to unique document IDs. This encoding also saved considerable disk space and memory in this stage."),
276
  P(B("This step produced 20 TB of hashes.")),
277
  ),
278
  Section(
279
+ H3("Stage 2: Matching Pairs Generation"),
280
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
281
+ P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. The group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
282
  D_code(dask_algo, block="block", language="python"),
283
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
284
  ),
285
  Section(
286
+ H3("Stage 3: Finding Duplicate Pairs"),
287
  P("Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands."),
288
+ P("To address this, we use a Bloom filter with a capacity of 64 billion and a false positive rate of 0.001 to remove duplicates. We parallelize the Bloom filter execution by partitioning pairs horizontally and running one filter per partition, as shown in the table below. Note: with the Bloom filter parallelized, this step completed in ~5 days, versus ~25 days had the filter run serially."),
289
+ P("There is a high chance that duplicates from different bands will have the same pairs in the same horizontal partition. Performing the Bloom filter step reduces the number of pairs by nearly ninefold."),
290
  table_div_bloom_examples,
291
  P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."),
292
  P(B("This step produced 1.9 TB of unique pairs.")),
293
  ),
294
  Section(
295
+ H3("Stage 4: Finding Connected Components using MapReduce"),
296
  Img(src="images/cc.png", height = "300", width = "600" ),
297
  P("The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."),
298
  P("We needed to partition the duplicate pairs generated in the third stage into three groups to reduce memory pressure on the final stage. We observed that the second stage itself generates partial components which have some overlap. These overlapping clusters cause some documents to appear in the delete set multiple times. However, our deletion code handled this overlap."),
 
301
  ),
302
  Section(
303
  H3("Analysis of Near-Duplicate Clusters"),
304
+ P("Smaller components tend to have more overlap in their MinHash bands. The smallest components are pairs of almost-exact duplicates that, because of small differences, were not caught by the local exact deduplication."),
305
  Img(src="images/image3.png", height = "300", width = "600" ),
306
+ P("In clusters of 3 or more documents, the changes in text are incremental. The example below shows a personnel list that has grown over the years."),
307
  Img(src="images/image7.png", height = "300", width = "600" ),
308
  P("In sizable clusters comprising 1000 or more documents, we observe a trend towards templatization. This involves the recurrent use of standardized language to convey general topics such as terms and conditions, warnings, and disclaimers. Such language is prevalent on commercial websites, offering a consistent and efficient way to communicate commonly encountered information."),
309
  Img(src="images/image9.png", height = "300", width = "600" ),
310
  ),
311
  Section(
312
+ H2("Personally Identifiable Information Removal"),
313
+ H3("Motivation Behind Personally Identifiable Information Removal"),
314
+ P("Personally Identifiable Information (PII) refers to any information that can be used to identify an individual, such as names, addresses, phone numbers, email addresses, and social security numbers. PII removal is essential for data privacy and security, as well as for compliance with global regulations. By removing PII from the training data, we can reduce the risk of data breaches and unauthorized access to sensitive information. Additionally, removing PII from training data prevents models from generating that specific PII during inference time."),
315
  table_div_pii,
316
  ),
317
  Section(
 
321
  Ul(Li("Email:"), Li(D_code(email_code, block="block", language="python"), style="list-style-type: none"), Li("IP Address:"), Li(D_code(ip_address_code, block="block", language="python"), style="list-style-type: none")),
322
  ),
323
  Section(
324
+ H2("Normalization Form C"),
325
+ H3("Normalization Form C Defined"),
326
+ P("Normalization Form C (NFC) is a Unicode normalization form that combines characters with diacritics into a single code point. This is important for text processing tasks as it ensures that the text is consistently represented across different languages and scripts. By normalizing the text to NFC, we can avoid issues related to character encoding, such as duplicate tokens and incorrect tokenization."),
327
  ),
328
  Section(
329
  H3("NFC Implementation"),
 
334
  H3("NFC Examples"),
335
  table_div_nfc_examples,
336
  ),
337
+ Section(
338
+ H3("Conclusion"),
339
+ P("NEED TO UPDATE")
340
+ ),
341
  )
342
 
343
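
Note on Stage 2 (Matching Pairs Generation): the text describes splitting each MinHash signature into 9 bands of 13 hashes and matching documents that agree on any band. The sketch below is a minimal in-memory illustration of that banding idea, not the Dask expression from the commit. The names `match_probability`, `band_keys`, and `candidate_pairs` are hypothetical, and the signatures are assumed to be plain lists of integers rather than datasketch MinHash objects.

```python
from collections import defaultdict
from itertools import combinations

NUM_BANDS = 9    # number of bands, as stated in the text
BAND_SIZE = 13   # hashes per band (the "range"), as stated in the text


def match_probability(jaccard, bands=NUM_BANDS, rows=BAND_SIZE):
    """Standard LSH estimate: chance that two documents with the given
    Jaccard similarity agree on every hash of at least one band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands


def band_keys(signature, bands=NUM_BANDS, rows=BAND_SIZE):
    """Split a signature into (band index, tuple of hashes) keys."""
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def candidate_pairs(signatures):
    """Group document ids by band key and emit every pair sharing a key.

    Assumption: everything fits in memory; the pipeline in the text
    instead stores each band on disk and processes bands one at a time.
    """
    buckets = defaultdict(list)
    for doc_id, signature in signatures.items():
        for key in band_keys(signature):
            buckets[key].append(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        for a, b in combinations(sorted(doc_ids), 2):
            pairs.add((a, b))
    return pairs


# Toy signatures: "b" differs from "a" only in the first band, "c" everywhere.
sig_a = list(range(128))
sig_b = [999] + sig_a[1:]
sig_c = list(range(1000, 1128))
pairs = candidate_pairs({"a": sig_a, "b": sig_b, "c": sig_c})
```

Because a pair is emitted once per shared band key before being added to the set, the same pair can come from several bands, which is the duplication that Stage 3 removes.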
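
Note on Stage 3 (Finding Duplicate Pairs): the text removes repeated pairs with a Bloom filter at a 0.001 false positive rate. The sketch below shows the idea with a small pure-Python filter. The `BloomFilter` class, `unique_pairs` function, and the capacity of 1000 are illustrative assumptions and do not reflect the library or partitioning scheme used in the pipeline.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized from a target capacity and error rate."""

    def __init__(self, capacity, error_rate=0.001):
        # Standard sizing formulas for the bit count and number of hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item):
        """Insert item; return True if it was not already (probably) present."""
        is_new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def unique_pairs(pairs, capacity=1000, error_rate=0.001):
    """Yield each pair the first time it is seen, ignoring pair order."""
    seen = BloomFilter(capacity, error_rate)
    for a, b in pairs:
        lo, hi = (a, b) if a <= b else (b, a)
        if seen.add_if_new(f"{lo}\t{hi}"):
            yield (lo, hi)


deduped = list(unique_pairs([(1, 2), (2, 1), (3, 4), (1, 2), (5, 6)]))
```

A false positive here means a genuinely new pair is dropped, so the error rate trades memory against a small loss of matches. Running one filter per partition only removes cross-band duplicates if equal pairs land in the same partition, which is the property the text relies on.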
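
Note on Stage 4 (Finding Connected Components): the text gives the example of pairs (A, B), (B, C), (D, E) merging into components (A, B, C) and (D, E), with every document finally pointing to the smallest document in its cluster. The sketch below reproduces that outcome with a simple single-machine label propagation loop that repeats until nothing changes. It is a simplified stand-in for the distributed MapReduce algorithm [4], and `connected_components` is a hypothetical name.

```python
def connected_components(pairs):
    """Map every document to the smallest document id in its component.

    Assumption: `pairs` is a list that fits in memory; the pipeline in the
    text runs the equivalent iterations as distributed MapReduce rounds.
    """
    label = {}
    for a, b in pairs:
        label.setdefault(a, a)
        label.setdefault(b, b)

    changed = True
    while changed:  # iterate until convergence (no label updates)
        changed = False
        for a, b in pairs:
            smallest = min(label[a], label[b])
            for node in (a, b):
                if label[node] > smallest:
                    label[node] = smallest
                    changed = True
    return label


labels = connected_components([("A", "B"), ("B", "C"), ("D", "E")])
# Documents whose label is another document are the deletion candidates.
to_delete = sorted(doc for doc, root in labels.items() if doc != root)
```

In this sketch the kept document is simply the smallest id. The pipeline described in the text instead prefers curated documents over Common Crawl documents and more recent dumps over older ones.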
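
Note on Normalization Form C: the text states that NFC combines a character and its diacritics into a single code point. The sketch below shows this with Python's standard `unicodedata` module. The `to_nfc` helper is a hypothetical name and is not taken from the NFC implementation in `common.py`.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"         # 'e' followed by a combining acute accent
composed = to_nfc(decomposed)  # single precomposed code point U+00E9
```

Both strings render identically, but only after normalization do they compare equal, which is why unnormalized text can produce duplicate tokens.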