victormiller committed on
Commit c535408 · verified · 1 Parent(s): db08107

Update common.py

Files changed (1)
  1. common.py +13 -6
common.py CHANGED
@@ -299,7 +299,8 @@ global_div = Div(
299
  style="margin-bottom: 5px",
300
  ),
301
  Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
302
- ),
 
303
  ),
304
  Section(
305
  H2("Motivation Behind Global Deduplication"),
@@ -331,16 +332,18 @@ global_div = Div(
331
  P(
332
  "Additionally, we maintained statistics about each matching cluster as it was formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the on-disk size of each stage's results to give an idea of the scale:"
333
  ),
 
334
  ),
335
  Section(
336
- H3("Stage 1: MinHash Generation"),
337
  P(
338
  "We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each document's signature is represented as a MinHash object. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally-unique document IDs and signatures are then saved to disk. The document ID is designed by an encoding scheme which converts file names and line numbers (there is one document per line) to unique document IDs. This also helped a lot in saving disk and memory for this stage."
339
  ),
340
  P(B("This step produced 20 TB of hashes.")),
 
341
  ),
342
  Section(
343
- H3("Stage 2: Matching Pairs Generation"),
344
  P(
345
  "We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."
346
  ),
@@ -349,9 +352,10 @@ global_div = Div(
349
  ),
350
  D_code(dask_algo, block="block", language="python"),
351
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
 
352
  ),
353
  Section(
354
- H3("Stage 3: Finding Duplicate Pairs"),
355
  P(
356
  "Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands."
357
  ),
@@ -366,9 +370,10 @@ global_div = Div(
366
  "The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."
367
  ),
368
  P(B("This step produced 1.9 TB of unique pairs.")),
 
369
  ),
370
  Section(
371
- H3("Stage 4: Finding Connected Components using MapReduce"),
372
  Img(src="images/findcc.svg", style="max-width: 100%;"),
373
  P(
374
  "The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."
@@ -380,6 +385,7 @@ global_div = Div(
380
  "Below is the distribution of duplicate documents found across different dumps of CommonCrawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
381
  ),
382
  plotly2fasthtml(dup_docs_count_graph()),
 
383
  ),
384
  Section(
385
  H3("Analysis of Near-Duplicate Clusters"),
@@ -424,6 +430,7 @@ global_div = Div(
424
  style="list-style-type: none",
425
  ),
426
  ),
 
427
  ),
428
  Section(
429
  H2("Normalization Form C"),
@@ -443,13 +450,13 @@ global_div = Div(
443
  style="list-style-type: none",
444
  )
445
  ), # "background-color= gray" "color= blue" maybe add this later
 
446
  ),
447
  Section(
448
  H3("NFC Examples"),
449
  table_div_nfc_examples,
450
  ),
451
  Section(H3("Conclusion"), P("NEED TO UPDATE")),
452
- id="section1"
453
  )
454
 
455
 
 
299
  style="margin-bottom: 5px",
300
  ),
301
  Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
302
+ ),
303
+ id="section1",
304
  ),
305
  Section(
306
  H2("Motivation Behind Global Deduplication"),
 
332
  P(
333
  "Additionally, we maintained statistics about each matching cluster as it was formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the on-disk size of each stage's results to give an idea of the scale:"
334
  ),
335
+ id="section2",
336
  ),
337
  Section(
338
+ H3("MinHash Generation"),
339
  P(
340
  "We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each document's signature is represented as a MinHash object. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally-unique document IDs and signatures are then saved to disk. The document ID is designed by an encoding scheme which converts file names and line numbers (there is one document per line) to unique document IDs. This also helped a lot in saving disk and memory for this stage."
341
  ),
342
  P(B("This step produced 20 TB of hashes.")),
343
+ id="section3",
344
  ),
345
  Section(
346
+ H3("Matching Pairs Generation"),
347
  P(
348
  "We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."
349
  ),
 
352
  ),
353
  D_code(dask_algo, block="block", language="python"),
354
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
355
+ id="section4",
356
  ),
357
  Section(
358
+ H3("Finding Duplicate Pairs"),
359
  P(
360
  "Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn’t remove duplicates across different bands."
361
  ),
 
370
  "The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."
371
  ),
372
  P(B("This step produced 1.9 TB of unique pairs.")),
373
+ id="section5",
374
  ),
375
  Section(
376
+ H3("Finding Connected Components using MapReduce"),
377
  Img(src="images/findcc.svg", style="max-width: 100%;"),
378
  P(
379
  "The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."
 
385
  "Below is the distribution of duplicate documents found across different dumps of CommonCrawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
386
  ),
387
  plotly2fasthtml(dup_docs_count_graph()),
388
+ id="section6",
389
  ),
390
  Section(
391
  H3("Analysis of Near-Duplicate Clusters"),
 
430
  style="list-style-type: none",
431
  ),
432
  ),
433
+ id="section7",
434
  ),
435
  Section(
436
  H2("Normalization Form C"),
 
450
  style="list-style-type: none",
451
  )
452
  ), # "background-color= gray" "color= blue" maybe add this later
453
+ id="section8",
454
  ),
455
  Section(
456
  H3("NFC Examples"),
457
  table_div_nfc_examples,
458
  ),
459
  Section(H3("Conclusion"), P("NEED TO UPDATE")),
 
460
  )
461
 
462
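
Note on the banding parameters quoted in the "Matching Pairs Generation" text above (9 bands of 13 hashes, Jaccard threshold 0.8): they can be sanity-checked with the standard MinHash LSH collision formula. The following is a minimal sketch and is not code from common.py. The function names band_match_probability and approx_threshold are hypothetical, and the sketch assumes idealized MinHash, where each hash position agrees independently with probability equal to the Jaccard similarity.

```python
def band_match_probability(similarity: float, bands: int = 9, rows: int = 13) -> float:
    """Probability that two documents agree on every hash of at least one band.

    Assumes each of the `rows` hashes in a band matches independently with
    probability `similarity` (the Jaccard similarity of the two documents).
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    # A single band matches only if all of its hashes match.
    p_band = similarity ** rows
    # The pair is emitted if any one of the bands matches.
    return 1.0 - (1.0 - p_band) ** bands


def approx_threshold(bands: int = 9, rows: int = 13) -> float:
    """Rule-of-thumb similarity (1/b)^(1/r) near which the S-curve rises fastest."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    print(f"hashes used: {9 * 13} of 128")
    print(f"approximate threshold: {approx_threshold():.3f}")
    for s in (0.6, 0.7, 0.8, 0.9, 0.95):
        print(f"similarity {s:.2f} -> match probability {band_match_probability(s):.3f}")
```

Under these assumptions, 9 bands of 13 hashes use 117 of the 128 permutations, and the rule-of-thumb threshold comes out at roughly 0.84, in the neighborhood of the 0.8 target stated in the text. The curve is a probability, so pairs slightly above the threshold can still be missed and pairs slightly below it can still be matched.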