victormiller committed on
Commit c9d881b · verified · 1 Parent(s): 8bb7e3f

Update common.py

Files changed (1)
  1. common.py +3 -3
common.py CHANGED
@@ -265,7 +265,7 @@ global_div = Div(
  H3("Matching Pairs Generation"),
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
  P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
- Pre(Code(dask_algo,)),
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
  ),
  Section(
@@ -303,7 +303,7 @@ global_div = Div(
  H3("Removing PII"),
  P("We have removed two types of PII from the dataset: email address and IP address. Regular expressions are used to identify and replace these PII with a generic placeholder. Below is an example of how we removed email addresses from the dataset:"),
  P("We have used the following regular expressions to identify and replace PII:"),
- Ul(Li("Email:"), Li(email_code, style="list-style-type: none"), Li("IP Address:"), Li(ip_address_code, style="list-style-type: none")),
  ),
  Section(
  H2("Normalization Form C (NFC)"),
@@ -313,7 +313,7 @@ global_div = Div(
  Section(
  H3("NFC Implementation"),
  P("We have used the ftfy library to normalize the text to NFC. The library provides a simple API for normalizing text to NFC, which can be applied to the entire dataset one row at a time. Below is the code snippet about how we normalized text to NFC:"),
- Ul(Li(nfc_code, style="list-style-type: none")), #"background-color= gray" "color= blue" maybe add this later
  ),
  Section(
  H3("NFC Examples"),
 
  H3("Matching Pairs Generation"),
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
  P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
+ D_code(dask_algo, block="block", language="python"),
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
  ),
  Section(
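
Review note on the hunk above: `dask_algo` itself is not part of this diff, so the following is only a rough pure-Python sketch of the band-wise group-by that the surrounding paragraphs describe. It assumes a 117-value signature (9 bands of 13 hashes, as stated in the text). The names `band_keys`, `matching_pairs`, and `match_probability` are hypothetical, a plain dict stands in for the Dask bag, and emitting every pair inside a bucket is an assumption about how grouped components become document pairs.

```python
from collections import defaultdict
from itertools import combinations

NUM_BANDS = 9    # from the text: 9 bands
BAND_SIZE = 13   # from the text: 13 hashes per band


def band_keys(minhashes):
    """Split one MinHash signature into NUM_BANDS hashable tuples."""
    if len(minhashes) != NUM_BANDS * BAND_SIZE:
        raise ValueError("expected 9 * 13 = 117 MinHash values")
    return [
        tuple(minhashes[b * BAND_SIZE:(b + 1) * BAND_SIZE])
        for b in range(NUM_BANDS)
    ]


def matching_pairs(signatures):
    """Group documents by band signature, one band at a time.

    signatures: dict mapping doc_id -> list of 117 MinHash values.
    Returns a set of (doc_a, doc_b) tuples with doc_a < doc_b.
    """
    bands_by_doc = {doc: band_keys(sig) for doc, sig in signatures.items()}
    pairs = set()
    for band in range(NUM_BANDS):
        # The group-by step: documents with an identical band land together.
        buckets = defaultdict(list)
        for doc, keys in bands_by_doc.items():
            buckets[keys[band]].append(doc)
        # Assumption: every pair inside a bucket is emitted as a match.
        for docs in buckets.values():
            pairs.update(combinations(sorted(docs), 2))
    return pairs


def match_probability(similarity, bands=NUM_BANDS, rows=BAND_SIZE):
    """Standard LSH estimate of the chance two documents share a band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Toy data: "a" and "b" differ only in the last band; "c" shares nothing.
sig_a = list(range(117))
sig_b = sig_a[:104] + [-1] * 13
sig_c = list(range(1000, 1117))
pairs = matching_pairs({"a": sig_a, "b": sig_b, "c": sig_c})
```

With the toy data, `pairs` holds only `("a", "b")`, because those two documents agree on eight of the nine bands. Under the usual independence assumption, `match_probability(0.8)` comes out at roughly 0.4, which may be worth keeping in mind when reading the 0.8 threshold in the text.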
 
  H3("Removing PII"),
  P("We have removed two types of PII from the dataset: email address and IP address. Regular expressions are used to identify and replace these PII with a generic placeholder. Below is an example of how we removed email addresses from the dataset:"),
  P("We have used the following regular expressions to identify and replace PII:"),
+ Ul(Li("Email:"), Li(D_code(email_code, block="block", language="python"), style="list-style-type: none"), Li("IP Address:"), Li(D_code(ip_address_code, block="block", language="python"), style="list-style-type: none")),
  ),
  Section(
  H2("Normalization Form C (NFC)"),
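
Review note on the hunk above: the contents of `email_code` and `ip_address_code` are not visible in this diff, so the patterns below are illustrative only and are not the expressions the page renders. This is a minimal sketch of regex-based replacement with a placeholder, assuming IPv4 addresses only. The names `EMAIL_RE`, `IPV4_RE`, `scrub_pii`, and the placeholder strings `<EMAIL>` and `<IP>` are all hypothetical.

```python
import re

# Illustrative email pattern: local part, "@", domain, dot, TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# One IPv4 octet in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four dot-separated octets, not embedded in a longer dotted number.
IPV4_RE = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?!\.?\d)")


def scrub_pii(text, email_placeholder="<EMAIL>", ip_placeholder="<IP>"):
    """Replace email addresses, then IPv4 addresses, with placeholders."""
    text = EMAIL_RE.sub(email_placeholder, text)
    return IPV4_RE.sub(ip_placeholder, text)


cleaned = scrub_pii("Mail jane.doe@example.org from 192.168.1.1.")
```

The lookarounds on `IPV4_RE` are a design choice to avoid rewriting strings such as five-part version numbers; whether the dataset's actual patterns do the same is not something this diff shows.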
 
  Section(
  H3("NFC Implementation"),
  P("We have used the ftfy library to normalize the text to NFC. The library provides a simple API for normalizing text to NFC, which can be applied to the entire dataset one row at a time. Below is the code snippet about how we normalized text to NFC:"),
+ Ul(Li(D_code(nfc_code, block="block", language="python"), style="list-style-type: none")), #"background-color= gray" "color= blue" maybe add this later
  ),
  Section(
  H3("NFC Examples"),
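
Review note on the hunk above: `nfc_code` is not shown in this diff. The text says ftfy was used; as a dependency-free stand-in, the sketch below swaps in the standard library's `unicodedata.normalize`, which performs the NFC composition step only and none of ftfy's other text repairs. The helper name `to_nfc` and the sample rows are hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# Applied one row at a time, as the paragraph describes.
rows = [
    "cafe\u0301",   # 'e' followed by a combining acute accent
    "plain ascii",  # already in NFC, returned unchanged
]
normalized_rows = [to_nfc(row) for row in rows]
```

After normalization the first row uses the single precomposed code point U+00E9, so it is one character shorter than the input while rendering identically.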