victormiller committed on
Commit
671e619
1 Parent(s): 1e5cb0a

Update common.py

  1. common.py +4 -2
common.py CHANGED
@@ -50,11 +50,13 @@ global_div = Div(
  P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
  P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
  Div(
- Code("""dask.bag.from_sequence(doc_file_paths)
+ Code("""
+ dask.bag.from_sequence(doc_file_paths)
  .map_partitions(stream_docs)
  .groupby(lambda doc: doc["hash"])
  .map_partitions(make_doc_pairs)
- .compute()"""),
+ .compute()
+ """),
  cls="code-block",
  ),
  P("This step produced 9.2 TB of matching pairs from all bands."),
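
The band-level matching described in the context lines of this hunk can be sketched in plain Python, without Dask, as a group-by on the band hash followed by pair generation. This is only an illustration under stated assumptions: the record shape (`{"id", "hash"}`) and the helper names `match_band` and `band_match_probability` are hypothetical, and the commit does not show the real `stream_docs` or `make_doc_pairs`. The probability helper uses the standard LSH banding formula, 1 - (1 - s^r)^b, with the 9 bands and 13 hashes per band quoted in the text.

```python
from collections import defaultdict
from itertools import combinations


def band_match_probability(similarity: float, bands: int = 9, rows: int = 13) -> float:
    """Standard LSH banding estimate: chance that two documents with the
    given Jaccard similarity agree on every hash of at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def match_band(docs):
    """Group the records of ONE band by band hash and emit document pairs.

    `docs` is an iterable of {"id": ..., "hash": ...} records (assumed shape).
    Documents that share a hash form one group; every pair inside a group
    is reported as a candidate near-duplicate.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["hash"]].append(doc["id"])

    pairs = []
    for ids in groups.values():
        # A group of size 1 produces no pairs.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Toy band: documents a, b and d collide; c stands alone.
band = [
    {"id": "a", "hash": 17},
    {"id": "b", "hash": 17},
    {"id": "c", "hash": 42},
    {"id": "d", "hash": 17},
]
pairs = match_band(band)
prob = band_match_probability(0.8)
```

Here `pairs` holds the three pairs formed by `a`, `b` and `d`. Under the banding formula, `prob` comes out at roughly 0.4 for two documents whose Jaccard similarity is exactly 0.8, and it rises steeply as similarity approaches 1. In the Dask version the same grouping is distributed across partitions rather than held in one dictionary.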