falca committed on
Commit
7b324fb
•
1 Parent(s): 7506cb1

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -49,7 +49,7 @@ A mix of the following data: Wikipedia, Books, Twitter comments, Pikabu, Proza.r
  1. Calculate shingles with size of 5
  2. Calculate MinHash with 100 seeds → for every sample (text) have a hash of size 100
  3. Split every hash into 10 buckets → every bucket, which contains (100 / 10) = 10 numbers, get hashed into 1 hash → we have 10 hashes for every sample
- 4. For each bucket find duplicates: find samples which have the same hash → calculate pair-wise jaccard distance of similarity → if the similarity is >0.7 than it's a duplicate
+ 4. For each bucket find duplicates: find samples which have the same hash → calculate pair-wise jaccard similarity → if the similarity is >0.7 than it's a duplicate
  5. Gather duplicates from all the buckets and filter
 
  ### Training Hyperparameters
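
Reviewer note: the five deduplication steps touched by this hunk can be sketched roughly as below. This is a minimal illustration, not the authors' pipeline. The README does not say whether shingles are character or word n-grams (characters are assumed here), which hash function is used (seeded `blake2b` is an assumption), or how duplicates are filtered in step 5 (the sketch only returns the duplicate pairs). All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

SHINGLE_SIZE = 5   # step 1: shingle size from the README
NUM_SEEDS = 100    # step 2: number of MinHash seeds
NUM_BUCKETS = 10   # step 3: number of buckets per signature
THRESHOLD = 0.7    # step 4: Jaccard similarity cut-off


def shingles(text, size=SHINGLE_SIZE):
    """Step 1: set of character n-grams (assumed; could also be word n-grams)."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def seeded_hash(value, seed):
    """Deterministic 64-bit hash of a shingle under a given seed (assumed scheme)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set, num_seeds=NUM_SEEDS):
    """Step 2: one minimum per seed, giving a signature of length num_seeds."""
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_seeds)]


def bucket_keys(signature, num_buckets=NUM_BUCKETS):
    """Step 3: split the signature into buckets of 100 / 10 = 10 numbers each."""
    rows = len(signature) // num_buckets
    return [tuple(signature[b * rows:(b + 1) * rows]) for b in range(num_buckets)]


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def find_duplicates(texts):
    """Steps 4 and 5: collect candidate pairs per bucket, keep those above THRESHOLD."""
    shingle_sets = [shingles(t) for t in texts]
    buckets = defaultdict(list)
    for idx, sh in enumerate(shingle_sets):
        for band, key in enumerate(bucket_keys(minhash(sh))):
            # Samples sharing (bucket position, bucket content) become candidates.
            buckets[(band, key)].append(idx)

    pairs = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(shingle_sets[i], shingle_sets[j]) > THRESHOLD:
                pairs.add((i, j))
    return pairs
```

Calling `find_duplicates(list_of_texts)` returns a set of index pairs `(i, j)` with `i < j`. Because candidates are verified with an exact Jaccard computation, bucket collisions alone never mark two samples as duplicates.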