victormiller committed on
Commit
073687e
1 Parent(s): 1e4c4a2

Update web.py

Files changed (1)
  1. web.py +9 -12
web.py CHANGED
@@ -442,15 +442,13 @@ def web_data():
  After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
  This step removes over 60% of the whole data.
  """),
- Details(
- Summary("Sample documents that are classified as non-English"),
- DV("data/sample_non_en.json", 3),
- ),
 
- Details(
- Summary("Sample documents that are classified as English but with score less than 0.65"),
- DV("data/sample_en_low.json", 3),
- ),
+
+ DV("data/sample_non_en.json", 3, "Sample documents that are classified as non-English"),
+
+
+ DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
+
 
  H4("1.3 URL Filtering"),
  P("""
@@ -483,10 +481,9 @@ def web_data():
  "curated url domains that are excluded from our dataset",
  ),
 
- Details(
- Summary("Sample documents whose urls are in our curated url domain list"),
- DV("data/sample_url_exclusion.json", 0),
- ),
+
+ DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+
  H3("2. Line-Level Removal"),
  P("""
  Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
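The hunks in this commit replace each `Details(Summary(...), DV(...))` wrapper with a single `DV(path, index, title)` call, which suggests that `DV` now builds the collapsible section itself. The diff does not show the definition of `DV`, so the following is only a sketch of how such a helper might behave. The name `dv_html`, its parameters, and the choice to take already-loaded records (rather than a JSON path) and return an HTML string are all assumptions made for illustration.

```python
import html
import json


def dv_html(records, index=0, title=None):
    """Hypothetical stand-in for DV: render one record as HTML.

    records: list of JSON-serializable documents (assumed already loaded).
    index:   which record to display.
    title:   optional summary text; when given, the record is wrapped
             in a collapsible <details> element.
    """
    # Pretty-print the selected record and escape it for safe embedding.
    body = "<pre>" + html.escape(json.dumps(records[index], indent=2)) + "</pre>"
    if title is None:
        # No title: return the bare viewer, as the old two-argument call did.
        return body
    # Title given: the helper adds the Details/Summary wrapper itself.
    return f"<details><summary>{html.escape(title)}</summary>{body}</details>"


if __name__ == "__main__":
    sample = [{"url": "example.com", "text": "hello"}]
    print(dv_html(sample, 0, "Sample documents that are classified as non-English"))
```

Folding the wrapper into the helper keeps each call site to one line, which is consistent with the +9/-12 line count of this commit.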