Commit 88c0211 by victormiller
1 Parent(s): 6084136

Update web.py

  1. web.py +49 -20
web.py CHANGED
@@ -476,25 +476,38 @@ def web_data():
  P("""
  We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
  """),
+ Details(
+ Summary("6 url domains that are removed from the blocklist"),
+ DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
+ ),
 
- DVS(urls_false_positives, "6 url domains that are removed from the blocklist"),
-
- DV(
+ Details(
+ Summary("Sample documents whose urls are blocked by the refined url blocklist"),
+ DV(
  "data/bad_url_doc.jsonl",
  3,
  "Sample documents whose urls are blocked by the refined url blocklist",
+ ),
  ),
+
  H5("1.3.2 Excluded High Quality Sources"),
  P("""
  To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
  """),
- DVS(
- non_web_urls,
- "curated url domains that are excluded from our dataset",
- ),
 
+ Details(
+ Summary("curated url domains that are excluded from our dataset"),
+ DVS(
+ non_web_urls,
+ "curated url domains that are excluded from our dataset",
+ ),
+ ),
 
- DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+ Details(
+ Summary("Sample documents whose urls are in our curated url domain list"),
+ DV("data/sample_url_exclusion.json", 0, "Sample documents whose urls are in our curated url domain list"),
+ ),
+
 
  H3("2. Line-Level Removal"),
  P("""
@@ -510,11 +523,17 @@ def web_data():
  of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
  documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
  """),
- DV(
+
+ Details(
+ Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
+ DV(
  "data/sample_terminal_punc.json",
  0,
  "Sample documents with lines that are removed by the rule of terminal punctuation",
+ ),
  ),
+
+
  H4('2.1 Word "Javascript"'),
  P("""
  In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
@@ -523,10 +542,13 @@ def web_data():
  propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
  The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
  """),
- DV(
- "data/sample_java.jsonl",
- 0,
- "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+ Details(
+ Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
+ DV(
+ "data/sample_java.jsonl",
+ 0,
+ "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
+ ),
  ),
  H4("2.2 Other Rules from RefinedWeb"),
  P("""
@@ -536,10 +558,13 @@ def web_data():
  - The line matches the pattern “r'^\\d+\\s+likes$'”,
  - The line contains only one word.
  """),
- DV(
- "data/sample_refinedweb_line.json",
- 0,
- "Sample documents with lines that are removed by the RefinedWeb rules",
+ Details(
+ Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
+ DV(
+ "data/sample_refinedweb_line.json",
+ 0,
+ "Sample documents with lines that are removed by the RefinedWeb rules",
+ ),
  ),
  H4("2.3 Toxic Lines"),
  P("""
@@ -549,10 +574,14 @@ def web_data():
  line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
  the bad words from English but also consider the bad words from other languages.
  """),
- DVS(
- json.load(open("data/toxic_lines.json")),
- "Sample documents with toxic lines",
+ Details(
+ Summary("Sample documents with toxic lines"),
+ DVS(
+ json.load(open("data/toxic_lines.json")),
+ "Sample documents with toxic lines",
+ ),
  ),
+
  H3("3. Document-Level Filtering"),
  P("""
  In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
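
Note on the line-level rules quoted in the page text above: the commit itself only wraps the existing viewers in collapsible Details/Summary blocks and does not touch the filtering logic. For readers who want to see what the quoted rules amount to, here is a minimal sketch of the refined "javascript" rule from section 2.1 and the two RefinedWeb conditions visible in section 2.2. It is not the project's implementation. The function names are hypothetical, and the case-insensitive substring matching is an assumption, since the text does not say how keywords are matched.

```python
import re

# Companion keywords listed in the section 2.1 text.
JS_COMPANION_KEYWORDS = ("enable", "disable", "require", "activate", "browser")

# Pattern quoted in the section 2.2 text, unescaped from the Python string literal.
LIKES_PATTERN = re.compile(r"^\d+\s+likes$")


def is_javascript_notice(line: str) -> bool:
    """Return True if the line mentions "javascript" plus a companion keyword.

    Assumption: matching is case-insensitive and substring-based, so
    "required" also counts as a hit for "require".
    """
    lowered = line.lower()
    if "javascript" not in lowered:
        return False
    return any(keyword in lowered for keyword in JS_COMPANION_KEYWORDS)


def matches_shown_refinedweb_rule(line: str) -> bool:
    """Check only the two RefinedWeb conditions visible in this diff."""
    stripped = line.strip()
    is_like_counter = LIKES_PATTERN.match(stripped) is not None
    is_single_word = len(stripped.split()) == 1
    return is_like_counter or is_single_word


def filter_lines(text: str) -> str:
    """Drop every line flagged by either rule and rejoin the rest."""
    kept = [
        line
        for line in text.splitlines()
        if not (is_javascript_notice(line) or matches_shown_refinedweb_rule(line))
    ]
    return "\n".join(kept)
```

With this refinement, a line such as "Please enable JavaScript to continue." is dropped, while "JavaScript closures are useful." is kept. The original C4 rule would have removed both.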