victormiller committed on
Commit ad46307
1 Parent(s): 4d5ad99

Update web.py

Files changed (1)
  1. web.py +13 -13
web.py CHANGED
@@ -499,7 +499,7 @@ def web_data():
 
 
  P(B('"Word "Javascript"'), """
- In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
  strict, which will filter out many lines that are really talking about “Javascript”.
  """),
@@ -526,7 +526,7 @@ def web_data():
  """,
  ),
  P(B("Other Rules from RefinedWeb: "), """
- We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
  """),
  Ul(
  Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
@@ -597,19 +597,19 @@ def web_data():
  """,
  ),
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
- Most quality signals were initially introduced by Gopher [2] and subsequently adopted by later
- studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
  outcomes for the same quality signals.
- In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
- and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
  """),
  P(B("Repetition-based Heuristics: "), """
  Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
- work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
  """),
  P(B("Fraction of Characters in Repeated Lines: "), """
- Following Gopher [2], we remove documents containing mupltiple, short duplicate passages, as well as those with few,
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
  that are duplicates, and the fraction of characters contained within those duplicated passages.
  """),
@@ -748,7 +748,7 @@ def web_data():
  """,
  ),
  P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
- Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
  fraction of characters contained within the most frequently-occurring n-gram.
  """),
  Details(
@@ -911,7 +911,7 @@ def web_data():
  """,
  ),
  P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
- Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
  overlapping n-grams more than once.
  """),
@@ -1141,8 +1141,8 @@ def web_data():
  ),
  P(B("Line-wise Heuristics: "), """
  Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
- RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
- works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
  90% of lines start with a bullet point.
  """),
  Details(
@@ -1247,7 +1247,7 @@ def web_data():
  ),
 
  P(B("Statistics-based Heuristics: "), """
- We summarize other statistics-based rules originated from Gopher [7] in this section. The statistics can be used include:
  """),
  Ul(
  Li("the word count in the document", style = "margin-bottom: 5px"),
 
 
 
  P(B('"Word "Javascript"'), """
+ In C4,""", D_cite(bibtex_key="c4"), """the authors remove any line with the word "Javascript" since they found that many of the scraped
  pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
  strict, which will filter out many lines that are really talking about “Javascript”.
  """),
 
  """,
  ),
  P(B("Other Rules from RefinedWeb: "), """
+ We also adopt rules from RefinedWeb """, D_cite(bibtex_key="refinedweb"), """ to remove lines if they satisfy any of the following criteria:
  """),
  Ul(
  Li("The line is only composed of uppercase characters,", style = "margin-bottom: 5px"),
 
  """,
  ),
  P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
+ Most quality signals were initially introduced by Gopher """, D_cite(bibtex_key="gopher"), """ and subsequently adopted by later
+ studies """, D_cite(bibtex_key="refinedweb"),D_cite(bibtex_key="dolma"),D_cite(bibtex_key="fineweb"), """([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
  of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
  outcomes for the same quality signals.
+ In our pipeline, we referenced earlier implementations that were publicly available such as Dolma,""", D_cite(bibtex_key="dolma"), """ DataTrove, """, D_cite(bibtex_key="penedo2024datatrove"), """
+ and RedPajama V2, """, D_cite(bibtex_key="redpajama-v2"), """ and selected the most suitable method based on manual inspections.
  """),
  P(B("Repetition-based Heuristics: "), """
  Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
+ work, """, D_cite(bibtex_key="gopher"), D_cite(bibtex_key="refinedweb"), D_cite(bibtex_key="dolma"), """ we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
  """),
  P(B("Fraction of Characters in Repeated Lines: "), """
+ Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents containing mupltiple, short duplicate passages, as well as those with few,
  but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
  that are duplicates, and the fraction of characters contained within those duplicated passages.
  """),
 
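Note on the "Fraction of Characters in Repeated Lines" text in the hunk above: the two quantities it describes can be illustrated with a minimal sketch. This is not the implementation used in web.py or in any of the cited pipelines. The function name `duplicate_line_fractions` is hypothetical, and the convention that every occurrence after the first counts as a duplicate is an assumption (the page itself notes that implementations differ across pipelines).

```python
from typing import Tuple


def duplicate_line_fractions(text: str) -> Tuple[float, float]:
    """Return (fraction of lines that are duplicates,
    fraction of characters inside those duplicate lines).

    Assumption: a line is a "duplicate" if an identical (stripped)
    line already appeared earlier in the document; empty lines are ignored.
    """
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0, 0.0

    seen = set()
    dup_lines = 0
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            # Second or later occurrence of this line.
            dup_lines += 1
            dup_chars += len(ln)
        else:
            seen.add(ln)

    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars
```

A document would then be dropped when either fraction exceeds a chosen threshold; the thresholds are not shown in this diff and are left out here.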
  """,
  ),
  P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
+ Following Gopher,""", D_cite(bibtex_key="gopher"), """ we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
  fraction of characters contained within the most frequently-occurring n-gram.
  """),
  Details(
 
  """,
  ),
  P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
+ Following Gopher, we remove documents with a high portion of n-grams. For each n ∈ (5, ..., 10), we calculate the
  fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
  overlapping n-grams more than once.
  """),
 
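Note on the "Fraction of Characters in Duplicated N-grams" text in the hunk above: the "do not count overlapping characters twice" requirement can be sketched with a per-word coverage mask. This is an illustrative sketch only, not the code from web.py or the cited pipelines. The function name `duplicate_ngram_char_fraction`, whitespace tokenization, counting only word characters, and treating every occurrence of a repeated n-gram (including the first) as duplicated are all assumptions.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of word characters covered by word n-grams that occur
    more than once, with each word counted at most once."""
    words = text.split()  # assumption: simple whitespace tokenization
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)

    # covered[j] is True if word j lies inside at least one repeated n-gram.
    # Using a mask means overlapping n-grams never double-count a word.
    covered = [False] * len(words)
    for i, gram in enumerate(ngrams):
        if counts[gram] > 1:
            for j in range(i, i + n):
                covered[j] = True

    total_chars = sum(len(w) for w in words)
    dup_chars = sum(len(w) for w, c in zip(words, covered) if c)
    return dup_chars / total_chars
```

The function would be evaluated once per n in the stated range, with a separate threshold for each n; those thresholds are not part of this diff.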
  ),
  P(B("Line-wise Heuristics: "), """
  Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
+ RefinedWeb, we remove the document if the corrected lines represent more than 5% of words. In line with previous
+ works, we remove the documents if more than 30% of the lines end with an ellipsis or more than
  90% of lines start with a bullet point.
  """),
  Details(
 
  ),
 
  P(B("Statistics-based Heuristics: "), """
+ We summarize other statistics-based rules originated from Gopher in this section. The statistics can be used include:
  """),
  Ul(
  Li("the word count in the document", style = "margin-bottom: 5px"),