hunterhector committed on
Commit
4cc0103
1 Parent(s): a4dc57a

a lot of text fixes.

Files changed (5)
  1. bibliography.bib +9 -0
  2. curated.py +44 -16
  3. main.py +22 -22
  4. results.py +40 -36
  5. web.py +4 -4
bibliography.bib CHANGED
@@ -529,3 +529,12 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
529
  pages={994-998},
530
  keywords={Couplings;Databases;Data mining;Algorithm design and analysis;Social network services;Feature extraction;Cleaning;Transitive Closure;Connected Components;Large Scale Graphs;Hadoop;MapReduce},
531
  doi={10.1109/ICCNC.2014.6785473}}
 
 
 
 
 
 
 
 
 
 
529
  pages={994-998},
530
  keywords={Couplings;Databases;Data mining;Algorithm design and analysis;Social network services;Feature extraction;Cleaning;Transitive Closure;Connected Components;Large Scale Graphs;Hadoop;MapReduce},
531
  doi={10.1109/ICCNC.2014.6785473}}
532
+ @misc{lozhkov2024starcoder2stackv2,
533
+ title={StarCoder 2 and The Stack v2: The Next Generation},
534
+ author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
535
+ year={2024},
536
+ eprint={2402.19173},
537
+ archivePrefix={arXiv},
538
+ primaryClass={cs.SE},
539
+ url={https://arxiv.org/abs/2402.19173},
540
+ }
curated.py CHANGED
@@ -16,7 +16,7 @@ overview = (
16
  H2("Curated Sources Processing"),
17
  H3("What This Section Contains"),
18
  P(
19
- "This section provides a complete discussion on the filtering applied to the 14 curated sources that comprise the non-web data section of TxT360. The section is split into the following topic areas: "
20
  ),
21
  Ul(
22
  Li("Curated Sources Data Processing Summary", style="margin-bottom: 5px"),
@@ -30,18 +30,16 @@ overview = (
30
  )
31
 
32
  curated_sources_intro = Div(
33
- H2("Curated Sources in TxT360"),
34
  P(
35
- "Curated sources comprise high-quality datasets that contain domain-specificity.",
36
- B(
37
- " TxT360 was strongly influenced by The Pile",
38
- D_cite(bibtex_key="thepile"),
39
- " regarding both inclusion of the dataset and filtering techniques.",
40
- ),
41
- " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
42
  ),
43
  P(
44
- "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and books."
45
  ),
46
  )
47
 
@@ -1198,10 +1196,22 @@ filtering_process = Div(
1198
  ". The dataset was parsed using the Story ID. In this dataset each post is a story, and each reply is considered subsequent story. Story IDs were considered between ID 1 to 37500000. The URL for all Story IDs was pinged. If that ID returned an error, the ID was removed. Each request was given a 2 second wait to account for network time.",
1199
  ),
1200
  P(
1201
- "The HackerNews dataset contains a vast amount of stories and is known for lively discussions. Due to the number of replies a story may contain, only longest comment thread for each story was sampled past level 3. All stories included the title (1st level) and all direct replies (2nd level)."
1202
  ),
1203
  P(B("Unique Data Preparation Challenges: ")),
1204
  Ul(
 
 
 
 
 
 
 
 
 
 
 
 
1205
  Li(
1206
  "As discussed above, the comment heirarchies required a thoughful approach to extracting meaningful data. ",
1207
  style="margin-bottom: -3px",
@@ -1279,6 +1289,7 @@ filtering_process = Div(
1279
  "All content was downloaded leading to high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
1280
  ),
1281
  P(B("Unique Data Preparation Challenges: ")),
 
1282
  Ul(
1283
  Li(
1284
  "Consecutive whitespaces and tabs were found. Consecutive Whitespaces and tabes were reduce to one, single whitespace.",
@@ -1296,7 +1307,11 @@ filtering_process = Div(
1296
  "Converted all single new lines to whitespace. If whitespace was found after a new line with no text, the whitespace was removed. All leading and trailing whitespace was removed.",
1297
  style="margin-bottom: -3px",
1298
  ),
1299
- Li("All \f characters were removed.", style="margin-bottom: -3px"),
 
 
 
 
1300
  ),
1301
  P(B("Filters Applied: ")),
1302
  Ul(
@@ -1422,6 +1437,9 @@ filtering_process = Div(
1422
  "Similar to the HackerNews challenges, we had to map comments and sub-comments to the original question.",
1423
  style="margin-bottom: -3px",
1424
  ),
 
 
 
1425
  ),
1426
  P(B("Filters Applied: ")),
1427
  Ul(
@@ -1457,7 +1475,7 @@ filtering_process = Div(
1457
  P(B("Unique Data Preparation Challenges: ")),
1458
  Ul(
1459
  Li(
1460
- "A byte string was included at the beginning of new lines",
1461
  style="margin-bottom: -3px",
1462
  ),
1463
  Li('No space before keyword "Answer:"', style="margin-bottom: -3px"),
@@ -1500,15 +1518,25 @@ filtering_process = Div(
1500
  P(B("Unique Data Preparation Challenges: ")),
1501
  Ul(
1502
  Li(
1503
- "Consecutive whitespaces were found spanning 10+ whitespace entries. These whitespaces were reduce to one, single whitespace.",
 
 
 
 
1504
  style="margin-bottom: -3px",
1505
  ),
1506
  Li(
1507
- "Consecutive new lines were found in some documents. All consecutive news over two were were reduce to two new lines.",
 
 
 
 
 
 
1508
  style="margin-bottom: -3px",
1509
  ),
1510
  Li(
1511
- "Delimiters such as * * * * * * * * ? were found. They were removed and replaced with whitespace.",
1512
  style="margin-bottom: -3px",
1513
  ),
1514
  ),
 
16
  H2("Curated Sources Processing"),
17
  H3("What This Section Contains"),
18
  P(
19
+ "This section provides a complete discussion on the filtering applied to the 14 curated sources that comprise the non-Common Crawl data section of TxT360. The section is split into the following topic areas: "
20
  ),
21
  Ul(
22
  Li("Curated Sources Data Processing Summary", style="margin-bottom: 5px"),
 
30
  )
31
 
32
  curated_sources_intro = Div(
33
+ H2("Domain Specific Curated Sources"),
34
  P(
35
+ "While massive amount of data can be crawled and obtained from the Internet, there are certain sources contain data in additional formats (e.g. PDF documents), or organized and published as official dumps (e.g. Wikipedia). We refer to these sources as curated sources. These dataset often comprises high-quality data that contain domain-specificity, such as academic publications or domain specific discussions. TxT360 was strongly influenced by The Pile",
36
+ D_cite(bibtex_key="thepile"),
37
+ " regarding both inclusion of the dataset and filtering techniques.",
38
+ ),
39
+ P("These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide high quality data. And as mentioned above, they are excluded from the web dataset via URL matching. Details about each of the sources are provided below. ",
 
 
40
  ),
41
  P(
42
+ "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and book3."
43
  ),
44
  )
45
 
 
1196
  ". The dataset was parsed using the Story ID. In this dataset each post is a story, and each reply is considered subsequent story. Story IDs were considered between ID 1 to 37500000. The URL for all Story IDs was pinged. If that ID returned an error, the ID was removed. Each request was given a 2 second wait to account for network time.",
1197
  ),
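A minimal sketch of this ID-validation loop is shown below; the Hacker News Firebase API endpoint and the truncated ID range are assumptions for illustration, not necessarily how the original crawl was performed.

import time
import requests

def collect_valid_story_ids(start=1, end=100):
    """Ping each story ID and keep the ones that resolve (sketch only; the full crawl covered IDs 1 to 37500000)."""
    valid_ids = []
    for story_id in range(start, end + 1):
        url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        try:
            response = requests.get(url, timeout=10)
            if response.ok and response.json() is not None:
                valid_ids.append(story_id)
        except requests.RequestException:
            pass  # treat network errors like missing IDs
        time.sleep(2)  # 2-second wait between requests, as described above
    return valid_ids

print(collect_valid_story_ids(1, 5))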
1198
  P(
1199
+ "The HackerNews dataset contains a vast amount of stories and is known for lively discussions. Due to the number of replies a story may contain, only longest comment thread for each story was sampled past level 3. All stories included the title (1st level) and all direct replies (2nd level). We may consider relax this constrain and extract more data."
1200
  ),
1201
  P(B("Unique Data Preparation Challenges: ")),
1202
  Ul(
1203
+ Li(
1204
+ "The converesation and forum style structure can be a very helpful signal for language model training. During processing the dataset, we try to encode such structure but without introducing too much noise. We choose to use an",
1205
+ D_code("<AUTHOR>", language="html"),
1206
+ " tag to encode the main thread text by the original poster, and use a ",
1207
+ D_code("<COMMENT>", language="html"),
1208
+ " tag to encode the replies. We initially choose ",
1209
+ D_code("<P>", language="html"),
1210
+ " as a tag since it is used by some instruction tuning dataset, but realize the ",
1211
+ D_code("<P>", language="html"),
1212
+ " tag can easily conflict with the original text.",
1213
+ style="margin-bottom: -3px",
1214
+ ),
1215
  Li(
1216
  "As discussed above, the comment heirarchies required a thoughful approach to extracting meaningful data. ",
1217
  style="margin-bottom: -3px",
 
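To make the tag-based encoding described above concrete, here is a minimal sketch; the dictionary field names and the serialize_story helper are hypothetical and do not reflect the actual TxT360 processing code.

def serialize_story(story: dict) -> str:
    """Flatten a HackerNews story and its replies into <AUTHOR>/<COMMENT> tagged text (sketch only)."""
    parts = ["<AUTHOR> " + story["title"] + "\n" + story.get("text", "")]
    for comment in story.get("comments", []):
        parts.append("<COMMENT> " + comment["text"])
    return "\n".join(parts)

example_story = {
    "title": "Show HN: A tiny JSON parser",
    "text": "I wrote a parser in about 100 lines.",
    "comments": [{"text": "Nice work!"}, {"text": "How does it handle malformed input?"}],
}
print(serialize_story(example_story))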
1289
  "All content was downloaded leading to high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
1290
  ),
1291
  P(B("Unique Data Preparation Challenges: ")),
1292
+ P("The Freelaw text uses a lot of whitespaces and newlines to format the document visually. These lines are not necessary for language model learning and sometimes have confusing semantic meanings. We attempt to unify how whitespaces appear in this dataset with the following heuristics."),
1293
  Ul(
1294
  Li(
1295
  "Consecutive whitespaces and tabs were found. Consecutive Whitespaces and tabes were reduce to one, single whitespace.",
 
1307
  "Converted all single new lines to whitespace. If whitespace was found after a new line with no text, the whitespace was removed. All leading and trailing whitespace was removed.",
1308
  style="margin-bottom: -3px",
1309
  ),
1310
+ Li(
1311
+ "All form feed (",
1312
+ D_code("\\f", language="bash"),
1313
+ ")characters were removed.", style="margin-bottom: -3px"
1314
+ ),
1315
  ),
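A minimal sketch of the whitespace heuristics listed above, using standard regular expressions; the exact rules and their ordering in the production pipeline may differ.

import re

def normalize_freelaw_text(text: str) -> str:
    """Approximate the whitespace clean-up described above (sketch only)."""
    text = text.replace("\f", "")                 # drop form-feed characters
    text = re.sub(r"\n[ \t]+", "\n", text)        # drop whitespace that directly follows a newline
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # turn single newlines into spaces
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    return text.strip()                           # trim leading and trailing whitespace

print(normalize_freelaw_text("Case No. 123\f\n   The  court\tfinds as follows.\n"))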
1316
  P(B("Filters Applied: ")),
1317
  Ul(
 
1437
  "Similar to the HackerNews challenges, we had to map comments and sub-comments to the original question.",
1438
  style="margin-bottom: -3px",
1439
  ),
1440
+ Li(
1441
+ "The dataset comes with the usernames of post authors. We attempt to replace them with strings such as <USER1> to remove the PII. This step might also reduce the language model's effort to memorizing the user names."
1442
+ ),
1443
  ),
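A minimal sketch of this username anonymization, assuming a simple per-thread mapping from author names to placeholder tokens; the real implementation and token format may differ.

def anonymize_usernames(posts):
    """Replace author names with stable <USERn> placeholders within one thread (sketch only)."""
    user_ids = {}
    anonymized = []
    for post in posts:
        author = post["author"]
        if author not in user_ids:
            user_ids[author] = f"<USER{len(user_ids) + 1}>"
        text = post["text"].replace(author, user_ids[author])  # also mask in-text mentions
        anonymized.append({"author": user_ids[author], "text": text})
    return anonymized

thread = [
    {"author": "alice42", "text": "How do I parse JSON in Python?"},
    {"author": "bob_the_dev", "text": "@alice42 use the json module."},
]
print(anonymize_usernames(thread))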
1444
  P(B("Filters Applied: ")),
1445
  Ul(
 
1475
  P(B("Unique Data Preparation Challenges: ")),
1476
  Ul(
1477
  Li(
1478
+ "In one of our versions, we save the string as a byte string instead of raw text, introducing addition byte indicators at the string level",
1479
  style="margin-bottom: -3px",
1480
  ),
1481
  Li('No space before keyword "Answer:"', style="margin-bottom: -3px"),
 
1518
  P(B("Unique Data Preparation Challenges: ")),
1519
  Ul(
1520
  Li(
1521
+ "The original books uses a lot of witespaces to format the text, similar to the case of FreeLaw. Sometimes, 10+ consecutive whitespaces were found. These whitespaces were reduce to one, single whitespace.",
1522
+ style="margin-bottom: -3px",
1523
+ ),
1524
+ Li(
1525
+ "For similar reasons, consecutive new lines were found in some documents. All consecutive news over two were were reduce to two new lines.",
1526
  style="margin-bottom: -3px",
1527
  ),
1528
  Li(
1529
+ "The books are formmated with end-of-line hyphenation and break a single words into two lines. Hence a regular word such as ",
1530
+ D_code("text", language="bash"),
1531
+ " could become ",
1532
+ D_code("te-\\nxt", language="bash"),
1533
+ ". We detect the combination of ",
1534
+ D_code("-\\n", language="bash"),
1535
+ " and remove them to the origin word heuristically.",
1536
  style="margin-bottom: -3px",
1537
  ),
1538
  Li(
1539
+ "Text delimiters such as * * * * * * * * were used to indicate structures like sections. We removed such known delimiters and replaced them with proper whitespaces and new lines. For others, we make sure there are no additional leading or trailing whitepsaces.",
1540
  style="margin-bottom: -3px",
1541
  ),
1542
  ),
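A combined sketch of the book clean-up heuristics above; the regular expressions are illustrative approximations rather than the exact rules used in the pipeline.

import re

def clean_book_text(text: str) -> str:
    """Approximate clean-up for PG-19 style book text (sketch only)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split by end-of-line hyphenation
    text = re.sub(r"(\* ?){3,}\*?", "\n", text)   # replace '* * * *'-style section delimiters
    text = re.sub(r"\n{3,}", "\n\n", text)        # cap consecutive newlines at two
    text = re.sub(r"[ \t]{2,}", " ", text)        # collapse long runs of spaces and tabs
    return text.strip()

sample = "The te-\nxt of the    chapter.\n\n\n\n* * * * * *\nA new section begins."
print(clean_book_text(sample))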
main.py CHANGED
@@ -77,7 +77,7 @@ front_matter = {
77
  "affiliationURL": "",
78
  },
79
  {
80
- "author": "An Li",
81
  "authorURL": "https://huggingface.co/an1118",
82
  "affiliation": "UCSD",
83
  "affiliationURL": "",
@@ -216,7 +216,7 @@ def main():
216
  ),
217
  Div(
218
  A(
219
- "Web Data Processing",
220
  href="#section21",
221
  )
222
  ),
@@ -256,7 +256,7 @@ def main():
256
  ),
257
  Div(
258
  A(
259
- "Curated Sources Processing",
260
  href="#section31",
261
  )
262
  ),
@@ -853,9 +853,12 @@ def intro():
853
  return Div(
854
  Section(
855
  H2("About TxT360"),
856
- P( "TL;DR ",
857
- B("We introduce TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and rich metadata stored enables precise control over data distribution. We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics. With the information, TxT360 empowers pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets. In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics and more, in additional to the dataset itself. We hope this can serve as a useful resource for future developers."
858
- )
 
 
 
859
  ),
860
  plotly2fasthtml(all_eval_res_figs["MMLU"]),
861
  P(
@@ -865,52 +868,49 @@ def intro():
865
  D_cite(bibtex_key="c4"),
866
  D_cite(bibtex_key="muennighoff2023scaling"),
867
  D_cite(bibtex_key="dolma"),
868
- ", TxT360 carefully implements data processing steps including extraction, filtering, deduplication, personally identifiable information removal, and other steps.",
869
- ),
870
- P(
871
- "Metadata is stored along the processing stpes, enabling fine-grained control to create data distributions and corpus of desired size. As an example, we present one simple upsampling scheme that takes into account the duplication counts, resulting in a 15~16 trillion token corpus, outperforming FineWeb and our non-upsampling baselines, on diverse evaluations. Unlike DCLM",
872
  D_cite(bibtex_key="dclm"),
873
  "and RedPajama V2,",
874
  D_cite(bibtex_key="redpajama-v2"),
875
- "we present the final deduplicated dataset that is ready to go.",
876
- ),
877
- P(
878
- "In line with our 360° open-source initiative, we’ve documented all implementation details in this blog post and will be open-sourcing the code soon (stay tuned!). We also provide examples of each filter along with the rationale behind every decision, with the goal of informing and inspiring future work."
879
  ),
880
  id="section11",
881
  ),
882
  Section(
883
  H2("Why TxT360"),
884
- H3(
885
- "TxT360 is the first dataset to combine both web and curated data sources commonly used in pretraining."
886
  ),
887
  new_table_div_1,
888
  # table_div_1,
889
  # table_div_2,
890
  P(
891
- "In pretraining, it is common to combine web data and curated sources (cite). Web data is included to provide a vast quantity of long tail and diverse data, while curated datasets are often information rich and provide the 'deep-dive' domain information. Combining both datasets plays a critical role for effective LLM pre-training. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training. See Results section below."
892
  ),
893
  P(
894
- "** TxT360 does not include code. This decision was made due to the perceived low duplication code with other sources."
 
 
895
  ),
896
  # P("Table 2: Basic TxT360 Statistics."),
897
  # table_div_data,
898
  id="section12",
899
  ),
900
  Section(
901
- H2("Our Generalizable Approach to Data Processing"),
902
  P(
903
- "To produce TxT360, a comprehensive and transparent data processing pipeline was designed to account for the nuances of both web and curated datasets. The pipeline presents a unified framework for processing both data types, making it convenient and easily adaptive for users to revise and fine-tune the pipeline for their own use cases."
904
  ),
905
  P(
906
  "Web datasets are inherently noisy and varied. The TxT360 pipeline implements sophisticated filtering and deduplication techniques to clean and remove redundancies while preserving data integrity."
907
  ),
908
  P(
909
- "Curated datasets are typically structured and consistently formatted. TxT360 filters these sources with selective steps to maintain their integrity while providing seamless integration into the larger dataset. Both data source types are globally deduplicated together resulting in 5.7T tokens of high-quality data. The table below shows the source distribution of TxT360 tokens."
 
910
  ),
911
  table_div_data,
912
  P(
913
- "We provide details and context for the choices behind TxT360 in the respective Web Data Processing and Curated Source Processing section. A deep dive describing the deduplication process can be found in the Shared Processing Steps section."
914
  ),
915
  # Img(src="images/pipeline.png", height="300", width="600"),
916
  # P(
 
77
  "affiliationURL": "",
78
  },
79
  {
80
+ "author": "Li An",
81
  "authorURL": "https://huggingface.co/an1118",
82
  "affiliation": "UCSD",
83
  "affiliationURL": "",
 
216
  ),
217
  Div(
218
  A(
219
+ "Common Crawl Data",
220
  href="#section21",
221
  )
222
  ),
 
256
  ),
257
  Div(
258
  A(
259
+ "Curated Sources",
260
  href="#section31",
261
  )
262
  ),
 
853
  return Div(
854
  Section(
855
  H2("About TxT360"),
856
+ P( B("TL;DR "),
857
+ "We introduce ",
858
+ A(B("TxT360 (Trillion eXtracted Text),"), href="https://huggingface.co/datasets/LLM360/TxT360"),
859
+ " the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and rich metadata stored enables precise control over data distribution. We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics. With the information, TxT360 empowers pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets. Our findings highlight the importance of both high-quality data sources and appropriate weighting for optimal blending in LLM training."
860
+ ),
861
+ P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in additional to the dataset itself. We hope this can serve as a useful resource for future developers."
862
  ),
863
  plotly2fasthtml(all_eval_res_figs["MMLU"]),
864
  P(
 
868
  D_cite(bibtex_key="c4"),
869
  D_cite(bibtex_key="muennighoff2023scaling"),
870
  D_cite(bibtex_key="dolma"),
871
+ ", TxT360 carefully implements data processing steps including extraction, filtering, deduplication, personally identifiable information removal, and other steps. Unlike DCLM",
 
 
 
872
  D_cite(bibtex_key="dclm"),
873
  "and RedPajama V2,",
874
  D_cite(bibtex_key="redpajama-v2"),
875
+ "we also hope to provide a dataset at this scale that is ready to go, without requiring futher filtering."
 
 
 
876
  ),
877
  id="section11",
878
  ),
879
  Section(
880
  H2("Why TxT360"),
881
+ P(
882
+ "In this year we have seen excellent datasets released by the community. Among those, most datasets focus on one source (e.g., crawled websites, code bases, papers). However, it is not trivial to combine these sources together due to the potential duplicaiton across them. TxT360 is the first dataset to combine most of sources commonly used in pretraining."
883
  ),
884
  new_table_div_1,
885
  # table_div_1,
886
  # table_div_2,
887
  P(
888
+ "In LLM pretraining, it is common to combine all possible text sources due to the Scaling Law. Crawled web pages are included to provide a vast quantity of data which can cover long tail and diverse information, while curated datasets such as Wikipedia are also used, which often provide the 'deep-dive' domain information. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training."
889
  ),
890
  P(
891
+ "** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic requiring to build those datasets. We leave those work to future work and recommend users refer to existing projects such as Stack V2",
892
+ D_cite(bibtex_key="lozhkov2024starcoder2stackv2"),
893
+ ".",
894
  ),
895
  # P("Table 2: Basic TxT360 Statistics."),
896
  # table_div_data,
897
  id="section12",
898
  ),
899
  Section(
900
+ H2("Our Approach"),
901
  P(
902
+ "To produce TxT360, a comprehensive data processing pipeline was designed to account for the nuances of both web and curated datasets. The pipeline presents a unified framework for processing both data types, making it convenient and easily adaptive for users to revise and fine-tune the pipeline for their own use cases."
903
  ),
904
  P(
905
  "Web datasets are inherently noisy and varied. The TxT360 pipeline implements sophisticated filtering and deduplication techniques to clean and remove redundancies while preserving data integrity."
906
  ),
907
  P(
908
+ "Curated datasets are typically structured and consistently formatted, but also can cause troubles with their own special formatting preferences. TxT360 filters these sources with selective steps to maintain their integrity while providing seamless integration into the larger dataset. Both data source types are globally deduplicated together resulting in ~5T tokens of high-quality data. The table below shows the source distribution of TxT360 tokens. ",
909
+ B("Note that we do not recommend to use the raw distribution of the deduplicated dataset, a simple recipe is provided in the studies section."),
910
  ),
911
  table_div_data,
912
  P(
913
+ "We provide details and context for the choices behind TxT360 in the respective Common Crawl Data Processing and Curated Source Processing section. A deep dive describing the deduplication process can be found in the Shared Processing Steps section."
914
  ),
915
  # Img(src="images/pipeline.png", height="300", width="600"),
916
  # P(
results.py CHANGED
@@ -144,7 +144,7 @@ for bucket, perplexities in data.items():
144
 
145
  # Update layout
146
  fig22.update_layout(
147
- title="Perplexity Across Different Years (Global)",
148
  xaxis_title="Year",
149
  yaxis_title="Average Perplexity",
150
  legend_title="Bucket (duplicate count range)"
@@ -254,7 +254,7 @@ for year, values in data.items():
254
 
255
  # Update layout
256
  fig.update_layout(
257
- title="Perplexity Across Different Dump Duplication Counts (global)",
258
  xaxis_title="Number of Dumps Duplication",
259
  yaxis_title="Average Perplexity",
260
  legend_title="Year"
@@ -296,7 +296,7 @@ for year, values in data.items():
296
 
297
  # Update layout
298
  fig.update_layout(
299
- title="Perplexity Across Different Buckets (local)",
300
  xaxis_title="Bucket (Duplicate Count Range)",
301
  yaxis_title="Average Perplexity",
302
  legend_title="Year"
@@ -403,7 +403,7 @@ for year, values in data.items():
403
 
404
  # Update layout
405
  fig.update_layout(
406
- title="Perplexity Across Different Dump Duplication Counts (local)",
407
  xaxis_title="Number of Dumps Duplication",
408
  yaxis_title="Average Perplexity",
409
  legend_title="Year"
@@ -442,7 +442,7 @@ for year, perplexities in data.items():
442
 
443
  # Update layout
444
  fig.update_layout(
445
- title="Perplexity Across Different Buckets (Global)",
446
  xaxis_title="Bucket (duplicate count range)",
447
  yaxis_title="Average Perplexity",
448
  legend_title="Year"
@@ -477,7 +477,7 @@ for bucket, perplexities in data.items():
477
 
478
  # Update layout
479
  fig.update_layout(
480
- title="Perplexity Across Different Years (Global)",
481
  xaxis_title="Year",
482
  yaxis_title="Average Perplexity",
483
  legend_title="Bucket (duplicate count range)"
@@ -543,7 +543,7 @@ for year, year_data in data.items():
543
 
544
  # Update layout
545
  fig.update_layout(
546
- title="Perplexity Across Different Dump Duplication Counts (Global)",
547
  xaxis_title="Number of Dumps Duplication",
548
  yaxis_title="Average Perplexity",
549
  legend_title="Year"
@@ -611,7 +611,7 @@ for year, year_data in data.items():
611
 
612
  # Update layout
613
  fig.update_layout(
614
- title="Perplexity Across Different Buckets (Local)",
615
  xaxis_title="Bucket (Duplicate Count Range)",
616
  yaxis_title="Average Perplexity",
617
  legend_title="Year"
@@ -675,7 +675,7 @@ for year, year_data in data.items():
675
 
676
  # Update layout
677
  fig.update_layout(
678
- title="Perplexity Across Different Dump Duplication Counts (Local)",
679
  xaxis_title="Number of Dumps Duplication",
680
  yaxis_title="Average Perplexity",
681
  legend_title="Year"
@@ -821,47 +821,48 @@ upsampling_exp = Div(
821
  preplexity_intro_div = Div(
822
  H2("Perplexity Evaluation on Duplicate Data"),
823
  H3("Model based Quality Estimation"),
824
- P("We took one of the model-based data quality evaluation strategies adopted by", A("DataComp-LM",href="https://arxiv.org/abs/2406.11794"), "which used perplexity filtering as a candidate for quality filtering. DataComp-LM followed", A("CCNet’s",href="https://arxiv.org/abs/1911.00359"), "practice to use a 5-gram Kneser-Ney model as implemented in the",A("KenLM",href="https://github.com/kpu/kenlm"), "library for efficient perplexity calculation. Following this practice, we estimated data quality by taking a", A("KenLM model",href="https://huggingface.co/edugp/kenlm"), "trained on English Wikipedia data to compute perplexity on data with different duplication patterns. Lower perplexity is regarded as a signal of higher quality."),
825
  H3("Sampling Strategy"),
826
- P("We started from a processed Common Crawl (CC) ablation dataset divided by the number of duplicates of each document. For each CC dump, we have different buckets each holding chunks of document with different duplicate count ranges (1-1, 2-5, 6-10, 11-100, 101-1000, 1001-30000000). We sampled the first 10k documents from each chunk with their meta data."),
827
  )
828
 
829
 
830
  perp1_div = Div(
 
 
 
 
 
 
 
831
  Section(
832
- H3("Perplexity vs Buckets"),
833
- P("For each bucket, we aggregated all the chunks that belong to a single year and calculated the average perplexity for each (bucket, year) data point."),
834
- #Img(src="images/prep-diff-buckets-global.png", height = "300", width = "600" ),
835
- plotly2fasthtml(Perplexity_Across_Different_Buckets_global_graph),
836
- ),
837
- Section(
838
- H3("Perplexity vs Years"),
839
- P("Taking the same data, we can convert it into a graph indicating the yearly trend. For most buckets, the average perplexity of dumps from more recent years seem to be lower than that of former years."),
840
  #Img(src="images/prep-across-diff-year-global-dup-buckets.png", height = "300", width = "600" ),
841
  plotly2fasthtml(graph2222),
842
 
843
  ),
844
  Section(
845
- H3("Perplexity vs Document Duplication"),
846
- P("We can also break each bucket into distinct document counts. The graph becomes a bit noisy at the end because of insufficient samples with larger duplication counts."),
847
  #Img(src="images/prep-across-diff-docs-dup-count-global.png", height = "300", width = "600" ),
848
  plotly2fasthtml(graph3),
849
  ),
850
  Section(
851
- H3("Perplexity vs Dump Duplication"),
852
- P("We are also interested in how the number of dumps a document is in affect data quality. From the graph below we can see that documents that are duplicated across around 40 - 60 dumps usually have lower perplexity."),
853
  #Img(src="images/prep-across-diff-dump-dup-counts-global.png", height = "300", width = "600" ),
854
  plotly2fasthtml(graph4),
855
  ),
856
  Section(
857
- H3("Perplexity vs Local Buckets"),
858
- P("Previously we have seen that documents in recent dumps tend to have lower perplexity. This might be related to the way how global deduplication was implemented. During global deduplication, we only keep copy in the latest dump. Hence documents that are duplicated across multiple dumps only appear in the latest one. To avoid bias brought by this strategy, we tried to recover the states before the global deduplication by reading the metadata attached with each document."),
859
  #Img(src="images/prep-across-diff-buckets-local.png", height = "300", width = "600" ),
860
  plotly2fasthtml(graph5),
861
  ),
862
  Section(
863
- H3("Perplexity vs Local Dump Duplication"),
864
- P("Following the same practice, we can plot the local version of the graph of average perplexity with respect to dump duplication."),
865
  #Img(src="images/prep-diff-dump-dump-counts-local.png", height = "300", width = "600" ),
866
  plotly2fasthtml(graph6),
867
  ),
@@ -874,27 +875,27 @@ llama_div = Div(
874
  P("For comparison purpose, we run the same perplexity evaluation with llama 3.1 8B model."),
875
  ),
876
  Section(
877
- H3("Perplexity vs Buckets"),
878
  #Img(src="images/perp-across-diff-buckets-global.png", height = "300", width = "600" ),
879
  plotly2fasthtml(llama_graph1),
880
  ),
881
  Section(
882
- H3("Perplexity vs Years"),
883
  #Img(src="images/prep-across-diff-years-global.png", height = "300", width = "600" ),
884
  plotly2fasthtml(llama_graph2),
885
  ),
886
  Section(
887
- H3("Perplexity vs Dump Duplication"),
888
  #Img(src="images/prep-vs-dump-dup-global.png", height = "300", width = "600" ),
889
  plotly2fasthtml(llama_graph4),
890
  ),
891
  Section(
892
- H3("Perplexity vs Local Buckets"),
893
  #Img(src="images/prep-diff-buckets-local.png", height = "300", width = "600" ),
894
  plotly2fasthtml(llama_graph5),
895
  ),
896
  Section(
897
- H3("Perplexity vs Local Dump Duplication"),
898
  #Img(src="images/prep-vs-dump-dup-global.png", height = "300", width = "600" ),
899
  plotly2fasthtml(llama_graph6),
900
  ),
@@ -928,16 +929,19 @@ for title, data in topic_charts:
928
  cluster_div = Div(
929
  Section(
930
  H2("Topic Analysis"),
931
- P("We tried to classify data into topic groups and looked for correlations between topics and statistics of data. Data from different topic groups should manifest different characteristics of distribution, which can give us some insight into the composition of dataset."),
932
  H3("Methodology"),
933
- P("We took the ", A("common crawl", href="https://commoncrawl.org/"), " data and clustered them into 17 topic groups using ", A("BERTopic", href="https://maartengr.github.io/BERTopic/index.html"), ". We collected and aggregated a series of metrics which include quality signals and other useful metadata. For each topic group, we calculated average scores and generated the corresponding bar charts over different metrics for comparison and analysis."),
934
  H3("Cluster Groups"),
935
- P("We grouped data into the following 17 clusters"),
936
  Ul(*(
937
  Li(topic_name, style = "margin-bottom: 5px")
938
  for topic_name in ("Arts", "Business & Economics & Finance", "Culture & Cultural geography", "Daily Life & Home & Lifestyle", "Education", "Entertainment & Travel & Hobby", "Environment", "Food & Drink & Cooking", "Health & Wellness & Medicine", "Law & Justice", "Natural Science & Formal Science & Technology", "Personal Development & Human Resources & Career", "Politics & Government", "Religion & Spirituality", "Shopping & Commodity", "Society & Social Issues & Human Rights", "Sports")
939
  )),
940
- H3("Results Analysis"),
 
 
 
941
  *(
942
  Section(H4(title), plotly2fasthtml(topic_graphs[i]), P(data.get("comment", '')))
943
  for i, (title, data) in enumerate(topic_charts)
 
144
 
145
  # Update layout
146
  fig22.update_layout(
147
+ title="Perplexity Across Different Years",
148
  xaxis_title="Year",
149
  yaxis_title="Average Perplexity",
150
  legend_title="Bucket (duplicate count range)"
 
254
 
255
  # Update layout
256
  fig.update_layout(
257
+ title="Perplexity Across Different Dump Duplication Counts",
258
  xaxis_title="Number of Dumps Duplication",
259
  yaxis_title="Average Perplexity",
260
  legend_title="Year"
 
296
 
297
  # Update layout
298
  fig.update_layout(
299
+ title="Perplexity Across Different Buckets",
300
  xaxis_title="Bucket (Duplicate Count Range)",
301
  yaxis_title="Average Perplexity",
302
  legend_title="Year"
 
403
 
404
  # Update layout
405
  fig.update_layout(
406
+ title="Perplexity Across Different Dump Duplication Counts",
407
  xaxis_title="Number of Dumps Duplication",
408
  yaxis_title="Average Perplexity",
409
  legend_title="Year"
 
442
 
443
  # Update layout
444
  fig.update_layout(
445
+ title="Perplexity Across Different Buckets",
446
  xaxis_title="Bucket (duplicate count range)",
447
  yaxis_title="Average Perplexity",
448
  legend_title="Year"
 
477
 
478
  # Update layout
479
  fig.update_layout(
480
+ title="Perplexity Across Different Years",
481
  xaxis_title="Year",
482
  yaxis_title="Average Perplexity",
483
  legend_title="Bucket (duplicate count range)"
 
543
 
544
  # Update layout
545
  fig.update_layout(
546
+ title="Perplexity Across Different Dump Duplication Counts",
547
  xaxis_title="Number of Dumps Duplication",
548
  yaxis_title="Average Perplexity",
549
  legend_title="Year"
 
611
 
612
  # Update layout
613
  fig.update_layout(
614
+ title="Perplexity Across Different Buckets",
615
  xaxis_title="Bucket (Duplicate Count Range)",
616
  yaxis_title="Average Perplexity",
617
  legend_title="Year"
 
675
 
676
  # Update layout
677
  fig.update_layout(
678
+ title="Perplexity Across Different Dump Duplication Counts",
679
  xaxis_title="Number of Dumps Duplication",
680
  yaxis_title="Average Perplexity",
681
  legend_title="Year"
 
821
  preplexity_intro_div = Div(
822
  H2("Perplexity Evaluation on Duplicate Data"),
823
  H3("Model based Quality Estimation"),
824
+ P("We took one of the model-based data quality evaluation strategies adopted by ", A("DataComp-LM",href="https://arxiv.org/abs/2406.11794"), " which used perplexity filtering as a candidate for quality filtering. The DCLM results show that a simple perplexity filter is still quite strong. DCLM followed ", A("CCNet’s",href="https://arxiv.org/abs/1911.00359"), " practice to use a 5-gram Kneser-Ney model as implemented in the ",A("KenLM",href="https://github.com/kpu/kenlm"), " library for efficient perplexity calculation. In order to gain more insights of our dataset, we also took a ", A("KenLM model",href="https://huggingface.co/edugp/kenlm"), " trained on English Wikipedia data to compute perplexity on data with different duplication patterns, and try to observe how such signals coorelate with the duplication patterns."),
825
  H3("Sampling Strategy"),
826
+ P("We took a early version of the TxT360 Common Crawl (CC) portion, and bucket the documents by the number of duplicates each has. For each CC snapshot, we bucket the documents by their duplicate counts in the following buckets (1, 2-5, 6-10, 11-100, 101-1000, 1001-infinite). We sampled the first 10k documents from each bucket."),
827
  )
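A minimal sketch of the per-document perplexity computation with the KenLM Python bindings; the model path and the naive sentence splitting are placeholders rather than the exact setup used in this study.

import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip

# Placeholder path: the study used a Wikipedia-trained model from huggingface.co/edugp/kenlm.
model = kenlm.Model("en_wikipedia.arpa.bin")

def document_perplexity(text: str) -> float:
    """Average the per-sentence perplexity reported by KenLM (sketch only)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scores = [model.perplexity(s) for s in sentences]
    return sum(scores) / len(scores)

print(document_perplexity("The quick brown fox jumps over the lazy dog. It was very quick."))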
828
 
829
 
830
  perp1_div = Div(
831
+ # this looks basically the same as the figure below, comment it out for now.
832
+ # Section(
833
+ # H3("Perplexity vs Buckets"),
834
+ # P("For each bucket, we aggregated all the chunks that belong to a single year and calculated the average perplexity for each (bucket, year) data point. We observe the perplexity is generally dropping. This could be biased since we always keep the newest document if we find a duplicate."),
835
+ # #Img(src="images/prep-diff-buckets-global.png", height = "300", width = "600" ),
836
+ # plotly2fasthtml(Perplexity_Across_Different_Buckets_global_graph),
837
+ # ),
838
  Section(
839
+ H3("Perplexity vs. Years"),
840
+ P("Taking the same data, we can convert it into a graph indicating the yearly trend. For most buckets, the average perplexity of dumps from more recent years seem to be lower than that of former years. This could be biased since we always keep the newest document if we find a duplicate."),
 
 
 
 
 
 
841
  #Img(src="images/prep-across-diff-year-global-dup-buckets.png", height = "300", width = "600" ),
842
  plotly2fasthtml(graph2222),
843
 
844
  ),
845
  Section(
846
+ H3("Perplexity vs. Document Duplication"),
847
+ P("Instead of bucketing, we also plot the relationship between perplexity versus the number of duplicates directly. The graph becomes a bit noisy at the end because of insufficient samples with larger duplication counts. However, we can observe that there seems to be a lower point at around 10-20 duplicates. To see the results more clearly, we recommend you turn of other years and only look at one year, and zoom in to 0-100 region on the X axis."),
848
  #Img(src="images/prep-across-diff-docs-dup-count-global.png", height = "300", width = "600" ),
849
  plotly2fasthtml(graph3),
850
  ),
851
  Section(
852
+ H3("Perplexity vs. Dump Duplication"),
853
+ P("Fineweb hypothesize that documents appear across multiple snapshots (CC dumps) might be an indicator of quality. Hence, we also plot the perplexity versus the number of times a document appear in different snapshots. From the graph below we can see that documents that are duplicated across around 40 - 60 snapshots usually have lower perplexity."),
854
  #Img(src="images/prep-across-diff-dump-dup-counts-global.png", height = "300", width = "600" ),
855
  plotly2fasthtml(graph4),
856
  ),
857
  Section(
858
+ H3("Perplexity Plots before Global Deduplication"),
859
+ P("Previously we have seen that documents in recent snapshots tend to have lower perplexity. This might be related to the way how global deduplication was implemented. During global deduplication, we only keep copy in the latest dump. Hence documents that are duplicated across multiple dumps only appear in the latest one. To avoid bias brought by this strategy, we tried to recover the states before the global deduplication using the stored metadata (i.e., the locally deduplicted dataset state). This trends are a bit different. In the figure below, we do not observe a clear trend of which year has a higher quality, especially in the 2-10 bucket region."),
860
  #Img(src="images/prep-across-diff-buckets-local.png", height = "300", width = "600" ),
861
  plotly2fasthtml(graph5),
862
  ),
863
  Section(
864
+ H3("Perplexity vs. Dump Duplication before Global Deduplication"),
865
+ P("Following the same practice, we can plot the graph of average perplexity with respect to dump duplication count, before global deduplication. The conclusion is similar, that documents with a dump duplication count around 40-60 have the lower perplexity."),
866
  #Img(src="images/prep-diff-dump-dump-counts-local.png", height = "300", width = "600" ),
867
  plotly2fasthtml(graph6),
868
  ),
 
875
  P("For comparison purpose, we run the same perplexity evaluation with llama 3.1 8B model."),
876
  ),
877
  Section(
878
+ H3("Perplexity vs. Buckets"),
879
  #Img(src="images/perp-across-diff-buckets-global.png", height = "300", width = "600" ),
880
  plotly2fasthtml(llama_graph1),
881
  ),
882
  Section(
883
+ H3("Perplexity vs. Years"),
884
  #Img(src="images/prep-across-diff-years-global.png", height = "300", width = "600" ),
885
  plotly2fasthtml(llama_graph2),
886
  ),
887
  Section(
888
+ H3("Perplexity vs. Dump Duplication"),
889
  #Img(src="images/prep-vs-dump-dup-global.png", height = "300", width = "600" ),
890
  plotly2fasthtml(llama_graph4),
891
  ),
892
  Section(
893
+ H3("Perplexity vs. Buckets before Global Deduplication"),
894
  #Img(src="images/prep-diff-buckets-local.png", height = "300", width = "600" ),
895
  plotly2fasthtml(llama_graph5),
896
  ),
897
  Section(
898
+ H3("Perplexity vs. Dump Duplication Count before Global Deduplication"),
899
  #Img(src="images/prep-vs-dump-dup-global.png", height = "300", width = "600" ),
900
  plotly2fasthtml(llama_graph6),
901
  ),
 
929
  cluster_div = Div(
930
  Section(
931
  H2("Topic Analysis"),
932
+ P("In order to understand our dataset better, we tried to cluster our data into topic groups and examined for correlations between topics and other attributes of the documents. We suspect documents from different topic groups should manifest different characteristics of distribution, which can give us some insight into the composition of dataset."),
933
  H3("Methodology"),
934
+ P("We took an early version of the LLM360 Common Crawl portion and clustered them into 17 topic groups using ", A("BERTopic", href="https://maartengr.github.io/BERTopic/index.html"), ". We collected and aggregated a series of metrics from the stored metadata. For each topic group, we calculated average scores and generated the corresponding bar charts over different metrics for comparison and analysis."),
935
  H3("Cluster Groups"),
936
+ P("We grouped data into the following 17 clusters. These clusters are obtained by first clustered a seed portion of the dataset into 128 dumps, and then we manually inspect the clusters to combine 17 semantically meaningful ones."),
937
  Ul(*(
938
  Li(topic_name, style = "margin-bottom: 5px")
939
  for topic_name in ("Arts", "Business & Economics & Finance", "Culture & Cultural geography", "Daily Life & Home & Lifestyle", "Education", "Entertainment & Travel & Hobby", "Environment", "Food & Drink & Cooking", "Health & Wellness & Medicine", "Law & Justice", "Natural Science & Formal Science & Technology", "Personal Development & Human Resources & Career", "Politics & Government", "Religion & Spirituality", "Shopping & Commodity", "Society & Social Issues & Human Rights", "Sports")
940
  )),
941
+ H3("Topic vs. Various Metrics"),
942
+ P(
943
+ "In the following section, we plot the cluster against their average score of a particular metric stored in the metadta. We recommend the readers to jump to the ones you are most interested in."
944
+ ),
945
  *(
946
  Section(H4(title), plotly2fasthtml(topic_graphs[i]), P(data.get("comment", '')))
947
  for i, (title, data) in enumerate(topic_charts)
web.py CHANGED
@@ -376,10 +376,10 @@ def web_data():
376
  return Div(
377
  Section(
378
  Div(
379
- H1("Web Data Processing"),
380
- H2("Common Crawl Snapshot Processing"),
381
- H3("What This Section Contains"),
382
- P("This section provides a complete discussion on the filtering applied to the 99 Common Crawl snapshots that comprise the web data section of TxT360. The section is split into the following topic areas: "),
383
  Ul(
384
  Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
385
  Li("Document Preparation", style = "margin-bottom: 5px"),
 
376
  return Div(
377
  Section(
378
  Div(
379
+ # H1("Web Data Processing"),
380
+ H2("Common Crawl Snapshot Processing"),
381
+ H3("What This Section Contains"),
382
+ P("This section provides a complete discussion on the filtering applied to the 99 Common Crawl snapshots that comprise the web data section of TxT360. The section is split into the following topic areas: "),
383
  Ul(
384
  Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
385
  Li("Document Preparation", style = "margin-bottom: 5px"),