Commit 09bef6a by omkarenator
1 Parent(s): 0698fac

Move to SPA, smooth scrolling

Files changed (6)
  1. common.py +8 -8
  2. curated.py +4 -24
  3. main.py +34 -92
  4. results.py +4 -4
  5. style.css +4 -0
  6. web.py +5 -5
common.py CHANGED
@@ -299,7 +299,7 @@ global_div = Div(
  ),
  Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
  ),
- id="section1",
+ id="section41",
  ),
  Section(
  H2("Motivation Behind Global Deduplication"),
@@ -331,7 +331,7 @@ global_div = Div(
  P(
  "Additionally, we maintained statistics about each matching clusters as they were formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of the deduplication. We have included the size of results of each stage on disk to give an idea about the scale:"
  ),
- id="section2",
+ id="section42",
  ),
  Section(
  H3("MinHash Generation"),
@@ -339,7 +339,7 @@ global_div = Div(
  "We use the datasketch library to generate MinHash signatures with the number of permutations to 128. Each signature is signature represented as a MinHash object for each document. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating a document signature. The globally-unique document IDs and signatures are then saved to disk. The documented ID is designed by an encoding scheme which converts file names and line numbers (there is one document per line) to unique document IDs. This also helped a lot in saving disk and memory for this stage."
  ),
  P(B("This step produced 20 TB of hashes.")),
- id="section3",
+ id="section43",
  ),
  Section(
  H3("Matching Pairs Generation"),
@@ -351,7 +351,7 @@ global_div = Div(
  ),
  D_code(dask_algo, block="block", language="python"),
  P(B("This step produced 9.2 TB of matching pairs from all bands.")),
- id="section4",
+ id="section44",
  ),
  Section(
  H3("Finding Duplicate Pairs"),
@@ -369,7 +369,7 @@ global_div = Div(
  "The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."
  ),
  P(B("This step produced 1.9 TB of unique pairs.")),
- id="section5",
+ id="section45",
  ),
  Section(
  H3("Finding Connected Components using MapReduce"),
@@ -389,7 +389,7 @@ global_div = Div(
  "Below is the distribution of duplicate documents found across different snapshots of CommonCrawl. The distribution is skewed to the right because the documents are bucketed by the dump ID of the document we retain, and we prefer documents from higher dump IDs."
  ),
  plotly2fasthtml(dup_docs_count_graph()),
- id="section6",
+ id="section46",
  ),
  Section(
  H3("Analysis of Near-Duplicate Clusters"),
@@ -434,7 +434,7 @@ global_div = Div(
  style="list-style-type: none",
  ),
  ),
- id="section7",
+ id="section47",
  ),
  Section(
  H2("Normalization Form C"),
@@ -454,7 +454,7 @@ global_div = Div(
  style="list-style-type: none",
  )
  ), # "background-color= gray" "color= blue" maybe add this later
- id="section8",
+ id="section48",
  ),
  Section(
  H3("NFC Examples"),
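Reviewer note on the "MinHash Generation" paragraph that appears as context in the common.py diff above: the sketch below illustrates the described steps (clean the text, build 13-grams, keep 128 minimum hash values). It is not the project's code. The page says the pipeline uses the datasketch library; this sketch swaps in salted BLAKE2b hashes from the standard library so it runs without dependencies. Word-level 13-grams and all function names here are assumptions.

```python
import hashlib
import re
import string

NUM_PERM = 128  # number of hash functions, mirroring the 128 permutations in the text
NGRAM = 13      # n-gram width from the text; word-level tokens are an assumption


def clean(text):
    """Lowercase, remove punctuation, and collapse spaces, newlines, and tabs."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def ngrams(text, n=NGRAM):
    """Return the set of word n-grams; short texts yield a single feature."""
    tokens = clean(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(feature, seed):
    """64-bit hash of a feature, salted by the seed (stand-in for a permutation)."""
    digest = hashlib.blake2b(
        feature.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """For each salted hash function, keep the minimum hash over all features."""
    features = ngrams(text)
    if not features:
        return [2**64 - 1] * num_perm
    return [min(_hash(f, seed) for f in features) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents that differ only in case or whitespace produce the same signature, because cleaning happens before the n-grams are built.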
curated.py CHANGED
@@ -1554,27 +1554,7 @@ table_html_data_pipe = data_pipeline_table.to_html(index=False, border=0)
  table_div_data_pipe = Div(NotStr(table_html_data_pipe), style="margin: 40px;")


- def update(target: str, request):
- params = request.query_params
- if data_source := params.get(f"data_source_{target}"):
- return get_data(data_source, params.get(f"doc_id_{target}", 3), target)
- if doc_id := params.get(f"doc_id_{target}"):
- return get_data(params.get(f"data_source_{target}"), doc_id, target)
-
-
- def curated(request):
- # Partial Updates
- params = dict(request.query_params)
- if target := params.get("target"):
- if data_source := params.get(f"data_source_{target}"):
- return get_data(
- data_source, params.get(f"doc_id_{target}", 3), params.get("target")
- )
- if doc_id := params.get(f"doc_id_{target}"):
- return get_data(
- params.get(f"data_source_{target}"), doc_id, params.get("target")
- )
-
+ def curated():
  data_preparation_steps = pd.DataFrame(
  {
  "Method": [
@@ -1623,15 +1603,15 @@ def curated(request):
  Section(
  curated_sources_intro,
  plotly2fasthtml(treemap_chart),
- id="section1",
+ id="section31",
  ),
  Section(
  data_preprocessing_div,
- id="section2",
+ id="section32",
  ),
  Section(
  filtering_process,
- id="section3",
+ id="section33",
  ),
  id="inner-text",
  )
main.py CHANGED
@@ -175,9 +175,7 @@ def main():
  Div(
  A(
  "TxT360",
- href="/intro#section1",
- hx_get="/intro#section1",
- hx_target="#inner-text",
+ href="#section1",
  )
  ),
  Div(
@@ -185,25 +183,19 @@ def main():
  Li(
  A(
  "About TxT360",
- href="/intro#section1",
- hx_get="/intro#section1",
- hx_target="#inner-text",
+ href="#section11",
  )
  ),
  Li(
  A(
  "Motivation Behind TxT360",
- href="/intro#section2",
- hx_get="/intro#section2",
- hx_target="#inner-text",
+ href="#section12",
  )
  ),
  Li(
  A(
  "Generalizable Approach to Data Processing",
- href="/intro#section3",
- hx_get="/intro#section3",
- hx_target="#inner-text",
+ href="#section13",
  )
  ),
  ),
@@ -211,9 +203,7 @@ def main():
  Div(
  A(
  "Web Data Processing",
- href="/webdata#section1",
- hx_get="/webdata#section1",
- hx_target="#inner-text",
+ href="#section21",
  )
  ),
  Div(
@@ -221,41 +211,31 @@ def main():
  Li(
  A(
  "Common Crawl Snapshot Processing",
- href="/webdata#section1",
- hx_get="/webdata#section1",
- hx_target="#inner-text",
+ href="#section21",
  )
  ),
  Li(
  A(
  "Common Crawl Data Processing Summary",
- href="/webdata#section2",
- hx_get="/webdata#section2",
- hx_target="#inner-text",
+ href="#section22",
  )
  ),
  Li(
  A(
  "Document Preparation",
- href="/webdata#section3",
- hx_get="/webdata#section3",
- hx_target="#inner-text",
+ href="#section23",
  )
  ),
  Li(
  A(
  "Line-Level Removal",
- href="/webdata#section4",
- hx_get="/webdata#section4",
- hx_target="#inner-text",
+ href="#section24",
  )
  ),
  Li(
  A(
  "Document-Level Filtering",
- href="/webdata#section5",
- hx_get="/webdata#section5",
- hx_target="#inner-text",
+ href="#section25",
  )
  ),
  ),
@@ -263,9 +243,7 @@ def main():
  Div(
  A(
  "Curated Sources Processing",
- href="/curated#section1",
- hx_get="/curated#section1",
- hx_target="#inner-text",
+ href="#section31",
  )
  ),
  Div(
@@ -273,25 +251,19 @@ def main():
  Li(
  A(
  "Curated Sources in TxT360",
- href="/curated#section1",
- hx_get="/curated#section1",
- hx_target="#inner-text",
+ href="#section31",
  )
  ),
  Li(
  A(
  "Filtering Steps and Definitions",
- href="/curated#section2",
- hx_get="/curated#section2",
- hx_target="#inner-text",
+ href="#section32",
  )
  ),
  Li(
  A(
  "Filtering Discussion on All Curated Sources",
- href="/curated#section3",
- hx_get="/curated#section3",
- hx_target="#inner-text",
+ href="#section33",
  )
  ),
  ),
@@ -299,9 +271,7 @@ def main():
  Div(
  A(
  "Shared Processing Steps",
- href="/common#section1",
- hx_get="/common#section1",
- hx_target="#inner-text",
+ href="#section41",
  )
  ),
  Div(
@@ -309,65 +279,49 @@ def main():
  Li(
  A(
  "Overview",
- href="/common#section1",
- hx_get="/common#section1",
- hx_target="#inner-text",
+ href="#section41",
  )
  ),
  Li(
  A(
  "Motivation Behind Global Deduplication",
- href="/common#section2",
- hx_get="/common#section2",
- hx_target="#inner-text",
+ href="#section42",
  )
  ),
  Li(
  A(
  "MinHash Generation",
- href="/common#section3",
- hx_get="/common#section3",
- hx_target="#inner-text",
+ href="#section43",
  )
  ),
  Li(
  A(
  "Matching Pairs Generation",
- href="/common#section4",
- hx_get="/common#section4",
- hx_target="#inner-text",
+ href="#section44",
  )
  ),
  Li(
  A(
  "Finding Duplicate Pairs",
- href="/common#section5",
- hx_get="/common#section5",
- hx_target="#inner-text",
+ href="#section45",
  )
  ),
  Li(
  A(
  "Finding Connected Components using MapReduce",
- href="/common#section6",
- hx_get="/common#section6",
- hx_target="#inner-text",
+ href="#section46",
  )
  ),
  Li(
  A(
  "Personally Identifiable Information Removal",
- href="/common#section7",
- hx_get="/common#section7",
- hx_target="#inner-text",
+ href="#section47",
  )
  ),
  Li(
  A(
  "Normalization Form C",
- href="/common#section8",
- hx_get="/common#section8",
- hx_target="#inner-text",
+ href="#section48",
  )
  ),
  ),
@@ -375,9 +329,7 @@ def main():
  Div(
  A(
  "TxT360 Studies",
- href="/results#section1",
- hx_get="/results#section1",
- hx_target="#inner-text",
+ href="#section51",
  ),
  ),
  Div(
@@ -385,25 +337,19 @@ def main():
  Li(
  A(
  "Overview",
- href="/results#section1",
- hx_get="/results#section1",
- hx_target="#inner-text",
+ href="#section51",
  )
  ),
  Li(
  A(
  "Upsampling Experiment",
- href="/results#section2",
- hx_get="/results#section2",
- hx_target="#inner-text",
+ href="#section52",
  )
  ),
  Li(
  A(
  "Perplexity Analysis",
- href="/results#section3",
- hx_get="/results#section3",
- hx_target="#inner-text",
+ href="#section53",
  )
  ),
  ),
@@ -413,6 +359,10 @@ def main():
  ),
  ),
  intro(),
+ curated.curated(),
+ web.web_data(),
+ common.common_steps(),
+ results.results(),
  ),
  D_appendix(
  D_bibliography(src="bibliography.bib"),
@@ -905,7 +855,7 @@ def intro():
  P(
  "We documented all implementation details in this blog post and are open sourcing the code. Examples of each filter and rationale supporting each decision are included."
  ),
- id="section1",
+ id="section11",
  ),
  Section(
  H2("Motivation Behind TxT360"),
@@ -923,7 +873,7 @@ def intro():
  ),
  # P("Table 2: Basic TxT360 Statistics."),
  # table_div_data,
- id="section2",
+ id="section12",
  ),
  Section(
  H2("Our Generalizable Approach to Data Processing"),
@@ -944,7 +894,7 @@ def intro():
  # P(
  # "Figure 1: Data processing pipeline. All the steps are adopted for processing web data while the yellow blocks are adopted for processing curated sources."
  # ),
- id="section3",
+ id="section13",
  ),
  id="inner-text",
  )
@@ -952,12 +902,4 @@ def intro():

  rt("/update/{target}")(data_viewer.update)

- rt("/curated")(curated.curated)
-
- rt("/webdata")(web.web_data)
-
- rt("/common")(common.common_steps)
-
- rt("/results")(results.results)
-
  serve()
results.py CHANGED
@@ -979,19 +979,19 @@ def results():
  return Div(
  Section(
  intro_div,
- id="section1"
+ id="section51"
  ),
  Section(
  upsampling_exp,
- id="section2"
+ id="section52"
  ),
  Section(
  preplexity_intro_div,
- id="section3"
+ id="section53"
  ),
  Section(
  perp1_div,
- id="section4"
+ id="section54"
  ),
  Section(
  llama_div,
style.css CHANGED
@@ -288,3 +288,7 @@ d-appendix .citation {
  white-space: pre-wrap;
  word-wrap: break-word;
  }
+
+ html {
+ scroll-behavior: smooth;
+ }
web.py CHANGED
@@ -390,7 +390,7 @@ def web_data():
  ),
  P("To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Below is a comprehensive list of datasets we reviewed the comparison of filters we have applied."),
  ),
- id="section1",),
+ id="section21"),
  Section(
  H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
  P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
@@ -402,7 +402,7 @@ def web_data():
  # The sankey diagram of the filtering percentage
  plotly2fasthtml(filtering_sankey_fig),
  P("A significant portion of the documents is filtered after the whole process. This figure illustrates the percentage of documents filtered at each step. The grey bars represent the filtered documents. The statistics are largely consistent with prior work (e.g., RefinedWeb) across most steps, though we have incorporated some custom filtering steps."),
- id="section2",),
+ id="section22",),
  Section(
  H2("Document Preparation"),

@@ -563,7 +563,7 @@ def web_data():
  """,
  ),

- id="section3",),
+ id="section23",),
  Section(
  H2("Line-Level Removal"),
  P("""
@@ -677,7 +677,7 @@ def web_data():
  margin-bottom: 15px
  """,
  ),
- id="section4",),
+ id="section24",),
  Section(
  H2("Document-Level Filtering"),
  P("""
@@ -1748,5 +1748,5 @@ def web_data():
  margin-bottom: 15px
  """,
  ),
- id="section5",),
+ id="section25",),
  )
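
Reviewer note on the renumbered anchors: once every page is rendered into one document, each sidebar href of the form "#sectionNN" needs a matching, unique id. The sketch below checks that for the values visible in this commit. The helper names are hypothetical, and the id list covers only the hunks shown here, so the full files may define ids that this list misses.

```python
# In-page link targets used by the sidebar in main.py after this commit.
NAV_HREFS = [
    "#section1",
    "#section11", "#section12", "#section13",
    "#section21", "#section22", "#section23", "#section24", "#section25",
    "#section31", "#section32", "#section33",
    "#section41", "#section42", "#section43", "#section44",
    "#section45", "#section46", "#section47", "#section48",
    "#section51", "#section52", "#section53",
]

# Section ids that appear as added lines in the hunks of this commit.
SECTION_IDS = [
    "section11", "section12", "section13",
    "section21", "section22", "section23", "section24", "section25",
    "section31", "section32", "section33",
    "section41", "section42", "section43", "section44",
    "section45", "section46", "section47", "section48",
    "section51", "section52", "section53", "section54",
]


def dangling_anchors(hrefs, ids):
    """Return in-page hrefs that have no element id to scroll to."""
    known = set(ids)
    return sorted({h for h in hrefs if h.startswith("#") and h[1:] not in known})


def duplicate_ids(ids):
    """Return ids that occur more than once (a browser would only reach the first)."""
    seen, dupes = set(), set()
    for sid in ids:
        if sid in seen:
            dupes.add(sid)
        seen.add(sid)
    return sorted(dupes)


if __name__ == "__main__":
    print("dangling:", dangling_anchors(NAV_HREFS, SECTION_IDS))
    print("duplicates:", duplicate_ids(SECTION_IDS))
```

With the values listed, the only href without a matching id is "#section1" (the top-level "TxT360" link). That may be worth a second look, since the intro sections were renamed to section11 through section13 in the same commit.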