fil committed on
Commit a1a75ce · 1 Parent(s): 22c1938

updates to wording, source code page, signature

docs/data/presse.parquet.sh CHANGED
@@ -1,5 +1,3 @@
1
- # file_id, ocr, title, date, author, page_count, word_count, character_count
2
-
3
  echo """
4
  CREATE TABLE presse AS (
5
  SELECT title
 
 
 
1
  echo """
2
  CREATE TABLE presse AS (
3
  SELECT title
docs/index.md CHANGED
@@ -1,10 +1,14 @@
1
- # FPDN exploration
2
 
3
- A new fascinating dataset just dropped on 🤗. [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) references about 3 million newspapers and periodicals with their full text OCR’ed and some meta-data.
4
 
5
- The data is stored in 320 chunks weighting about 700MB each. The data loader for this Observable project uses DuckDB to read these 320 parquet files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents —, into a single parquet file. It takes only about 1 minute to run in a hugging-face Space.
6
 
7
- The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with Observable Plot.
 
 
 
 
8
 
9
  In this project, I’m exploring two aspects of the dataset:
10
 
@@ -39,3 +43,20 @@ Plot.plot({
39
  ],
40
  });
41
  ```
 
1
+ # French public domain newspapers
2
 
3
+ ## A quick glance at 3 million periodicals
4
 
5
+ <p class=signature>by <a href="https://observablehq.com/@fil">Fil</a>
6
 
7
+ A fascinating new dataset just dropped on Hugging Face 🤗: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) references about **3&nbsp;million newspapers and periodicals** with their full OCR’ed text and some metadata.
8
+
9
+ The data is stored in 320 large Parquet files. The data loader for this [Observable Framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (altogether about 200GB) and combines a minimal subset of their metadata (title and year of publication, and most importantly not the text contents) into a single highly optimized Parquet file. This takes only about 1 minute to run in a Hugging Face Space.
10
+
11
+ The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
12
 
13
  In this project, I’m exploring two aspects of the dataset:
14
 
 
43
  ],
44
  });
45
  ```
46
+
47
+ <style>
48
+
49
+ .signature a[href] {
50
+ color: var(--theme-foreground)
51
+ }
52
+
53
+ .signature {
54
+ text-align: right;
55
+ font-size: small;
56
+ }
57
+
58
+ .signature::before {
59
+ content: "◼︎ ";
60
+ }
61
+
62
+ </style>
docs/source.md ADDED
@@ -0,0 +1,25 @@
1
+ # Source code
2
+
3
+ This project relies on a **data loader** that reads all the source files and outputs a single summary file, minimized to contain only a subset of the source information:
4
+
5
+ ```js
6
+ import hljs from "npm:highlight.js";
7
+ ```
8
+
9
+ `data/presse.parquet.sh`
10
+
11
+ ```js
12
+ const pre = display(document.createElement("pre"));
13
+ FileAttachment("data/presse.parquet.sh")
14
+ .text()
15
+ .then(
16
+ (text) => (pre.innerHTML = hljs.highlight(text, { language: "bash" }).value)
17
+ );
18
+ ```
19
+
20
+ The resulting `data/presse.parquet` file is the one that the other pages load with DuckDB, as a FileAttachment:
21
+
22
+ ```js echo run=false
23
+ import { DuckDBClient } from "npm:@observablehq/duckdb";
24
+ const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
25
+ ```
observablehq.config.ts CHANGED
@@ -1,3 +1,5 @@
1
  export default {
2
- header: `<script src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.9/iframeResizer.contentWindow.min.js" async=""></script>`
3
- };
 
 
 
1
  export default {
2
+ title: "FPDN",
3
+ footer: `<script type="module" src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.9/iframeResizer.contentWindow.min.js"></script>`,
4
+ pager: false,
5
+ };