updates to wording, source code page, signature

- docs/data/presse.parquet.sh +0 -2
- docs/index.md +25 -4
- docs/source.md +25 -0
- observablehq.config.ts +4 -2
docs/data/presse.parquet.sh (CHANGED)

```diff
@@ -1,5 +1,3 @@
-# file_id, ocr, title, date, author, page_count, word_count, character_count
-
 echo """
 CREATE TABLE presse AS (
   SELECT title
```
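The hunk above shows only the head of the loader script. As a rough sketch of the overall pattern (the source glob, column list, and output handling below are assumptions, since they are not visible in this diff), an Observable Framework data loader in shell typically pipes SQL into DuckDB and writes the resulting file to stdout:

```shell
# Hypothetical sketch of the data-loader pattern; the real script's
# source paths and selected columns are not shown in the diff above.
SQL="
CREATE TABLE presse AS (
  SELECT title, date
  FROM read_parquet('source/*.parquet')
);
COPY presse TO 'presse.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
"

# A Framework data loader emits its output file on stdout;
# only run DuckDB if the CLI is actually installed.
if command -v duckdb >/dev/null 2>&1; then
  echo "$SQL" | duckdb
  cat presse.parquet
fi
```

Writing the summary to a single zstd-compressed parquet file is what keeps the browser payload small, as the index page notes.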
docs/index.md (CHANGED)

````diff
@@ -1,10 +1,14 @@
-#
-
-A
-
-
-
-
+# French public domain newspapers
+
+## A quick glance at 3 million periodicals
+
+<p class=signature>by <a href="https://observablehq.com/@fil">Fil</a>
+
+This fascinating new dataset just dropped on Hugging Face: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗 references about **3 million newspapers and periodicals**, with their full text OCR’ed and some metadata.
+
+The data is stored in 320 large parquet files. The data loader for this [Observable Framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (altogether about 200GB) and combines a minimal subset of their metadata (title and year of publication, and crucially not the text contents) into a single highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
+
+The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
 
 In this project, I’m exploring two aspects of the dataset:
 
@@ -39,3 +43,20 @@ Plot.plot({
   ],
 });
 ```
+
+<style>
+
+.signature a[href] {
+  color: var(--theme-foreground);
+}
+
+.signature {
+  text-align: right;
+  font-size: small;
+}
+
+.signature::before {
+  content: "◼︎ ";
+}
+
+</style>
````
docs/source.md (ADDED)

````diff
@@ -0,0 +1,25 @@
+# Source code
+
+This project relies on a **data loader** that reads all the source files and outputs a single summary file, minimized to contain only a subset of the source information:
+
+```js
+import hljs from "npm:highlight.js";
+```
+
+`data/presse.parquet.sh`
+
+```js
+const pre = display(document.createElement("pre"));
+FileAttachment("data/presse.parquet.sh")
+  .text()
+  .then(
+    (text) => (pre.innerHTML = hljs.highlight(text, { language: "bash" }).value)
+  );
+```
+
+This is the file that the other pages load with DuckDB, as a FileAttachment:
+
+```js echo run=false
+import { DuckDBClient } from "npm:@observablehq/duckdb";
+const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
+```
````
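Outside the browser, the same summary file can be sanity-checked with the DuckDB CLI. A minimal sketch, assuming the CLI is installed and that a `presse.parquet` file sits in the current directory (the Framework's actual cache path will differ):

```shell
# Hypothetical local check of the generated summary file; the file
# location is an assumption, not the Framework's cache layout.
QUERY="SELECT title, count(*) AS n
FROM 'presse.parquet'
GROUP BY title ORDER BY n DESC LIMIT 10"

# Only run if both the CLI and the file are present.
if command -v duckdb >/dev/null 2>&1 && [ -f presse.parquet ]; then
  duckdb -c "$QUERY"
fi
```

DuckDB reads a parquet file directly when given its path as a table name, so no import step is needed.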
observablehq.config.ts (CHANGED)

```diff
@@ -1,3 +1,5 @@
 export default {
-
-
+  title: "FPDN",
+  footer: `<script type="module" src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.9/iframeResizer.contentWindow.min.js"></script>`,
+  pager: false,
+};
```