<header id="observablehq-header">
<style>
#observablehq-header a[href] {color: inherit;}
@container not (min-width: 640px) {
.hide-if-small {display: none;}
}
</style>
<div style="display: flex; align-items: center; gap: 0.5rem; height: 2.2rem; margin: -1.5rem -2rem 2rem -2rem; padding: 0.5rem 2rem; border-bottom: solid 1px var(--theme-foreground-faintest); font: 500 16px var(--sans-serif);">
  <a href="https://observablehq.com/" style="display: flex; align-items: center;">
    <svg width="22" height="22" viewBox="0 0 21.92930030822754 22.68549919128418" fill="currentColor">
      <path d="M10.9646 18.9046C9.95224 18.9046 9.07507 18.6853 8.33313 18.2467C7.59386 17.8098 7.0028 17.1909 6.62722 16.4604C6.22789 15.7003 5.93558 14.8965 5.75735 14.0684C5.56825 13.1704 5.47613 12.2574 5.48232 11.3427C5.48232 10.6185 5.52984 9.92616 5.62578 9.26408C5.7208 8.60284 5.89715 7.93067 6.15391 7.24843C6.41066 6.56618 6.74143 5.97468 7.14438 5.47308C7.56389 4.9592 8.1063 4.54092 8.72969 4.25059C9.38391 3.93719 10.1277 3.78091 10.9646 3.78091C11.977 3.78091 12.8542 4.00021 13.5962 4.43879C14.3354 4.87564 14.9265 5.49454 15.3021 6.22506C15.6986 6.97704 15.9883 7.7744 16.1719 8.61712C16.3547 9.459 16.447 10.3681 16.447 11.3427C16.447 12.067 16.3995 12.7593 16.3035 13.4214C16.2013 14.1088 16.0206 14.7844 15.7644 15.437C15.4994 16.1193 15.1705 16.7108 14.7739 17.2124C14.3774 17.714 13.8529 18.1215 13.1996 18.4349C12.5463 18.7483 11.8016 18.9046 10.9646 18.9046ZM12.8999 13.3447C13.4242 12.8211 13.7159 12.0966 13.7058 11.3427C13.7058 10.5639 13.4436 9.89654 12.92 9.34074C12.3955 8.78495 11.7441 8.50705 10.9646 8.50705C10.1852 8.50705 9.53376 8.78495 9.00928 9.34074C8.49569 9.87018 8.21207 10.5928 8.22348 11.3427C8.22348 12.1216 8.48572 12.7889 9.00928 13.3447C9.53376 13.9005 10.1852 14.1784 10.9646 14.1784C11.7441 14.1784 12.3891 13.9005 12.8999 13.3447ZM10.9646 22.6855C17.0199 22.6855 21.9293 17.6068 21.9293 11.3427C21.9293 5.07871 17.0199 0 10.9646 0C4.90942 0 0 5.07871 0 11.3427C0 17.6068 4.90942 22.6855 10.9646 22.6855Z"></path>
    </svg>
  </a>
  <div style="display: flex; flex-grow: 1; justify-content: space-between; align-items: baseline;">
    <a href="https://observablehq.com/framework/">
      <span class="hide-if-small">Observable</span> Framework
    </a>
  </div>
</div>
</header>

# French public domain newspapers

## A quick glance at 3&nbsp;million periodicals

<p class=signature>by <a href="https://observablehq.com/@fil">Fil</a></p>

This fascinating new dataset just dropped on Hugging Face: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗 references about **3&nbsp;million newspapers and periodicals**, with their full text OCR’ed and some metadata.

The data is stored in 320 large parquet files. The data loader for this [Observable Framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (about 200&nbsp;GB altogether) and combines a minimal subset of their metadata (title and year of publication, and most importantly not the text contents) into a single, highly optimized parquet file. This takes only about 1&nbsp;minute to run in a Hugging Face Space.
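To make the idea concrete, here is a minimal sketch of the kind of DuckDB statement such a loader could run. It is not the actual loader: the helper name `buildLoaderSql`, the column names `title` and `date`, and the file paths are all assumptions for illustration. Because parquet is a columnar format, selecting only a couple of columns lets DuckDB skip the large text column entirely, which is what keeps this kind of job fast.

```js run=false
// Hypothetical helper: builds a DuckDB statement that reads many parquet
// files, keeps only two metadata columns, and writes one parquet file.
// The column names (title, date) and both paths are assumptions.
function buildLoaderSql(sourceGlob, outputPath) {
  return `COPY (
  SELECT title, date
  FROM read_parquet('${sourceGlob}')
) TO '${outputPath}' (FORMAT PARQUET);`;
}

// Example with placeholder paths.
const loaderSql = buildLoaderSql("data/*.parquet", "presse.parquet");
console.log(loaderSql);
```

The resulting string can be piped to the `duckdb` command-line client, or passed to any DuckDB binding.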

The resulting file is small enough (about 8&nbsp;MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).

In this project, I’m exploring two aspects of the dataset:

- As I played with the titles, I saw that the word “gazette” was quite frequent in the 17th century. An exploration of the words used in the titles is on the page [gazette](gazette).

- Many publications stopped or started publishing during the Second World War. This is explored on the page [resistance](resistance).

This page summarizes the time distribution of the data:

```js echo
import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

```js echo
const dates = db.query("SELECT year FROM presse WHERE year >= '1000'");
```

_Note: due to the date pattern matching I’m using, unknown years are marked as 0000. Hence the filter above._
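As an illustration of how a pattern-based extraction can end up with `0000` for unknown years, here is a minimal sketch in plain JavaScript. The pattern and the `extractYear` helper are hypothetical: the loader's real pattern is not shown on this page.

```js run=false
// Hypothetical helper: returns the first 4-digit year starting with "1"
// found in a free-form date string, or "0000" when nothing matches.
function extractYear(date) {
  const match = /\b(1\d{3})\b/.exec(String(date ?? ""));
  return match ? match[1] : "0000";
}

// Four-digit strings compare correctly as text, so this keeps known years only.
const sampleYears = ["1887-05-12", "s.d.", "1631"].map(extractYear);
const knownYears = sampleYears.filter((year) => year >= "1000");
```

With this convention, a simple comparison is enough to discard the records whose year could not be determined.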

The chart below indicates that the bulk of the contents collected in this database was published between 1850 and 1950. This is obviously not because the _presse_ stopped after 1950, but because most of what was printed after that threshold year is still out of reach of researchers, as it is “protected” by copyright or _droit d’auteur_.

${Plot.rectY(dates, Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })).plot({ marginLeft: 60 })}

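```js echo run=false
Plot.plot({
  marginLeft: 60, // leave room for the large counts on the y-axis
  marks: [
    Plot.rectY(
      dates,
      // Group the records into 5-year bins and count them.
      Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })
    ),
  ],
});
```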
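For readers curious about what the binning amounts to, here is a rough plain-JavaScript equivalent of counting records per 5-year interval. It is a sketch only: it assumes numeric years and bins aligned on multiples of the width, and the `binByInterval` helper is a hypothetical name, not part of Observable Plot.

```js run=false
// Hypothetical helper: counts values per fixed-width bin.
// Assumes `years` holds plain numbers such as 1850.
function binByInterval(years, width = 5) {
  const counts = new Map();
  for (const year of years) {
    // Floor each year to the start of its bin, e.g. 1854 -> 1850.
    const start = Math.floor(year / width) * width;
    counts.set(start, (counts.get(start) ?? 0) + 1);
  }
  // Return the bins in chronological order.
  return [...counts]
    .sort(([a], [b]) => a - b)
    .map(([start, count]) => ({ start, end: start + width, count }));
}

const bins = binByInterval([1850, 1851, 1854, 1855, 1901]);
```

Each resulting object corresponds to one bar of the histogram: `start` and `end` give its horizontal extent and `count` its height.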

<style>

.signature a[href] {
  color: var(--theme-foreground)
}

.signature {
  text-align: right;
  font-size: small;
}

.signature::before {
  content: "◼︎ ";
}

</style>