something readable

Changed files:

- Dockerfile +2 -2
- docs/data/presse.parquet.sh +5 -3
- docs/gazette.md +55 -0
- docs/index.md +26 -201
- docs/resistance.md +107 -0
Dockerfile CHANGED

```diff
@@ -1,8 +1,8 @@
 FROM node:latest

-RUN
+RUN arch # print arch info

-RUN
+RUN useradd -o -u 1000 user

 USER user

```
docs/data/presse.parquet.sh CHANGED

```diff
@@ -1,6 +1,10 @@
+# file_id, ocr, title, date, author, page_count, word_count, character_count
+
 echo """
 CREATE TABLE presse AS (
-SELECT
+SELECT title
+, author
+, LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE AS year
 FROM read_parquet([""" > $TMPDIR/presse.sql

 for i in $(seq 1 320); do
@@ -10,8 +14,6 @@ done
 echo """ ])
 );

-ALTER TABLE presse ALTER COLUMN date TYPE DATE;
-
 COPY presse TO '$TMPDIR/presse.parquet' (FORMAT 'parquet', COMPRESSION 'GZIP');
 """ >> $TMPDIR/presse.sql

```
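The year normalization added in this change (extract a plausible four-digit year with a regex, pad the result into an ISO date so misses become `0000-01-01`) can be sketched in plain JavaScript. `toIsoYear` is a hypothetical helper written here for illustration, mirroring the SQL expression; it is not part of the project:

```javascript
// Mimic LPAD((REGEXP_EXTRACT(date, '1[0-9][0-9][0-9]') || '-01-01'), 10, '0')::DATE:
// pull the first four-digit year starting with 1 out of a free-form date string
// and turn it into an ISO "YYYY-01-01" date, or "0000-01-01" when nothing matches.
function toIsoYear(date) {
  const m = /1[0-9][0-9][0-9]/.exec(date ?? "");
  const year = m ? m[0] : "";
  // Left-padding to 10 characters reproduces the "0000-01-01" fallback for misses,
  // just like LPAD(..., 10, '0') in DuckDB (REGEXP_EXTRACT yields '' on no match).
  return (year + "-01-01").padStart(10, "0");
}

console.log(toIsoYear("12 mai 1871")); // "1871-01-01"
console.log(toIsoYear("None"));        // "0000-01-01"
```

This also explains why downstream queries filter out years below 1000: unparseable dates all collapse into the `0000` bucket.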
docs/gazette.md ADDED

````markdown
# Gazette

```js echo
import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

This page lets you explore the 3 million newspapers by title. I called it “Gazette” because I was surprised to find that, in the earlier years, most of the titles in the corpus contain this word.

Type in words such as “jeune”, “révolution”, “république”, “soir”, “fille”, “femme”, “paysan”, “ouvrier”, “social”, etc., to see different historical trends.

```js
const search = view(
  Inputs.text({ type: "search", value: "gazette", submit: true })
);
```

```js echo
display(
  Plot.plot({
    x: { nice: true },
    y: {
      label: `Share of titles matching ${search}`,
      tickFormat: "%",
    },
    marks: [
      Plot.areaY(gazette, {
        x: "year",
        y: (d) => d.matches / d.total,
        fillOpacity: 0.2,
      }),
      Plot.lineY(gazette, {
        x: "year",
        y: (d) => d.matches / d.total,
      }),
    ],
  })
);
```

The query uses the [REGEXP_MATCHES](https://duckdb.org/docs/archive/0.9.2/sql/functions/patternmatching) function to count occurrences; for example, you can search for “socialis[tm]e” to match both “socialiste” and “socialisme”. The 'i' flag makes the match case-insensitive.

```js echo
const gazette = db.query(
  `SELECT year
  , SUM(CASE WHEN REGEXP_MATCHES(title, ?, 'i') THEN 1 ELSE 0 END)::int AS matches
  , COUNT(*) AS total
  FROM presse
  WHERE year > '1000'
  GROUP BY year
  ORDER BY year`,
  [search]
);
```
````
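The aggregation in that query (a case-insensitive regex match counted per year, divided by the yearly total) can be sketched client-side. This is an illustrative re-implementation, not code from the page, and the sample rows are made up:

```javascript
// Count, per year, how many titles match a case-insensitive pattern, mirroring
// SUM(CASE WHEN REGEXP_MATCHES(title, ?, 'i') THEN 1 ELSE 0 END) and COUNT(*).
function sharesByYear(rows, pattern) {
  const re = new RegExp(pattern, "i"); // the 'i' flag, as in the SQL query
  const byYear = new Map();
  for (const { year, title } of rows) {
    const s = byYear.get(year) ?? { year, matches: 0, total: 0 };
    s.total += 1;
    if (re.test(title)) s.matches += 1;
    byYear.set(year, s);
  }
  return [...byYear.values()].sort((a, b) => a.year - b.year);
}

const sample = [
  { year: 1789, title: "Gazette nationale" },
  { year: 1789, title: "L'Ami du peuple" },
  { year: 1901, title: "Le Socialiste" },
];
console.log(sharesByYear(sample, "socialis[tm]e"));
// → [{ year: 1789, matches: 0, total: 2 }, { year: 1901, matches: 1, total: 1 }]
```

Plotting `matches / total` rather than raw counts is what makes the chart comparable across years with very different corpus sizes.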
docs/index.md CHANGED

Removed (the earlier exploratory version of the page; a few lines were clipped in the capture):

````markdown
#

In

```js
import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

```js
const selection = db.query(
  `SELECT * FROM presse WHERE date >= '1600' AND date < '1820'`
);
```

```js
const letters = [];
const dates = [];

for (const { date, title } of selection) {
  if (title) {
    for (let l of [...title]) {
      l = l.toUpperCase();
      if (l >= "A" && l <= "Z") {
        const d = +d3.isoParse(date);
        if (d) {
          dates.push(d);
          letters.push(l);
        }
      }
    }
  }
}
```

```js
// display(letters);
// display(dates);
```

```js
display(
  Plot.plot({
    marks: [
      Plot.areaY(
        letters,
        Plot.stackY(
          { offset: "normalize", order: "value" },
          Plot.binX(
            {
              y: "count",
              filter: null,
            },
            {
              x: dates,
              thresholds: "5 years",
              fill: letters,
              tip: true,
            }
          )
        )
      ),
    ],
  })
);
```

```js
const search = view(Inputs.text({ type: "search", placeholder: "search…" }));
```

```js
display(selection);
```

```js
const test = new RegExp(search, "i");
```

```js
display(
  Plot.plot({
    marginLeft: 60,
    marks: [
      Plot.rectY(
        selection,
        Plot.binX(
          {
            y: "count",
          },
          {
            x: "date",
            fill: (d) => d.title && test.test(d.title),
            thresholds: "1 year",
          }
        )
      ),
    ],
  })
);
```

```js
const authors = db.query(
  `
SELECT author
, MIN(date) AS start
, MAX(date) AS end
FROM presse
WHERE date <> 'None'
GROUP BY 1
`
);
```

```js
display(authors);
```

```js
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.ruleY(authors, {
        x1: "start",
        x2: "end",
        y: "author",
        stroke: (d) => d3.isoParse(d.end) - d3.isoParse(d.start),
        sort: {
          y: "stroke",
        },
      }),
    ],
  })
);
```

```js
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "start",
            fill: (d) => d3.isoParse(d.end)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```

```js
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "end",
            fill: (d) => d3.isoParse(d.start)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```

```js
const aroundWar = db.query(
  `SELECT * FROM presse WHERE date >= '1920' AND date < '1970'`
);
```

```js
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        aroundWar,
        Plot.binX(
          { y: "count" },
          {
            x: "date",
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```
````

Added:

````markdown
# FPDN exploration

A new fascinating dataset just dropped on 🤗. [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) references about 3 million newspapers and periodicals, with their full text OCR’ed and some metadata.

The data is stored in 320 chunks weighing about 700MB each, each containing about 7,500 texts.

The data loader for this Observable project uses DuckDB to read these 320 parquet files (altogether about 200GB) and combines a minimal subset of their metadata (title and year of publication, crucially without the text contents) into a single parquet file. It takes only about 1 minute to run in a Hugging Face Space.

The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with Observable Plot.

In this project, I’m exploring two aspects of the dataset:

- As I played with the titles, I noticed that the word “gazette” was quite frequent in the 17th century. An exploration of the words used in the titles is on the [gazette](gazette) page.
- A lot of publications stopped or started publishing during the Second World War; this is explored in [resistance](resistance).

This page summarizes the time distribution of the data:

```js echo
import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

```js echo
const dates = db.query("SELECT year FROM presse WHERE year >= '1000'");
```

_Note: due to the date pattern matching I’m using, unknown years are marked as 0000, hence the filter above._

The chart below indicates that the bulk of the contents collected in this database was published between 1850 and 1950. It’s obviously not that the _presse_ stopped after 1950; rather, most of the printed world after that threshold year is still out of reach of researchers, as it is “protected” by copyright or _droit d’auteur_.

${Plot.rectY(dates, Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })).plot({ marginLeft: 60 })}

```js echo run=false
Plot.plot({
  marks: [
    Plot.rectY(
      dates,
      Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })
    ),
  ],
});
```
````
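The histogram on the new index page bins years into 5-year buckets and counts rows per bucket. A minimal sketch of that bucketing, roughly what `Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })` computes (this helper is illustrative, not Plot's actual implementation, and it works on plain numeric years rather than dates):

```javascript
// Group years into 5-year buckets and count occurrences per bucket.
function binBy5Years(years) {
  const counts = new Map();
  for (const y of years) {
    const bucket = Math.floor(y / 5) * 5; // e.g. 1871 → 1870
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort(([a], [b]) => a - b)
    .map(([start, count]) => ({ start, count }));
}

console.log(binBy5Years([1871, 1873, 1874, 1902]));
// → [{ start: 1870, count: 3 }, { start: 1900, count: 1 }]
```

Using an explicit `interval` instead of `thresholds` keeps the bucket boundaries stable regardless of the data's extent.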
docs/resistance.md ADDED

````markdown
# Resistance

During the Second World War, the Nazis occupied the northern half of France, and the collaborationist government of Pétain was left to rule over the southern half (the “[zone libre](https://fr.wikipedia.org/wiki/Zone_libre)”). Many newspapers of that time were closed down; others submitted to the occupiers (some even collaborated enthusiastically). At the same time, a range of clandestine publications started circulating, often associated with the resistance movements. When the country was liberated in 1944, the most outrageously collaborationist press was dismantled; other newspapers changed their names and were sometimes taken over by new teams of resistance journalists. The most famous case is “Le Temps,” a daily newspaper that had been [publishing since 1861](<https://fr.wikipedia.org/wiki/Le_Temps_(quotidien_fran%C3%A7ais,_1861-1942)>) and closed in 1942. Although not a collaborationist newspaper, it was not allowed to reopen, and its assets were transferred to create “Le Monde,” launched on 19 December 1944 under Hubert Beuve-Méry.

```js echo
import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

```js echo
const letemps = db.query(
  "SELECT year FROM presse WHERE title = 'Le Temps' AND year > '1000'"
);
```

```js echo
display(
  Plot.plot({
    caption: "Number of issues of Le Temps in the dataset, per year",
    x: { nice: true },
    y: { grid: true },
    marks: [
      Plot.ruleY([0]),
      Plot.rectY(
        letemps,
        Plot.binX({ y: "count" }, { x: "year", interval: "year" })
      ),
    ],
  })
);
```

(Unfortunately, “Le Monde” is not part of the dataset.)

The number of titles that stopped or started publishing exploded in those fateful years. Note that many of these publications were short-lived, such as this example picked at random from the dataset: [Au-devant de la vie. Organe de l'Union des jeunes filles patriotes (UJFP), Région parisienne](https://gallica.bnf.fr/ark:/12148/bpt6k76208732?rk=21459;2). While the UJFP (a resistance organisation of communist young women) published several titles during the war, only one issue was distributed under this title.

```js echo
const years = db.query(`
SELECT title
, MIN(year) AS start
, MAX(year) AS end
FROM presse
GROUP BY 1
`);
```

```js echo
display(
  Plot.plot({
    color: { legend: true },
    marks: [
      Plot.rectY(
        years,
        Plot.binX(
          { y: "count" },
          {
            filter: (d) =>
              d.start?.getUTCFullYear() >= 1930 &&
              d.start?.getUTCFullYear() <= 1955,
            x: "start",
            fill: () => "started",
            interval: "year",
          }
        )
      ),
      Plot.rectY(
        years,
        Plot.binX(
          { y: "count" },
          {
            filter: (d) =>
              d.end?.getUTCFullYear() >= 1930 &&
              d.end?.getUTCFullYear() <= 1955,
            x: "end",
            fill: () => "ended",
            mixBlendMode: "multiply",
            interval: "year",
          }
        )
      ),
      Plot.ruleY([0]),
    ],
  })
);
```

Let’s focus on the ${start1944.length} publications that started publishing in 1944, and extract their titles and authors:

```js echo
const start1944 = db.query(`
SELECT title
, CASE WHEN author = 'None' THEN '' ELSE author END AS author
, DATE_PART('year', MIN(year)) AS start
, DATE_PART('year', MAX(year)) AS end
, COUNT(*) AS issues
FROM presse
GROUP BY 1, 2
HAVING DATE_PART('year', MIN(year)) = 1944
ORDER BY issues DESC
`);
```

```js
display(Inputs.table(start1944));
```

Going through these titles, one gets a pretty impressive picture of the publishing activity in this extreme historical period.
````
|