# Presse-public-domain
I'm fascinated by this new dataset, https://huggingface.co/datasets/PleIAs/French-PD-Newspapers, which just dropped on Hugging Face.
It references about 3 million French newspapers, with full text ("only" 320 files of about 700 MB each 🙂).
I wrote a data loader that outputs a single parquet file combining all their metadata (that is, everything except the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); I'm not sure how much of the 200+ GB I actually downloaded for that. I've now started exploring the metadata in an Observable project.
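The "parquet magic" is presumably column projection: Parquet is a columnar format, so a reader that asks only for the metadata columns can skip the bytes of the text column. Here is a minimal sketch of how such a merge query could be assembled, assuming DuckDB's `read_parquet` and `COPY ... TO` are used. The helper name `buildMergeSql`, the placeholder URLs, and the output file name are hypothetical, and the column list is only inferred from the queries on this page, not taken from the actual loader.

```js
// Hypothetical helper: builds one DuckDB statement that reads only the listed
// columns from several Parquet files and writes them to a single Parquet file.
function buildMergeSql(urls, columns, output) {
  // SQL string literal: wrap in single quotes and double any embedded quote.
  const quote = (s) => `'${String(s).replaceAll("'", "''")}'`;
  const files = urls.map(quote).join(", ");
  // SQL identifier: wrap in double quotes and double any embedded quote.
  const cols = columns.map((c) => `"${c.replaceAll('"', '""')}"`).join(", ");
  return `COPY (SELECT ${cols} FROM read_parquet([${files}])) TO ${quote(output)} (FORMAT PARQUET);`;
}

// Placeholder URLs: the real file names are not listed on this page.
const exampleMergeSql = buildMergeSql(
  ["https://example.org/shard-000.parquet", "https://example.org/shard-001.parquet"],
  ["title", "author", "date"], // assumed metadata columns
  "presse.parquet"
);
```

The statement only names the metadata columns, so the engine has no reason to read the text column at all.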
In one query, I see that A LOT of publications stopped publishing in 1944/1945 and, conversely, that a large number of newspapers started publishing between 1941 and 1946. This probably includes both collaborationist and resistance publications; it would be interesting to find an ML way to separate them.
In another exploration, I see that the word "gazette" was quite frequent in the 17th century.
I'm not sure what to look for yet, but making these queries in an Observable project is really nice.
What I need to understand:
- how to run the merging data loader in the (HF) cloud (instead of doing it on my computer).
- how to run the build step on HF (or can this only be deployed on Observable?)
```js
import { DuckDBClient } from "npm:@observablehq/duckdb";
// Expose the merged metadata file as a DuckDB table named "presse".
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```
```js
const selection = db.query(
  `SELECT * FROM presse WHERE date >= '1600' AND date < '1820'`
);
```
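The range filter above compares `date` with the strings `'1600'` and `'1820'`. A small sketch of why that works, assuming dates are stored as ISO-like text with the literal string `None` for missing values (as the later `date <> 'None'` filter suggests); the sample values are made up for illustration.

```js
// Made-up sample values; assumption: dates are ISO-like strings, or "None".
const sampleDates = [
  "1599-12-31",
  "1600-01-01",
  "1631-05-30",
  "1819-12-31",
  "1820-01-01",
  "None",
];

// Same predicate as the SQL WHERE clause, using plain string comparison.
// ISO dates sort lexicographically in chronological order, and "None" sorts
// after any digit, so the upper bound also drops missing dates.
const inEarlyRange = sampleDates.filter((d) => d >= "1600" && d < "1820");
// inEarlyRange is ["1600-01-01", "1631-05-30", "1819-12-31"]
```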
```js
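// Collect every A-Z letter of every title, with the date of the record it
// came from, as two parallel arrays (one entry per letter).
const letters = [];
const dates = [];
for (const { date, title } of selection) {
  if (title) {
    for (let l of [...title]) {
      l = l.toUpperCase();
      if (l >= "A" && l <= "Z") {
        // Timestamp in milliseconds; 0 (falsy) when the date cannot be parsed.
        const d = +d3.isoParse(date);
        if (d) {
          dates.push(d);
          letters.push(l);
        }
      }
    }
  }
}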
```
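One thing to keep in mind with the loop above: the `A` to `Z` test skips accented letters, which are common in French titles. A minimal sketch of one way to fold accents before counting, using Unicode NFD normalization; the helper name `titleLetters` is hypothetical and this is a possible variant, not what the page currently does.

```js
// Hypothetical helper: returns the A-Z letters of a title, folding accents
// (É becomes E) by decomposing characters and dropping combining marks.
function titleLetters(title) {
  const folded = title.normalize("NFD").replace(/\p{M}/gu, "").toUpperCase();
  return [...folded].filter((l) => l >= "A" && l <= "Z");
}

// Without folding, the accented initial is dropped.
const withoutFolding = [..."Écho".toUpperCase()].filter((l) => l >= "A" && l <= "Z");
// With folding, it is counted as "E".
const withFolding = titleLetters("Écho");
```

Here `withoutFolding` is `["C", "H", "O"]` and `withFolding` is `["E", "C", "H", "O"]`. Ligatures such as "œ" have no canonical decomposition, so they would still be skipped.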
```js
// display(letters);
// display(dates);
```
```js
// Share of each letter among all title letters, in 5-year bins.
display(
  Plot.plot({
    marks: [
      Plot.areaY(
        letters,
        Plot.stackY(
          { offset: "normalize", order: "value" },
          Plot.binX(
            {
              y: "count",
              filter: null,
            },
            {
              x: dates,
              thresholds: "5 years",
              fill: letters,
              tip: true,
            }
          )
        )
      ),
    ],
  })
);
```
```js
const search = view(Inputs.text({ type: "search", placeholder: "search…" }));
```
```js
display(selection);
```
```js
const test = new RegExp(search, "i");
```
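Because `new RegExp(search, "i")` treats the input as a pattern, typing something like an unbalanced parenthesis throws a `SyntaxError`. If literal matching is preferred over regex search, the metacharacters can be escaped first. A minimal sketch, where the helper name `literalPattern` is hypothetical:

```js
// Hypothetical helper: escape regex metacharacters so that the input is
// matched literally (case-insensitively) instead of parsed as a pattern.
function literalPattern(text) {
  return new RegExp(text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "i");
}

// "gazette (de" is not a valid regular expression on its own,
// but it is fine as a literal search string.
const gazettePattern = literalPattern("gazette (de");
const gazetteFound = gazettePattern.test("La Gazette (de France)");
```

The trade-off is that patterns such as `gaz+ette` would then be matched character for character.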
```js
// Records per year, split by whether the title matches the search box.
display(
  Plot.plot({
    marginLeft: 60,
    marks: [
      Plot.rectY(
        selection,
        Plot.binX(
          {
            y: "count",
          },
          {
            x: "date",
            fill: (d) => d.title && test.test(d.title),
            thresholds: "1 year",
          }
        )
      ),
    ],
  })
);
```
```js
const authors = db.query(
  `
  SELECT author
       , MIN(date) AS start
       , MAX(date) AS end
  FROM presse
  WHERE date <> 'None'
  GROUP BY 1
  `
);
```
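To make explicit what the query above computes, here is an in-memory sketch of the same grouping in plain JavaScript. The rows are made up for illustration, the helper name `dateRanges` is hypothetical, and dates are assumed to be ISO-like strings so that string comparison follows chronological order.

```js
// Made-up rows for illustration; not taken from the dataset.
const sampleRows = [
  { author: "A", date: "1941-03-01" },
  { author: "A", date: "1944-08-20" },
  { author: "B", date: "1944-09-01" },
  { author: "B", date: "None" },
  { author: "B", date: "1950-01-15" },
];

// Equivalent of:
//   SELECT author, MIN(date) AS start, MAX(date) AS end
//   FROM presse WHERE date <> 'None' GROUP BY 1
function dateRanges(rows) {
  const ranges = new Map();
  for (const { author, date } of rows) {
    // The SQL filter drops 'None' and also NULL dates.
    if (date == null || date === "None") continue;
    const r = ranges.get(author);
    if (!r) {
      ranges.set(author, { author, start: date, end: date });
    } else {
      if (date < r.start) r.start = date;
      if (date > r.end) r.end = date;
    }
  }
  return [...ranges.values()];
}

const sampleRanges = dateRanges(sampleRows);
```

Each resulting row has one `author` with its earliest and latest date, which is what the plots of start and end years further down rely on.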
```js
display(authors);
```
```js
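// One horizontal rule per author, from first to last date,
// colored and sorted by the length of that span.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.ruleY(authors, {
        x1: "start",
        x2: "end",
        y: "author",
        stroke: (d) => d3.isoParse(d.end) - d3.isoParse(d.start),
        sort: {
          y: "stroke",
        },
      }),
    ],
  })
);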
```
```js
// Number of authors by year of their first date, colored by the year of their last date.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "start",
            fill: (d) => d3.isoParse(d.end)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```
```js
// Number of authors by year of their last date, colored by the year of their first date.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "end",
            fill: (d) => d3.isoParse(d.start)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```
```js
const aroundWar = db.query(
  `SELECT * FROM presse WHERE date >= '1920' AND date < '1970'`
);
```
```js
// Records per year between 1920 and 1970.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        aroundWar,
        Plot.binX(
          { y: "count" },
          {
            x: "date",
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```