# Presse-public-domain

I'm fascinated by this new dataset, https://huggingface.co/datasets/PleIAs/French-PD-Newspapers, that just dropped on Hugging Face.

It references about 3 million French newspapers, with full text ("only" 320 files of about 700MB each 🙂).

I wrote a data loader that outputs a single parquet file combining all their metadata (that is, without the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); I'm not sure how much of the ~200+GB I actually downloaded for that. Now I've started exploring the metadata in an Observable project.

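
The loader itself is not shown on this page. As a rough sketch of the idea (not the actual loader), the snippet below builds a DuckDB statement that selects only the metadata columns from a list of remote parquet files and writes them to one local file. Because parquet is columnar, a reader such as DuckDB can fetch just the requested columns instead of whole files, which is presumably the "parquet magic" at work. The helper name `buildMetadataQuery`, the example URLs, and the output path are assumptions for illustration; the column names `title`, `author`, and `date` are the ones used in the queries on this page.

```js
// Hypothetical helper: builds the SQL a data loader could hand to DuckDB.
// Only the listed columns are selected, so the full-text column is never read.
function buildMetadataQuery(urls, columns, output) {
  if (urls.length === 0) throw new Error("at least one parquet URL is required");
  const quote = (s) => `'${String(s).replaceAll("'", "''")}'`; // SQL string literal
  const ident = (s) => `"${String(s).replaceAll('"', '""')}"`; // SQL identifier
  const files = urls.map(quote).join(", ");
  const cols = columns.map(ident).join(", ");
  return `COPY (SELECT ${cols} FROM read_parquet([${files}])) TO ${quote(output)} (FORMAT PARQUET);`;
}

// Placeholder URLs; the dataset's real file names are not assumed here.
const exampleQuery = buildMetadataQuery(
  ["https://example.com/part-000.parquet", "https://example.com/part-001.parquet"],
  ["title", "author", "date"],
  "presse.parquet"
);
```

The resulting string could be run by any DuckDB client; how the real loader lists the 320 files and which columns it keeps may well differ.
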
In one query, I see that A LOT of publications stopped publishing in 1944/1945, and conversely a large number of newspapers started publishing between 1941 and 1946. This probably includes both collaborationist and resistance publications; it would be interesting to find an ML way to separate them.

In another exploration, I see that the word "gazette" was quite frequent in the 17th century.

I'm not sure what to look for yet, but making these queries in an Observable project is really nice.

What I need to understand:

- how to run the merging data loader in the (HF) cloud, instead of on my computer;
- how to run the build step on HF (or can this only be deployed on Observable?).

```js
import { DuckDBClient } from "npm:@observablehq/duckdb";

// Load the metadata-only parquet file as a DuckDB table named "presse".
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

```js
// Issues dated from 1600 up to (but not including) 1820.
const selection = db.query(
  `SELECT * FROM presse WHERE date >= '1600' AND date < '1820'`
);
```

```js
// Collect each A-to-Z letter of every title, paired with the issue date as a timestamp.
const letters = [];
const dates = [];
for (const { date, title } of selection) {
  if (title) {
    for (let l of [...title]) {
      l = l.toUpperCase();
      if (l >= "A" && l <= "Z") {
        const d = +d3.isoParse(date);
        if (d) {
          dates.push(d);
          letters.push(l);
        }
      }
    }
  }
}
```

```js
// display(letters);
// display(dates);
```

```js
// Share of each letter in titles over time, in 5-year bins, stacked to 100%.
display(
  Plot.plot({
    marks: [
      Plot.areaY(
        letters,
        Plot.stackY(
          { offset: "normalize" }, // order: "value" is left disabled
          Plot.binX(
            {
              y: "count",
              filter: null,
            },
            {
              x: dates,
              thresholds: "5 years",
              fill: letters,
              tip: true,
            }
          )
        )
      ),
    ],
  })
);
```

```js
const search = view(Inputs.text({ type: "search", placeholder: "search…" }));
```

```js
display(selection);
```

```js
const test = new RegExp(search, "i");
```
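
One caveat with this cell: `new RegExp(search, "i")` throws a `SyntaxError` while the search box holds an incomplete pattern such as `(`. A minimal sketch of one way to guard against that, assuming a literal match is an acceptable fallback; `safeRegExp` is a hypothetical helper, not part of Observable Framework or Inputs.

```js
// Hypothetical helper: compile user input as a case-insensitive regular expression,
// falling back to a literal (escaped) match when the input is not a valid pattern.
function safeRegExp(input) {
  try {
    return new RegExp(input, "i");
  } catch {
    const escaped = input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp(escaped, "i");
  }
}
```

With it, the cell above could read `const test = safeRegExp(search);`.
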

```js
// Issues per year, colored by whether the title matches the search.
display(
  Plot.plot({
    marginLeft: 60,
    marks: [
      Plot.rectY(
        selection,
        Plot.binX(
          {
            y: "count",
          },
          {
            x: "date",
            fill: (d) => d.title && test.test(d.title),
            thresholds: "1 year",
          }
        )
      ),
    ],
  })
);
```

```js
// First and last date found for each author.
const authors = db.query(
  `
  SELECT author
       , MIN(date) AS start
       , MAX(date) AS end
    FROM presse
   WHERE date <> 'None'
   GROUP BY 1
  `
);
```
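
To make the shape of that result concrete, here is a small sketch of the same aggregation in plain JavaScript, run on made-up rows. It assumes, as the string comparisons in the queries on this page suggest, that `date` is stored as an ISO-formatted string, so comparing strings orders dates correctly. The name `lifespans` and the sample rows are illustrative only.

```js
// Hypothetical in-memory equivalent of the GROUP BY query above:
// one row per author with its earliest and latest date.
function lifespans(rows) {
  const byAuthor = new Map();
  for (const { author, date } of rows) {
    if (date == null || date === "None") continue; // mirrors WHERE date <> 'None'
    const span = byAuthor.get(author);
    if (!span) {
      byAuthor.set(author, { author, start: date, end: date });
    } else {
      if (date < span.start) span.start = date;
      if (date > span.end) span.end = date;
    }
  }
  return [...byAuthor.values()];
}

// Made-up sample rows, for illustration only.
const sampleSpans = lifespans([
  { author: "A", date: "1941-03-01" },
  { author: "A", date: "1944-08-20" },
  { author: "B", date: "None" },
  { author: "B", date: "1789-07-14" },
]);
```

Each resulting row has the same `author`, `start`, and `end` fields that the charts below read from `authors`.
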

```js
display(authors);
```

```js
// One horizontal rule per author, from first to last date, sorted and colored by duration.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.ruleY(authors, {
        x1: "start",
        x2: "end",
        y: "author",
        stroke: (d) => d3.isoParse(d.end) - d3.isoParse(d.start),
        sort: {
          y: "stroke",
        },
      }),
    ],
  })
);
```

```js
// Number of authors by first year, colored by the year of their last date.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "start",
            fill: (d) => d3.isoParse(d.end)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```

```js
// Number of authors by last year, colored by the year of their first date.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "end",
            fill: (d) => d3.isoParse(d.start)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```

```js
// Issues dated from 1920 up to (but not including) 1970.
const aroundWar = db.query(
  `SELECT * FROM presse WHERE date >= '1920' AND date < '1970'`
);
```

```js
// Issues per year around the war.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        aroundWar,
        Plot.binX(
          { y: "count" },
          {
            x: "date",
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```