# Presse-public-domain

I'm fascinated by this new dataset, https://huggingface.co/datasets/PleIAs/French-PD-Newspapers, that just dropped on Hugging Face. It references about 3 million French newspapers, with full text ("only" 320 files of about 700 MB each).

I wrote a data loader that outputs a single parquet file combining all their metadata (that is, without the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); I'm not sure how much of the 200+ GB I actually downloaded for that.

Now I've started exploring the metadata in an Observable project. In one query, I see that a lot of publications stopped publishing in 1944/1945 and, conversely, that a large number of newspapers started publishing between 1941 and 1946. This probably includes both collaborationist and resistance publications; it would be interesting to find an ML way to separate them. In another exploration, I see that the word "gazette" was quite frequent in the 17th century.

I'm not sure what to look for yet, but making these queries in an Observable project is really nice.

What I need to understand:

- how to run the merging data loader in the (HF) cloud (instead of doing it on my computer).
- how to run the build step on HF (or can this only be deployed on Observable?)
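On the merging data loader mentioned above: parquet is a columnar format, so a reader can skip the text column entirely and fetch only the metadata columns, which is likely why the merge is fast. Here is a minimal sketch of how such a merge could be expressed as a DuckDB query, not necessarily how my loader does it. The helper name `buildMetadataQuery`, the `"text"` column name, and the shard URLs are all assumptions for illustration.

```js
// Sketch only: builds the SQL a merging data loader could hand to DuckDB.
// Assumptions: the full-text column is called "text", and `urls` lists the
// parquet shards of the dataset (placeholders here, not real file names).
function buildMetadataQuery(urls, textColumn = "text", outFile = "presse.parquet") {
  // Quote SQL string literals, doubling any embedded single quotes.
  const quote = (s) => `'${String(s).replaceAll("'", "''")}'`;
  const files = urls.map(quote).join(", ");
  // Quote the column identifier, doubling any embedded double quotes.
  const column = `"${String(textColumn).replaceAll('"', '""')}"`;
  // SELECT * EXCLUDE (...) keeps every column except the text one;
  // COPY ... TO writes the combined result to a single parquet file.
  return `COPY (SELECT * EXCLUDE (${column}) FROM read_parquet([${files}])) TO ${quote(outFile)} (FORMAT parquet);`;
}
```

The resulting string could be run with the DuckDB CLI or any DuckDB client; reading `https://` URLs relies on DuckDB's `httpfs` extension.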
```js
import { DuckDBClient } from "npm:@observablehq/duckdb";

// Load the merged metadata file as the `presse` table.
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
```

```js
// Early period: everything dated from 1600 up to (but excluding) 1820.
const selection = db.query(
  `SELECT * FROM presse WHERE date >= '1600' AND date < '1820'`
);
```

```js
// Collect one (letter, date) pair for every A–Z letter found in a title.
const letters = [];
const dates = [];
for (const { date, title } of selection) {
  if (title) {
    for (let l of [...title]) {
      l = l.toUpperCase();
      if (l >= "A" && l <= "Z") {
        const d = +d3.isoParse(date);
        if (d) {
          dates.push(d);
          letters.push(l);
        }
      }
    }
  }
}
```

```js
// display(letters);
// display(dates);
```

```js
// Share of each letter in titles, in 5-year bins, stacked and normalized.
display(
  Plot.plot({
    marks: [
      Plot.areaY(
        letters,
        Plot.stackY(
          // "_order" is not a Plot option, so it has no effect here.
          { offset: "normalize", _order: "value" },
          Plot.binX(
            {
              y: "count",
              filter: null,
            },
            {
              x: dates,
              thresholds: "5 years",
              fill: letters,
              tip: true,
            }
          )
        )
      ),
    ],
  })
);
```

```js
const search = view(Inputs.text({ type: "search", placeholder: "search…" }));
```

```js
display(selection);
```

```js
// Case-insensitive pattern built from the search box.
const test = new RegExp(search, "i");
```

```js
// Yearly counts, colored by whether the title matches the search.
display(
  Plot.plot({
    marginLeft: 60,
    marks: [
      Plot.rectY(
        selection,
        Plot.binX(
          {
            y: "count",
          },
          {
            x: "date",
            fill: (d) => d.title && test.test(d.title),
            thresholds: "1 year",
          }
        )
      ),
    ],
  })
);
```

```js
// First and last date seen for each author.
const authors = db.query(
  `
  SELECT author
       , MIN(date) AS start
       , MAX(date) AS end
    FROM presse
   WHERE date <> 'None'
   GROUP BY 1
  `
);
```

```js
display(authors);
```

```js
// One horizontal rule per author, from first to last date, sorted by duration.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.ruleY(authors, {
        x1: "start",
        x2: "end",
        y: "author",
        stroke: (d) => d3.isoParse(d.end) - d3.isoParse(d.start),
        sort: {
          y: "stroke",
        },
      }),
    ],
  })
);
```

```js
// Authors by start year, colored by the year they stopped.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "start",
            fill: (d) => d3.isoParse(d.end)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```

```js
// Authors by end year, colored by the year they started.
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "end",
            fill: (d) => d3.isoParse(d.start)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```

```js
// Zoom on the decades around the war: 1920 up to (but excluding) 1970.
const aroundWar = db.query(
  `SELECT * FROM presse WHERE date >= '1920' AND date < '1970'`
);
```

```js
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        aroundWar,
        Plot.binX(
          { y: "count" },
          {
            x: "date",
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
```
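
As a first step toward the wartime question raised at the top (publications that started between 1941 and 1946 and stopped in 1944/1945), here is a minimal sketch that keeps only the rows whose whole run falls inside a year window. It is a plain filter, not an ML method: it only isolates candidates and cannot tell collaborationist from resistance publications. The helper names `yearOf` and `runWithin` are hypothetical, and the sketch assumes `start` and `end` are ISO-like strings beginning with a four-digit year, as in the `authors` query.

```js
// Extract the leading four-digit year from an ISO-like string such as
// "1943-02-11"; returns null for anything else (for example "None").
function yearOf(value) {
  const match = /^(\d{4})/.exec(String(value ?? ""));
  return match ? Number(match[1]) : null;
}

// Keep rows whose first and last dates both fall within
// [firstYear, lastYear], bounds included. Works on any iterable of rows.
function runWithin(rows, firstYear, lastYear) {
  const kept = [];
  for (const row of rows) {
    const start = yearOf(row.start);
    const end = yearOf(row.end);
    if (start !== null && end !== null && start >= firstYear && end <= lastYear) {
      kept.push({ author: row.author, start: row.start, end: row.end });
    }
  }
  return kept;
}
```

For example, `display(runWithin(authors, 1941, 1945))` would list the publications that both appeared and disappeared during those years; their titles and text could then feed whatever classifier comes next.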