Presse-public-domain

I'm fascinated by this new dataset https://huggingface.co/datasets/PleIAs/French-PD-Newspapers that just dropped on Hugging Face.

It references about 3 million French newspapers, with full text ("only" 320 files of about 700 MB each 🙂).

I wrote a data loader that outputs a single parquet file combining all their metadata (that is, everything except the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); I'm not sure how much of the 200+ GB I actually downloaded for that. I've now started exploring the metadata in an Observable project.
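The loader itself is not shown on this page, so here is only a minimal sketch of the idea, not the actual implementation. It assumes DuckDB does the work: because parquet is columnar, a reader that supports HTTP range requests can fetch just the selected columns, which is probably the "parquet magic" mentioned above. The helper name `buildMetadataQuery` and the file URLs are hypothetical; the column names `title`, `author` and `date` are the ones used by the queries further down.

```javascript
// Sketch only: builds a DuckDB statement that reads a few columns from many
// remote parquet files and writes them into one local parquet file.
// buildMetadataQuery is a hypothetical helper, not part of any library.
function buildMetadataQuery(urls, columns, output) {
  const quote = (s) => `'${String(s).replaceAll("'", "''")}'`; // SQL string literal
  const ident = (s) => `"${String(s).replaceAll('"', '""')}"`; // SQL identifier
  const files = urls.map(quote).join(", ");
  const cols = columns.map(ident).join(", ");
  return `COPY (SELECT ${cols} FROM read_parquet([${files}])) TO ${quote(output)} (FORMAT PARQUET);`;
}

// Hypothetical file names: the dataset's real file layout is not given here.
const urls = [
  "https://example.org/part-001.parquet",
  "https://example.org/part-002.parquet",
];
const sql = buildMetadataQuery(urls, ["title", "author", "date"], "presse.parquet");
console.log(sql);
```

Running the generated statement would need a DuckDB build that can read over HTTP; the text column is simply never listed, so it is never selected.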

In one query, I see that A LOT of publications stopped publishing in 1944/1945, and conversely a large number of newspapers started publishing between 1941 and 1946. This probably includes both collaborationist and resistance publications; it would be interesting to find an ML way to separate them.
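To make that observation concrete, here is a small sketch in plain JavaScript of the same grouping that the `authors` query below expresses in SQL: first and last date per `author` value, then a count of how many start or end in each year. It assumes dates are ISO-like strings (`YYYY-MM-DD`), so they compare correctly as text, and that missing dates appear as the string `'None'`, as the query suggests. The rows are made up for illustration and are not taken from the dataset.

```javascript
// First and last date seen for each author value (rows without a date are skipped).
function publicationSpans(rows) {
  const byAuthor = new Map();
  for (const { author, date } of rows) {
    if (!date || date === "None") continue;
    const span = byAuthor.get(author);
    if (!span) {
      byAuthor.set(author, { author, start: date, end: date });
    } else {
      if (date < span.start) span.start = date; // ISO strings sort chronologically
      if (date > span.end) span.end = date;
    }
  }
  return [...byAuthor.values()];
}

// Number of spans whose `key` ("start" or "end") falls in each year.
function countByYear(spans, key) {
  const counts = new Map();
  for (const span of spans) {
    const year = span[key].slice(0, 4);
    counts.set(year, (counts.get(year) ?? 0) + 1);
  }
  return counts;
}

// Invented sample rows, for illustration only.
const rows = [
  { author: "Journal A", date: "1938-05-01" },
  { author: "Journal A", date: "1944-08-15" },
  { author: "Journal B", date: "1942-01-10" },
  { author: "Journal B", date: "1945-03-02" },
  { author: "Journal C", date: "1944-09-01" },
  { author: "Journal C", date: "1952-12-31" },
  { author: "Journal D", date: "None" },
];
const spanList = publicationSpans(rows);
const starts = countByYear(spanList, "start");
const ends = countByYear(spanList, "end");
console.log(Object.fromEntries(starts), Object.fromEntries(ends));
```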

In another exploration, I see that the word "gazette" was quite frequent in the 17th century.
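As a rough illustration of that kind of check, the sketch below counts, per decade, how many titles contain a search term, ignoring case. It is a standalone approximation of what the search-driven histogram further down does, not the code used on this page. It assumes dates start with a four-digit year; the helper names and the sample rows are invented. Unlike a bare `new RegExp(search, "i")`, it escapes the term first, so input such as `(` is matched literally instead of throwing.

```javascript
// Escape regular-expression metacharacters so the term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Per decade: number of dated, titled rows and how many titles contain the term.
function countMatchesByDecade(rows, term) {
  const re = new RegExp(escapeRegExp(term), "i");
  const counts = new Map();
  for (const { date, title } of rows) {
    if (!date || !title) continue;
    const year = Number(String(date).slice(0, 4));
    if (!Number.isInteger(year)) continue; // e.g. the string "None"
    const decade = Math.floor(year / 10) * 10;
    const entry = counts.get(decade) ?? { total: 0, matches: 0 };
    entry.total += 1;
    if (re.test(title)) entry.matches += 1;
    counts.set(decade, entry);
  }
  return counts;
}

// Invented sample rows, for illustration only.
const rows = [
  { date: "1631-05-30", title: "Gazette de X" },
  { date: "1635-01-01", title: "Nouvelles de Y" },
  { date: "1638-02-03", title: "GAZETTE extraordinaire" },
  { date: "1702-01-01", title: "Journal (suite)" },
  { date: "None", title: "Gazette sans date" },
];
const counts = countMatchesByDecade(rows, "gazette");
console.log(Object.fromEntries(counts));
```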

Not sure what to look for, but making these queries in an Observable project is really nice.

// Register data/presse.parquet as the DuckDB table `presse`.
import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
const selection = db.query(
  `SELECT * FROM presse WHERE date >= '1600' AND date < '1820'`
);
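// Collect every A-Z letter of every title (uppercased; accented letters are skipped),
// together with the matching issue date, as two parallel arrays.
const letters = [];
const dates = [];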

for (const { date, title } of selection) {
  if (title) {
    for (let l of [...title]) {
      l = l.toUpperCase();
      if (l >= "A" && l <= "Z") {
        const d = +d3.isoParse(date);
        if (d) {
          dates.push(d);
          letters.push(l);
        }
      }
    }
  }
}
// display(letters);
// display(dates);
display(
  Plot.plot({
    marks: [
      Plot.areaY(
        letters,
        Plot.stackY(
          { offset: "normalize", order: "value" },
          Plot.binX(
            {
              y: "count",
              filter: null,
            },
            {
              x: dates,
              thresholds: "5 years",
              fill: letters,
              tip: true,
            }
          )
        )
      ),
    ],
  })
);
const search = view(Inputs.text({ type: "search", placeholder: "search…" }));
display(selection);
// Case-insensitive pattern built from the search box; an invalid pattern such as "(" will throw.
const test = new RegExp(search, "i");
display(
  Plot.plot({
    marginLeft: 60,
    marks: [
      Plot.rectY(
        selection,
        Plot.binX(
          {
            y: "count",
          },
          {
            x: "date",
            fill: (d) => d.title && test.test(d.title),
            thresholds: "1 year",
          }
        )
      ),
    ],
  })
);
// First and last date seen for each author value, skipping rows whose date is the string 'None'.
const authors = db.query(
  `
SELECT author
     , MIN(date) AS start
     , MAX(date) AS end
  FROM presse
 WHERE date <> 'None'
 GROUP BY 1
`
);
display(authors);
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.ruleY(authors, {
        x1: "start",
        x2: "end",
        y: "author",
        stroke: (d) => d3.isoParse(d.end) - d3.isoParse(d.start),
        sort: {
          y: "stroke",
        },
      }),
    ],
  })
);
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "start",
            fill: (d) => d3.isoParse(d.end)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        authors,
        Plot.binX(
          { y: "count" },
          {
            x: "end",
            fill: (d) => d3.isoParse(d.start)?.getUTCFullYear(),
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);
// All rows dated from 1920 up to (but not including) 1970.
const aroundWar = db.query(
  `SELECT * FROM presse WHERE date >= '1920' AND date < '1970'`
);
display(
  Plot.plot({
    x: { type: "utc" },
    marks: [
      Plot.rectY(
        aroundWar,
        Plot.binX(
          { y: "count" },
          {
            x: "date",
            thresholds: "1 year",
            tip: true,
          }
        )
      ),
    ],
  })
);