French public domain newspapers

A quick glance at 3 million periodicals

by Fil

This fascinating new dataset just dropped on Hugging Face: French public domain newspapers 🤗 references about 3 million newspapers and periodicals, with their full text OCR’ed and some metadata.

The data is stored in 320 large parquet files. The data loader for this Observable Framework project uses DuckDB to read these files (about 200GB altogether) and combines a minimal subset of their metadata (title and year of publication; most importantly, without the text contents) into a single highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
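To make the loader step more concrete, here is a minimal sketch of the kind of SQL statement such a loader could hand to DuckDB. It is not the project's actual loader: the helper name `buildCopyStatement`, the file names and the column names are all assumptions made for illustration, and the real dataset schema may differ. It relies only on DuckDB's `read_parquet` (which accepts a list of files) and on `COPY ... TO ... (FORMAT PARQUET)`.

```js
// Sketch only: builds the SQL a DuckDB-based data loader could run.
// File names and column names passed in are placeholders, not the real schema.
function buildCopyStatement(parquetUrls, columns, outputPath) {
  // Quote each file as a SQL string literal, doubling any single quotes.
  const files = parquetUrls.map((u) => `'${u.replace(/'/g, "''")}'`).join(", ");
  // Quote each column as a SQL identifier, doubling any double quotes.
  const cols = columns.map((c) => `"${c.replace(/"/g, '""')}"`).join(", ");
  const target = outputPath.replace(/'/g, "''");
  return [
    "COPY (",
    `  SELECT ${cols}`,
    `  FROM read_parquet([${files}])`,
    `) TO '${target}' (FORMAT PARQUET);`,
  ].join("\n");
}

// Hypothetical inputs: two of the source files, and only the light columns.
const sql = buildCopyStatement(
  ["part-0.parquet", "part-1.parquet"],
  ["title", "year"],
  "presse.parquet"
);
console.log(sql);
```

Selecting only the small columns is what keeps the output light: the OCR text never leaves the source files.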

The resulting file is small enough (about 8MB) that we can load it in the browser and create “live” charts with Observable Plot.

In this project, I’m exploring two aspects of the dataset:

  • As I played with the titles, I saw that the word “gazette” was quite frequent in the 17th century. An exploration of the words used in the titles is on the page gazette.

  • A lot of publications stopped or started publishing during the Second World War. This is explored in resistance.

This page summarizes the time distribution of the data:

import { DuckDBClient } from "npm:@observablehq/duckdb";
const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
const dates = db.query("SELECT year FROM presse WHERE year >= '1000'");

Note: due to the date pattern matching I’m using, unknown years are marked as 0000. Hence the filter above.
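The exact pattern is not shown on this page, so the following is only a sketch of how this kind of matching can end up producing 0000. It assumes years are kept as four-character strings; `extractYear`, `keepKnownYears` and the sample dates are hypothetical.

```js
// Hypothetical helper: take the first run of four digits from a free-form
// date string, and fall back to "0000" when nothing matches.
function extractYear(rawDate) {
  const match = /\d{4}/.exec(String(rawDate ?? ""));
  return match ? match[0] : "0000";
}

// Four-character, zero-padded strings sort like numbers, so comparing
// against "1000" is enough to drop the "0000" placeholders.
function keepKnownYears(rows) {
  return rows.filter((d) => d.year >= "1000");
}

// Made-up sample values, including two that carry no usable year.
const rows = ["1885-03-12", "s.d.", "18 juin 1940", null].map((raw) => ({
  year: extractYear(raw),
}));
const known = keepKnownYears(rows);
console.log(known.map((d) => d.year));
```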

The chart below indicates that the bulk of the contents collected in this database was published between 1850 and 1950. It’s obviously not that the press stopped after 1950, but that most of the printed word after that threshold year is still out of reach of researchers, as it is “protected” by copyright or droit d’auteur.

${Plot.rectY(dates, Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })).plot({ marginLeft: 60 })}

Plot.plot({
  marks: [
    Plot.rectY(
      dates,
      Plot.binX({ y: "count" }, { x: "year", interval: "5 years" })
    ),
  ],
});
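For readers curious about what the bin transform does with an interval of 5 years, the counting step can be approximated in plain JavaScript. This is a rough sketch that works on numeric years rather than dates, not Plot's implementation; `countByInterval` and the sample years are made up.

```js
// Count values per fixed-width bin, e.g. 1850-1854, 1855-1859, ...
function countByInterval(years, step = 5) {
  const counts = new Map();
  for (const y of years) {
    const start = Math.floor(y / step) * step; // lower bound of the bin
    counts.set(start, (counts.get(start) ?? 0) + 1);
  }
  // Sort bins chronologically and expose their half-open extent [start, end).
  return [...counts]
    .sort((a, b) => a[0] - b[0])
    .map(([start, count]) => ({ start, end: start + step, count }));
}

// Made-up sample years.
const bins = countByInterval([1851, 1853, 1855, 1900, 1904]);
console.log(bins);
```

Each resulting object corresponds to one rectangle in the chart: its horizontal extent is the bin, and its height is the count.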