
Anastasia Stasenko

anastasiastasenko

AI & ML interests

LLMs for humanities, social sciences and public good; ethical AI

Recent Activity

updated a model 29 days ago
PleIAs/Pleias-Pico
updated a model 29 days ago
PleIAs/Pleias-Nano

Organizations

AgentPublic, Women on Hugging Face, PleIAs, Social Post Explorers

anastasiastasenko's activity

upvoted an article 29 days ago
upvoted an article about 2 months ago

Releasing the largest multilingual open pretraining dataset

upvoted an article 2 months ago
upvoted an article 3 months ago
reacted to clem's post with ❤️ 11 months ago
reacted to Pclanglais's post with ❤️ 11 months ago
Hi everyone,
For my first post, I'm announcing a big release (in multiple ways): probably the largest open corpus in French to date, with 85 billion words in the public domain.
The dataset was prepared in collaboration with Benoît de Courson and Benjamin Azoulay from Gallicagram (https://shiny.ens-paris-saclay.fr/app/gallicagram). Gallicagram is a major cultural analytics project in French: an open and improved alternative to Ngram Viewer for large-scale search of word and n-gram occurrences.
The corpus is made up of two datasets: one for monographs (16B words), PleIAs/French-PD-Newspapers, and one for newspapers/periodicals (69B words), PleIAs/French-PD-Newspapers. Along with the full text, it also includes core provenance metadata.
Beyond research in digital humanities, the corpus can also be used to train open and reproducible LLMs. Being in the public domain means it can be released anywhere, in any form, without restrictions.
The corpus is not perfect: digitization of cultural heritage is challenging and, especially for newspapers, we have to contend with layout issues and a significant rate of optical character recognition errors. Our conviction is that releasing the corpus as a commons is the best way to improve on this. Sharing is caring.