Pierre-Carl Langlais

Pclanglais

Dorialexander

AI & ML interests

Open data & open LLMs

Articles

They Said It Couldn’t Be Done

19 days ago

• 75

Releasing the largest multilingual open pretraining dataset

Nov 13

• 98

The case for specialized pre-training: ultra-fast foundation models for dedicated tasks

Aug 4

• 27

Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing

Jul 19

• 17

Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM

Apr 26

• 15

Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

Apr 18

• 22

Releasing Common Corpus: the largest public domain dataset for training LLMs

Mar 20

• 18

Pclanglais's activity

upvoted a paper about 1 month ago

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Paper • 2411.14343 • Published Nov 21 • 7

upvoted an article about 1 month ago

Article

Releasing the largest multilingual open pretraining dataset

Nov 13

• 98

upvoted a paper about 2 months ago

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Paper • 2410.22587 • Published Oct 29 • 10

upvoted an article about 2 months ago

Article

Detoxifying the Commons

Oct 31

• 6

upvoted an article 2 months ago

Article

OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B

Oct 18

• 13

upvoted 2 articles 3 months ago

Article

VLM Art Analysis

Oct 4

• 11

Article

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??

Sep 27

• 38

upvoted an article 4 months ago

Article

SmolLM - blazingly fast and remarkably powerful

Jul 16

• 292

upvoted an article 5 months ago

Article

The case for specialized pre-training: ultra-fast foundation models for dedicated tasks

Aug 4

• 27

upvoted a collection 6 months ago

Common Pile

Collection

Datasets in the Common Pile. • 25 items • Updated Oct 29 • 4

upvoted a collection 9 months ago

OpenCulture

Collection

A multilingual dataset of public domain books and newspapers. • 27 items • Updated Nov 6 • 121