Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing Jul 19 • 17
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages Paper • 2411.14343 • Published Nov 21 • 7
Article Releasing the largest multilingual open pretraining dataset By Pclanglais • Nov 13 • 98
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29 • 10
Article OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B By PandorAI1995 • Oct 18 • 13
Article The case for specialized pre-training: ultra-fast foundation models for dedicated tasks By Pclanglais • Aug 4 • 27
OpenCulture Collection A multilingual dataset of public domain books and newspapers. • 27 items • Updated Nov 6 • 121