Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing Jul 19 • 17
Article: Releasing the largest multilingual open pretraining dataset By Pclanglais • 9 days ago • 94
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published 23 days ago • 8
Article: OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B By PandorAI1995 • Oct 18 • 13
Article: The case for specialized pre-training: ultra-fast foundation models for dedicated tasks By Pclanglais • Aug 4 • 26
OpenCulture Collection A multilingual dataset of public domain books and newspapers. • 27 items • Updated 15 days ago • 117