Since it is release season, at PleIAs we are announcing our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction) and releasing the largest multimodal dataset of financial documents, Finance Commons: https://huggingface.co/blog/Pclanglais/finance-commons-bad-data-toolbox
LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, this made them more resilient to the text sources commonly used in production.
Having a wider range of real-life data proved critical for this project. A few months after the release of Common Corpus, we expanded our pool of "training data commons" with a major multimodal resource: documents released as open financial data. Finance Commons comprises 17 billion tokens and 1.25 million PDF corporate documents released by the SEC, WTO, AMF, and EU Tenders, in multiple languages, with a large variety of document layouts and challenging sources to train more robust models.
With HuggingFace compute support, we release an entire pipeline to process bad data sources and make them usable in production for LLMOps or simply retrieval: PleIAs/PleIAs-Editor
This approach is based on our new series of specialized models for document processing, the "bad data toolbox" comprising:
*OCRonos, the best available model to date for OCR correction. PleIAs/OCRonos
*Segmentext, a small, purely semantic model for text segmentation, working without any visual reference. PleIAs/Segmentext
*BibTexer, a small model for bibliographic data extraction acting as a "reversed-Zotero." PleIAs/BibTexer