Kale (Zyn123)

AI & ML interests

None yet

Recent Activity

upvoted an article about 1 month ago
liked a model about 1 month ago: EmergentMethods/gliner_medium_news-v2.1

Organizations

None yet
Zyn123's activity

upvoted an article 27 days ago
upvoted an article about 1 month ago: Fine-tune Llama 2 with DPO • 34
upvoted an article about 1 month ago: How to build a custom text classifier without days of human labeling, by sdiazlor • 55
upvoted an article about 2 months ago: Fine-tuning LLMs to 1.58bit: extreme quantization made easy • 205
upvoted 3 articles 3 months ago: Fine-Tune Whisper with 🤗 Transformers • 121; Llama-3.1-Storm-8B: Improved SLM with Self-Curation + Model Merging, by akjindal53244 • 73
upvoted an article 4 months ago: TGI Multi-LoRA: Deploy Once, Serve 30 Models • 51
upvoted 2 articles 6 months ago: makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch, by AviSoori1x • 40
upvoted an article 6 months ago
upvoted an article 7 months ago
Reacted to Molbap's post with 🔥 8 months ago:
🚀🚀 Exciting times for the document AI community!

We're thrilled to announce the release of some of the largest OCR datasets available to the public.
🔥 With over 26 million pages, 18 billion text tokens, and 6TB of data, these resources are a significant leap forward for document AI research.

Here's how to access these datasets quickly:

from datasets import load_dataset

# streaming=True iterates over samples lazily instead of downloading
# the full datasets up front
pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)

This lets you stream the data directly, integrating seamlessly with your projects via the Hugging Face datasets library. On the Hub, you can find them here:

pixparse/pdfa-eng-wds
pixparse/idl-wds
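Because the splits are streamed, samples arrive lazily as you iterate, so you can inspect a handful of records without pulling down terabytes; itertools.islice is the usual tool. A minimal sketch of the pattern, using a stand-in generator in place of the real streamed split (which requires network access and the datasets library):

```python
from itertools import islice

# Stand-in for load_dataset('pixparse/pdfa-eng-wds', streaming=True)['train'];
# a streaming split behaves like a lazy iterable of dict samples.
def fake_streaming_split():
    for i in range(1_000_000):  # never fully materialized in memory
        yield {"__key__": f"pdf_{i}", "pages": []}

# Peek at the first 3 samples without consuming the rest of the stream.
first_three = list(islice(fake_streaming_split(), 3))
print([s["__key__"] for s in first_three])  # ['pdf_0', 'pdf_1', 'pdf_2']
```

The field names here (`__key__`, `pages`) are illustrative; check the dataset card for the actual sample schema.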

For lean data loading, the new [chug](https://github.com/huggingface/chug) library offers a solution with built-in PDF decoding:


import chug

# Task config: read every page of each document
task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,   # yield individual samples rather than batches
    format='hfids',    # load through the Hugging Face datasets backend
    num_workers=0,     # single-process loading
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))



We owe a huge thank you to Peter Wyatt, Kate Tasker, Rachel Taketa, Ali Furkan Biten, Ruben Tito, and their colleagues for their contributions. Their work putting these datasets together has been invaluable. 🤗

Looking Ahead:

We're on a mission to enhance document AI capabilities, and these datasets are just the beginning. With your engagement and innovation, we're confident in the community's ability to develop robust OCR solutions. We encourage you to explore these datasets, experiment with the code, and contribute to the collective progress in document AI.

For detailed information on usage and licensing, please refer to the dataset cards on the Hugging Face hub.
liked a Space 8 months ago