
Dias Balmash

diasbalmash

AI & ML interests

None yet

Recent Activity

liked a dataset about 1 month ago
farabi-lab/kazakh-stt
liked a model about 2 months ago
nvidia/stt_kk_ru_fastconformer_hybrid_large

Organizations

None yet

diasbalmash's activity

Reacted to merve's post with 🚀 5 months ago
Forget any document retrievers, use ColPali 💥💥

Document retrieval is usually done through OCR + layout detection, but you lose a lot of information in between. Stop doing that! 🤓

ColPali uses a vision language model, which is better at document understanding 📑
ColPali: vidore/colpali (MIT license!)
Blog post: https://huggingface.co/blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard

ColPali marries the idea of modern vision language models with retrieval 🤝

The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call it BiSigLIP). Then they feed the patch embedding outputs to PaliGemma to create BiPali 🖇️
BiPali natively passes image patch embeddings to an LLM, which makes it possible to use ColBERT-like late interaction between text tokens and image patches (hence the name ColPali!) 🤩
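
To make the late interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring between query token embeddings and page patch embeddings. It is plain Python on toy 2-d vectors, not the ColPali implementation: the function names (late_interaction_score, rank_pages) and the embeddings are made up for illustration, and a real setup would get the embeddings from the model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens: Sequence[Vector],
                           page_patches: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim: each query token picks its best-matching
    patch, and the per-token maxima are summed.
    Assumes both inputs are non-empty."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: Sequence[Vector],
               pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted from best to worst score."""
    scores = [late_interaction_score(query_tokens, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): 2 query tokens, 2 pages with a few patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match per token
page_b = [[0.5, 0.5], [0.6, 0.4]]              # only partial matches

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
ranking = rank_pages(query, [page_a, page_b])
```

Unlike the pooled (bi-encoder) setup, nothing is averaged into a single vector per page here, so a single patch that strongly matches one query token still contributes fully to the score.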

The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. On top of that, it is way faster than traditional PDF parsers too!