Ahmed Masry PRO

ahmed-masry

AI & ML interests

Multimodal Chart Understanding, Multimodal Document AI, Multimodal Vision-Language Models

Recent Activity

liked a dataset 16 days ago
MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
liked a dataset 25 days ago
ServiceNow/BigDocs-Bench

Articles

Organizations

Visualizations + NLP

Posts 3

🚀 Introducing ColFlor: An Efficient, OCR-Free Vision-Language Document Retrieval Model 🌟

Earlier this year, ColPali revolutionized document retrieval by eliminating the need for error-prone OCR pipelines. Instead, it directly processes the document images. However, with its 3 billion parameters, ColPali is computationally heavy for large-scale applications.

That’s where ColFlor comes in: a smaller, faster alternative! 🎉 At 17x smaller than ColPali, ColFlor offers a more efficient, OCR-free document retrieval solution, making it ideal for users with limited computing resources (GPU Poor). 💡

Key Highlights:
🧠 174M parameters (vs. 3B for ColPali)
⚑ 9.8x faster query encoding, 5.25x faster image encoding
πŸ“‰ Only 1.8% performance drop on text-rich English documents
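
As a quick sanity check on the "17x smaller" figure, here is a minimal sketch that uses only the rounded parameter counts quoted above (3B and 174M). The variable names are illustrative, and the exact counts of the released checkpoints may differ slightly.

```python
# Rounded parameter counts as quoted in the post (assumed, not measured from checkpoints).
colpali_params = 3_000_000_000  # "3B" for ColPali
colflor_params = 174_000_000    # "174M" for ColFlor

# How many times smaller ColFlor is than ColPali.
size_ratio = colpali_params / colflor_params

print(f"ColFlor is about {size_ratio:.1f}x smaller than ColPali")
```

With these rounded inputs the ratio comes out slightly above 17, which matches the headline claim.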

Check out the full blog post for more insights on modeling, training, and evaluations across various document retrieval tasks! 🚀
Also, feel free to try our demo on Hugging Face 🤗

🔗 Resources:
📄 Blog post: https://huggingface.co/blog/ahmed-masry/colflor
🧠 Model: ahmed-masry/ColFlor
💻 Demo: ahmed-masry/ColFlor-Demo
🏋️‍♂️ Training code: https://github.com/AhmedMasryKU/colflor
📊 Evaluation code: https://github.com/AhmedMasryKU/vidore-benchmark-colflor

📢 Exciting News! Our latest paper "ChartGemma" is out! 📊

🧵1/3: ChartGemma overcomes a key limitation of existing chart models, which rely too heavily on the underlying data tables. Instead, it is trained on data generated directly from chart images, capturing crucial visual trends 📸🔍

🧵2/3: ChartGemma builds upon PaliGemma from Google Research and is fine-tuned on a high-quality visual instruction tuning dataset generated from Gemini Flash 1.5. 🌟📊

🧵3/3: It achieves state-of-the-art results in chart summarization, question answering, and fact-checking tasks. 🏅📊 It can also generate more accurate and realistic chart summaries. 📝🔍

Our model and data are publicly available. We also have a cool web demo. Check it out! 🚀
Demo: ahmed-masry/ChartGemma
Code: https://github.com/vis-nlp/ChartGemma
Paper: ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild (2407.04172)