Commit 0dcdd1a by khang119966 (parent fe9a4b1): Update README.md
- Improved recognition of specific Vietnamese images because of the [5CD-AI/Viet-Localization-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Localization-VQA) dataset.
- Better balance between General VQA and Text/Document VQA.

How to Choose Between v2 and v3:

- Choose v2 if you are focusing on OCR and Doc VQA.
- Choose v3 if you are focusing on General VQA.

**We aim to:** Vietnamese soul in every token!
We are excited to introduce **Vintern-1B-v3**, a Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the latest visual model, [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2] (CVPR 2024). The model excels in tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a **4096-token context length**, it is finetuned from [InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) on over 5 million specialized image-question-answer pairs for optical character recognition, text recognition, document extraction, and general VQA. The model can be integrated into various on-device applications, demonstrating its versatility and robust capabilities.
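For intuition on why the 448px vision encoder handles documents and charts well: InternVL-family models typically split a large image into up to a dozen 448×448 tiles whose grid best matches the image's aspect ratio. The sketch below mirrors the ratio-search logic from InternVL's public preprocessing code; the function name, tile limits, and tie-break rule are assumptions here, and Vintern-1B-v3's exact pipeline may differ.

```python
# A minimal sketch of InternVL-style dynamic tiling: choose the (cols, rows)
# grid of 448x448 tiles whose aspect ratio is closest to the input image's.
# Limits and tie-break follow InternVL's published preprocessing; treat the
# details as assumptions, not Vintern-1B-v3's verified behavior.

def best_tile_grid(width, height, min_num=1, max_num=12, tile=448):
    """Return the (cols, rows) tiling grid closest to the image's aspect ratio."""
    aspect = width / height
    # All grids using between min_num and max_num tiles in total.
    candidates = {
        (c, r)
        for n in range(min_num, max_num + 1)
        for c in range(1, n + 1)
        for r in range(1, n + 1)
        if min_num <= c * r <= max_num
    }
    best, best_diff = (1, 1), float("inf")
    for c, r in sorted(candidates):
        diff = abs(aspect - c / r)
        # On a ratio tie, prefer more tiles only if the image is large
        # enough to fill at least half the tiled area.
        if diff < best_diff or (
            diff == best_diff and width * height > 0.5 * tile * tile * c * r
        ):
            best, best_diff = (c, r), diff
    return best

print(best_tile_grid(1920, 1080))  # → (4, 2): a wide image gets a wide grid
print(best_tile_grid(448, 448))   # → (1, 1): a small square image stays whole
```

Each selected tile is then resized to 448×448 and encoded independently by InternViT-300M-448px, which is what lets a 1B-parameter model read fine print in documents.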