Commit 0dcdd1a by khang119966 (parent fe9a4b1): Update README.md
- Improved recognition of specific Vietnamese images because of the [5CD-AI/Viet-Localization-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Localization-VQA) dataset.
- Better balance between General VQA and Text/Document VQA.

How to Choose Between v2 and v3:

- Choose v2 if you are focusing on OCR and Doc VQA.
- Choose v3 if you are focusing on General VQA.

**We aim to:** Vietnamese soul in every token!
We are excited to introduce **Vintern-1B-v3**, a Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the latest visual model, [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2] (CVPR 2024). The model excels in tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a **4096-token context length**, it is finetuned from [InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) on over 5 million specialized image-question-answer pairs for optical character recognition, text recognition, document extraction, and general VQA. The model can be integrated into various on-device applications, demonstrating its versatility and robust capabilities.
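For intuition on why the 448px vision encoder handles documents and charts well: InternVL-family models typically split a large image into up to a dozen 448×448 tiles whose grid best matches the image's aspect ratio. The sketch below mirrors the ratio-search logic from InternVL's public preprocessing code; the function name, tile limits, and tie-break rule are assumptions here, and Vintern-1B-v3's exact pipeline may differ.

```python
# A minimal sketch of InternVL-style dynamic tiling: choose the (cols, rows)
# grid of 448x448 tiles whose aspect ratio is closest to the input image's.
# Limits and tie-break follow InternVL's published preprocessing; treat the
# details as assumptions, not Vintern-1B-v3's verified behavior.

def best_tile_grid(width, height, min_num=1, max_num=12, tile=448):
    """Return the (cols, rows) tiling grid closest to the image's aspect ratio."""
    aspect = width / height
    # All grids using between min_num and max_num tiles in total.
    candidates = {
        (c, r)
        for n in range(min_num, max_num + 1)
        for c in range(1, n + 1)
        for r in range(1, n + 1)
        if min_num <= c * r <= max_num
    }
    best, best_diff = (1, 1), float("inf")
    for c, r in sorted(candidates):
        diff = abs(aspect - c / r)
        # On a ratio tie, prefer more tiles only if the image is large
        # enough to fill at least half the tiled area.
        if diff < best_diff or (
            diff == best_diff and width * height > 0.5 * tile * tile * c * r
        ):
            best, best_diff = (c, r), diff
    return best

print(best_tile_grid(1920, 1080))  # → (4, 2): a wide image gets a wide grid
print(best_tile_grid(448, 448))   # → (1, 1): a small square image stays whole
```

Each selected tile is then resized to 448×448 and encoded independently by InternViT-300M-448px, which is what lets a 1B-parameter model read fine print in documents.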