khang119966 committed
Commit 0dcdd1a • 1 Parent(s): fe9a4b1

Update README.md

Files changed (1):
  1. README.md +5 -0
README.md CHANGED
@@ -38,6 +38,11 @@ tags:
  - Improved recognition of Vietnamese-specific images, thanks to the [5CD-AI/Viet-Localization-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Localization-VQA) dataset.
  - Better balance between General VQA and Text/Document VQA.

+ How to Choose Between v2 and v3:
+ - Choose v2 if you are focusing on OCR and Doc VQA.
+ - Choose v3 if you are focusing on General VQA.
+
+
  **We aim to:** Vietnamese soul in every token!

  We are excited to introduce **Vintern-1B-v3**, the Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)[1] with the latest visual model, [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px)[2] (CVPR 2024). This model excels in tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a **4096-token context length**, it is finetuned from the [InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) model on over 5 million specialized image-question-answer pairs for optical character recognition 🔍, text recognition 🔤, document extraction 📑, and general VQA. The model can be integrated into various on-device applications 📱, demonstrating its versatility and robust capabilities.
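For context, below is a minimal sketch of how a model in this family might be loaded and queried. It is not from this commit: the repo id `5CD-AI/Vintern-1B-v3` is assumed rather than confirmed here, the `chat()` call assumes the InternVL2-style remote-code interface inherited from the InternVL2-1B base model, and the single-tile image preprocessing is a simplification of the dynamic multi-tile preprocessing used upstream.

```python
# Minimal sketch (assumptions: repo id "5CD-AI/Vintern-1B-v3" and the
# InternVL2-style remote-code chat() interface from the base model).
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "5CD-AI/Vintern-1B-v3"  # assumed repo id, not stated in this diff

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # loads the InternVL-style modeling code from the hub
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Simplified single-tile preprocessing: one 448x448 image with ImageNet
# normalization (the InternVL2 README uses dynamic multi-tile preprocessing).
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nMô tả chi tiết bức ảnh."  # "Describe the image in detail."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```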