---
base_model: OpenGVLab/InternVL2-1B
datasets:
- 5CD-AI/Viet-OCR-VQA
- 5CD-AI/Viet-Doc-VQA
- 5CD-AI/Viet-Doc-VQA-II
- Vi-VLM/Vista
- 5CD-AI/Viet-Receipt-VQA
- 5CD-AI/Viet-Sketches-VQA
- 5CD-AI/Viet-Geometry-VQA
- 5CD-AI/Viet-Wiki-Handwriting
- 5CD-AI/Viet-ComputerScience-VQA
- 5CD-AI/Viet-Handwriting-gemini-VQA
- 5CD-AI/Viet-Menu-gemini-VQA
- 5CD-AI/Viet-Vintext-gemini-VQA
- 5CD-AI/Viet-OpenViVQA-gemini-VQA
- 5CD-AI/Viet-Resume-VQA
- 5CD-AI/Viet-ViTextVQA-gemini-VQA
- 5CD-AI/Viet-Localization-VQA
language:
- vi
- en
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- vision
---
## Vintern-1B-v3 ❄️ (Viet-InternVL2-1B-v3) - The LLaVA 🌋 Challenger

**What's new in Vintern-1B-v3!**
- Faster inference: the maximum dynamic resolution is now 6 tiles instead of 12, with no loss in quality.
- Improved recognition of distinctly Vietnamese images, thanks to the [5CD-AI/Viet-Localization-VQA](https://huggingface.co/datasets/5CD-AI/Viet-Localization-VQA) dataset.
- A better balance between general VQA and text/document VQA.

**Our aim:** Vietnamese soul in every token!

We are excited to introduce **Vintern-1B-v3**, a Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese-capable language model [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) [1] with the latest visual model, [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) [2] (CVPR 2024). The model excels at tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a **4096-token context length**, it was fine-tuned from [InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) on over 5 million specialized image-question-answer pairs covering optical character recognition 🔍, text recognition 🔤, document extraction 📑, and general VQA. The model can be integrated into various on-device applications 📱, demonstrating its versatility and robustness.

[**\[🤗 HF Demo\]**](https://huggingface.co/spaces/khang119966/Vintern-v3-Demo)

Notably, the model can easily be fine-tuned on a single T4 GPU in Google Colab by following the instructions provided at the end of this card.

## Model Details

| Model Name    | Vision Part                                                                    | Language Part                                                           |
| :-----------: | :----------------------------------------------------------------------------: | :---------------------------------------------------------------------: |
| Vintern-1B-v3 | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |

Vintern-1B-v3 is an instruction-tuned multimodal large language model consisting of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).
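For a quick look at how these two parts are wired together in the released checkpoint, the configuration shipped with the remote code can be inspected. This is a minimal sketch; the attribute names (`vision_config`, `llm_config`, `max_dynamic_patch`) follow the upstream InternVL2 configuration class and are assumptions here, not something this card documents.

```python
from transformers import AutoConfig

# Inspect the configuration shipped with the checkpoint. trust_remote_code is
# required because the repository provides its own InternVL2-style config/model code.
config = AutoConfig.from_pretrained("5CD-AI/Vintern-1B-v3", trust_remote_code=True)

# The attribute names below follow the upstream InternVLChatConfig and are
# assumptions, not an API documented by this card.
print(config.vision_config.image_size)   # ViT input resolution (448 for InternViT-300M-448px)
print(config.llm_config.hidden_size)     # hidden size of the Qwen2-0.5B-Instruct backbone
print(config.max_dynamic_patch)          # maximum number of dynamic-resolution tiles
```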
## Training details 📚

The fine-tuning dataset was meticulously sampled in part from the following datasets:
[Viet-OCR-VQA 📚](https://huggingface.co/datasets/5CD-AI/Viet-OCR-VQA), [Viet-Doc-VQA 📄](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA), [Viet-Doc-VQA-II 📑](https://huggingface.co/datasets/5CD-AI/Viet-Doc-VQA-II), [Vista 🖼️](https://huggingface.co/datasets/Vi-VLM/Vista), [Viet-Receipt-VQA 🧾](https://huggingface.co/datasets/5CD-AI/Viet-Receipt-VQA), [Viet-Sketches-VQA ✏️](https://huggingface.co/datasets/5CD-AI/Viet-Sketches-VQA), [Viet-Geometry-VQA 📐](https://huggingface.co/datasets/5CD-AI/Viet-Geometry-VQA), [Viet-Wiki-Handwriting ✍️](https://huggingface.co/datasets/5CD-AI/Viet-Wiki-Handwriting), [Viet-ComputerScience-VQA 💻](https://huggingface.co/datasets/5CD-AI/Viet-ComputerScience-VQA), [Viet-Handwriting-gemini-VQA 🖋️](https://huggingface.co/datasets/5CD-AI/Viet-Handwriting-gemini-VQA), [Viet-Menu-gemini-VQA 🍽️](https://huggingface.co/datasets/5CD-AI/Viet-Menu-gemini-VQA), [Viet-Vintext-gemini-VQA 📜](https://huggingface.co/datasets/5CD-AI/Viet-Vintext-gemini-VQA), [Viet-OpenViVQA-gemini-VQA 🧠](https://huggingface.co/datasets/5CD-AI/Viet-OpenViVQA-gemini-VQA), [Viet-Resume-VQA 📃](https://huggingface.co/datasets/5CD-AI/Viet-Resume-VQA), [Viet-ViTextVQA-gemini-VQA 📑](https://huggingface.co/datasets/5CD-AI/Viet-ViTextVQA-gemini-VQA), and especially [**Viet-Localization-VQA 🇻🇳**](https://huggingface.co/datasets/5CD-AI/Viet-Localization-VQA).
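All of these sources are hosted on the Hugging Face Hub, so any of them can be pulled down for inspection with the `datasets` library. A minimal sketch is below; the `train` split name and the record layout are assumptions rather than something this card documents.

```python
from datasets import load_dataset

# Pull one of the fine-tuning sources listed above for a quick look.
# The split name "train" is an assumption about the dataset layout.
ds = load_dataset("5CD-AI/Viet-OCR-VQA", split="train")

print(ds)            # number of records and column names
print(ds[0].keys())  # fields of a single image-question-answer record
```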
## Benchmarks 📈

We are still working on more detailed benchmarks.

## Examples

## Quickstart

Below is a code snippet showing how to load the tokenizer and the model and how to generate content. To run inference with the model, follow the steps outlined in our Colab inference notebook [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZD1oB56PF0lF66RCuTVJYLTEV0tM3CFf?usp=sharing)

```python
import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-1B-v3",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-1B-v3", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)

question = '<image>\nMô tả hình ảnh một cách chi tiết.'  # "Describe the image in detail."

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

#question = "Câu hỏi khác ......"  # "Another question ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')
```
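If you want tokens to appear as they are generated, the upstream InternVL2 quickstart wires a `TextIteratorStreamer` through `chat()`. The sketch below applies the same pattern to this checkpoint, assuming the remote-code `chat()` forwards extra generation kwargs to `generate()` as in upstream InternVL2; it reuses `model`, `tokenizer`, `pixel_values`, and `question` from the snippet above.

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Stream the answer token by token. Assumes the remote-code chat() forwards
# extra kwargs in generation_config to generate(), as in upstream InternVL2.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)

thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Print each decoded chunk as soon as it is produced.
for new_text in streamer:
    print(new_text, end='', flush=True)
thread.join()
```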
## Finetune on your Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bK6fpWfResjv9UxWoKHDStXQ8bop3a6Z?usp=sharing)

## Citation

```
@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese},
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480},
}
```

## References

[1] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).

[2] Chen, Zhe, et al. "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3] Chen, Zhe, et al. "How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).

[4] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).