---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
base_model:
- Qwen/Qwen2.5-0.5B
- openai/clip-vit-large-patch14-336
---

# Visual Language Model Based on Qwen and CLIP

This is a visual language multimodal model built upon the Qwen series language models and the CLIP visual encoder. It was trained for 10 epochs in total on the LLaVA pre-training dataset and roughly 800K instruction examples (150K instruction fine-tuning and 665K mixed instruction fine-tuning). Because the model is small relative to the amount of training data, it can only perform simple question-answering tasks on images, and it currently supports only English.

## Training Details

- The model combines the visual encoder from `openai/clip-vit-base-patch32` with `qwen2.5-0.5B` as the language model, using a Multi-Layer Perceptron (MLP) layer for alignment. The alignment layer was trained separately for four epochs on the pre-training dataset, but the loss did not improve significantly after the second epoch.
- It was then trained for three epochs on the 150K LLaVA instruction fine-tuning dataset, with a token length of 1024 in the first epoch and 2048 in the second and third epochs. The visual encoder was frozen during this stage, so only the alignment layer and the language model were trained.
- Finally, it was trained for three epochs on the 665K LLaVA instruction dataset with a token length of 2048 in every epoch. As in the 150K stage, the visual encoder remained frozen throughout.
- The model still hallucinates, because such a small model struggles to fit a large dataset. Its answer accuracy therefore cannot be compared to that of the full LLaVA model. However, as a small visual language model trained from scratch, it demonstrates the powerful multimodal learning capability of transformers in visual language interactions.

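The three training stages listed above can be summarised as a small configuration sketch. This is only an illustration of the schedule described in this card, not the actual training code: the `Stage` class and the module names `alignment_mlp` and `language_model` are hypothetical, and treating the alignment layer as the only trainable module in the first stage is an assumption based on the statement that it was "trained separately".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage; all names here are illustrative, not from the repository."""
    name: str
    dataset: str
    # One entry per epoch; None means the card does not state a token length.
    max_tokens_per_epoch: Tuple[Optional[int], ...]
    # Hypothetical module names; the visual encoder is never listed as trainable.
    trainable: Tuple[str, ...]

    @property
    def epochs(self) -> int:
        return len(self.max_tokens_per_epoch)


STAGES = (
    # Assumption: only the alignment MLP is updated in this stage.
    Stage("alignment pre-training", "LLaVA-CC3M-Pretrain-595K",
          (None,) * 4, ("alignment_mlp",)),
    Stage("instruction fine-tuning", "LLaVA 150K instruction data",
          (1024, 2048, 2048), ("alignment_mlp", "language_model")),
    Stage("mixed instruction fine-tuning", "LLaVA 665K instruction data",
          (2048,) * 3, ("alignment_mlp", "language_model")),
)

total_epochs = sum(stage.epochs for stage in STAGES)

for stage in STAGES:
    print(f"{stage.name}: {stage.epochs} epochs, "
          f"max tokens per epoch {stage.max_tokens_per_epoch}, "
          f"trainable {stage.trainable}")
print(f"total epochs: {total_epochs}")
```

Summing the per-stage epochs gives the 10 epochs quoted in the introduction (4 + 3 + 3).
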

I will publish all of my training code and model files for researchers interested in visual language models.

### Training Resource Consumption

- Training consumed 1 × H20 GPU for 67 hours (for reference only).

### Uploading Issues

I attempted to upload the model using Hugging Face's PyTorch classes, but they did not record all of my weights, which caused problems during inference. I therefore recommend loading the model directly with PyTorch. If you do not have an image at hand, you can download one from the repository; it shows a small bird with red and black feathers:

![a small bird with red and black](./bird.jpeg)

### Loading Instructions

Follow these steps to load the model using PyTorch:

1. Download the `qwenva.py` file and the `qwenva.pth` weights from the repository, and make sure the weights file and the model architecture file are in the same directory.
2. Import the model and processor from the `qwenva` file and run inference:

```python
from qwenva import model, processor
from PIL import Image
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the model inputs from a text prompt and an image.
image = Image.open("./bird.jpeg")
input_ = processor("please describe the image", image)
input_ = {k: v.to(device) for k, v in input_.items()}
model.to(device)

generated_ids = model.generate(
    **input_,
    max_length=512,
)

# Drop the prompt tokens and decode only the newly generated ones.
generated_ids = generated_ids[0][input_['input_ids'].size(1):]
response = processor.tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
```

Example output:

"The image features a small bird, possibly a cockatoo or a cockatoo, sitting on a branch in a forest. The bird appears to be looking up, possibly observing its surroundings or scanning for potential prey. The bird is surrounded by leaves and flowers on the branches, adding a sense of natural beauty to the scene. The image captures the beauty of the bird and its environment, highlighting the harmony between nature and human-made objects."