metadata

license: apache-2.0
datasets:
  - liuhaotian/LLaVA-CC3M-Pretrain-595K
base_model:
  - Qwen/Qwen2.5-0.5B
  - openai/clip-vit-large-patch14-336

Visual Language Model Based on Qwen and CLIP

This is a vision-language multimodal model built on the Qwen series of language models and the CLIP visual encoder. It was trained for 10 epochs in total, covering the LLaVA pre-training dataset and roughly 800K instruction examples (the 150K instruction fine-tuning set and the 665K mixed instruction fine-tuning set). Because the training data is large relative to the model's capacity, the model can handle only simple question answering about images, and it currently supports English only.

Training Details

The model combines the visual encoder from openai/clip-vit-base-patch32 with Qwen2.5-0.5B as the language model, using a multi-layer perceptron (MLP) as the alignment layer. The alignment layer was first trained on its own for four epochs on the pre-training dataset, although the loss showed no significant improvement after the second epoch.
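
As a rough illustration of the alignment step, the sketch below projects visual features into the language model's embedding space with a small MLP. The class name `VisionProjector`, the two-layer GELU design, and the dimensions (768 for a CLIP ViT-B/32 encoder, 896 for Qwen2.5-0.5B, 50 visual tokens) are assumptions made for this example; the actual layer is defined in the repository's qwenva.py and may differ.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Hypothetical MLP mapping vision features to the LLM embedding size."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 896):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, vision_dim) -> (batch, num_tokens, text_dim)
        return self.net(patch_features)


projector = VisionProjector()
# Random stand-in for the output of a frozen CLIP vision encoder.
dummy_features = torch.randn(1, 50, 768)
projected = projector(dummy_features)
print(projected.shape)
```

The projected tokens have the same width as the language model's word embeddings, so they can be placed in the same input sequence as the text tokens.
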
The model was then trained for three epochs on the 150K LLaVA instruction fine-tuning dataset, with a token length of 1024 in the first epoch and 2048 in the second and third. The visual encoder was frozen during this stage, so only the alignment layer and the language model were updated.
Finally, it was trained for three epochs on the 665K LLaVA instruction dataset with a token length of 2048 in every epoch, the same length used in the later epochs of the 150K stage. The visual encoder remained frozen throughout.
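
The staged freezing described above can be sketched as follows. This is a minimal illustration, not the actual training code: the three `nn.Linear` modules are tiny stand-ins for the real components, `set_trainable` and `count_trainable` are hypothetical helpers, and freezing the language model during the alignment stage is my reading of "trained separately".

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def count_trainable(*modules: nn.Module) -> int:
    """Count the parameters that would receive gradient updates."""
    return sum(
        p.numel() for m in modules for p in m.parameters() if p.requires_grad
    )


# Tiny stand-ins for the three components (hypothetical shapes).
vision_encoder = nn.Linear(8, 8)
projector = nn.Linear(8, 4)
language_model = nn.Linear(4, 4)

# Alignment stage: only the projector is updated.
set_trainable(vision_encoder, False)
set_trainable(language_model, False)
set_trainable(projector, True)
stage1_trainable = count_trainable(vision_encoder, projector, language_model)

# Instruction stages: projector and language model are updated,
# the visual encoder stays frozen.
set_trainable(language_model, True)
stage2_trainable = count_trainable(vision_encoder, projector, language_model)

print(stage1_trainable, stage2_trainable)
```

In a real run, the optimizer would be built only from parameters whose `requires_grad` flag is true, so the frozen encoder needs no optimizer state.
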
Hallucinations still occur, because a model this small struggles to fit such a large dataset. Its answer accuracy therefore cannot match that of the full LLaVA model. Even so, as a small vision-language model trained from scratch, it shows how well transformers can learn multimodal vision-language interactions. I will publish all of my training code and model files for researchers interested in vision-language models.

Training Resource Consumption

Training consumed resources: H20167h (for reference only).

Uploading Issues

I attempted to upload the model through Hugging Face's PyTorch classes, but they did not record all of my weights, which caused problems at inference time. I therefore recommend loading the model directly with PyTorch.
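
One way to detect this kind of problem is to compare the keys in a checkpoint with the keys the model expects. The sketch below is only an illustration under assumptions: `build_model` is a hypothetical stand-in for the real architecture, and the partial checkpoint is simulated by dropping one layer. It relies on `load_state_dict(..., strict=False)`, which reports missing and unexpected keys instead of raising an error.

```python
import torch.nn as nn


def build_model() -> nn.Module:
    """Hypothetical stand-in for the real model architecture."""
    return nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))


full_state = build_model().state_dict()

# Simulate a checkpoint that silently dropped the second layer.
partial_state = {k: v for k, v in full_state.items() if k.startswith("0.")}

# strict=False returns the mismatched keys rather than raising.
result = build_model().load_state_dict(partial_state, strict=False)
print("missing:", sorted(result.missing_keys))
print("unexpected:", result.unexpected_keys)
```

Any key listed as missing keeps its random initialization, which would explain degraded outputs at inference time.
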

If you do not have an image to test with, you can download the sample image from the repository, for example bird.jpeg, which shows a small bird with red and black feathers.

Loading Instructions

Below are the steps to load the model using PyTorch:

Download the qwenva.py file and the qwenva.pth weights from the repository, ensuring that both the weight and model architecture files are in the same directory.
Import the model and processor from the qwenva file, then run inference:

from qwenva import model, processor
from PIL import Image
import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # switch to inference mode (disables dropout)

# Build the model inputs from a text prompt and an image.
image = Image.open("./bird.jpeg")
input_ = processor("please describe the image", image)
input_ = {k: v.to(device) for k, v in input_.items()}

# Generate without tracking gradients.
with torch.no_grad():
    generated_ids = model.generate(
        **input_,
        max_length=512,
    )

# Drop the prompt tokens so that only the generated answer is decoded.
generated_ids = generated_ids[0][input_['input_ids'].size(1):]
response = processor.tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)

Example output for the bird image:


"The image features a small bird, possibly a cockatoo or a cockatoo, sitting on a branch in a forest. The bird appears to be looking up, possibly observing its surroundings or scanning for potential prey. The bird is surrounded by leaves and flowers on the branches, adding a sense of natural beauty to the scene. The image captures the beauty of the bird and its environment, highlighting the harmony between nature and human-made objects."