BobVLM βœ¨πŸ‘€

[Article on Medium πŸ“–] [Package on GitHub βš™οΈ]

BobVLM is an ambitious passion project that experiments with pre-training a capable multimodal language model on limited resources and hardware while still achieving impressive performance. The result is a 1.5B-parameter model, pre-trained on a single P100 GPU, that is capable of detailed image description and moderate question answering.

Model Architecture πŸ”§

(BobVLM architecture diagram)

Training Approach

To maintain efficiency and accessibility:

  • Vision and language components are frozen
  • Only the adapter layer is trained (a minimal sketch of this setup follows the list)
  • Supervised training, treating adapter training as a form of model fine-tuning (following Houlsby et al. (2019) on MLP adapters for transfer learning)
  • Trained on accessible hardware (T4 or P100 GPUs)
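
As a rough illustration of the adapter-only training setup, here is a minimal PyTorch sketch that freezes everything except an adapter module. The module names (vision_model, language_model, adapter) and dimensions are placeholders for illustration, not BobVLM's actual attribute names.

import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in for the frozen-towers-plus-adapter layout; not the real BobVLM classes."""
    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(768, 768)      # placeholder for the frozen vision encoder
        self.language_model = nn.Linear(2048, 2048)  # placeholder for the frozen language model
        self.adapter = nn.Sequential(                # MLP adapter bridging vision and language
            nn.Linear(768, 2048),
            nn.GELU(),
            nn.Linear(2048, 2048),
        )

def freeze_all_but_adapter(model: nn.Module) -> nn.Module:
    # Freeze every parameter, then re-enable gradients only for the adapter
    for p in model.parameters():
        p.requires_grad = False
    for p in model.adapter.parameters():
        p.requires_grad = True
    return model

model = freeze_all_but_adapter(ToyVLM())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")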

Demo

I couldn't afford the GPU prices here, so it runs quite slooowww on CPU 🀧
Check out the demo here πŸ™ƒπŸ™ƒ: Demo on Spaces

Installation

pip install git+https://github.com/logic-ot/BobVLM.git

or in a notebook

!pip install git+https://github.com/logic-ot/BobVLM.git

Usage

Basic Usage

from BobVLM import BobVLMProcessor, load_model, pipeline

# Load model and processor
model = load_model()
processor = BobVLMProcessor()

# Create pipeline
pipe = pipeline(model, processor)

# Example with URL image and system prompt
response = pipe(
    chat=[
        {"role": "system", "content": "You are an image understanding assistant. You can see and interpret images in fine detail"},
        {"role": "user", "content": "What's in this image?"},
    ],
    images="http://images.cocodataset.org/train2017/000000436349.jpg"
)

print(response)

Model Output

The image shows a large group of trucks parked in a parking lot, with a variety of vehicles, including semi-trucks, buses, and vans, all lined up in a neat and organized manner. The trucks are parked in a row, with some of them having their doors open, while others are closed. The vehicles are all yellow, with some having white or black stripes.<|eot_id|>
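
Note that the raw output may end with the model's end-of-turn token (<|eot_id|> above). Assuming the pipeline returns the generated text as a plain string (check the return type in your version), it can be cleaned up with a simple strip:

# Hypothetical post-processing: remove the trailing end-of-turn token if present
text = response if isinstance(response, str) else str(response)
text = text.replace("<|eot_id|>", "").strip()
print(text)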

Different Input Types

# 1. Local file
response = pipe(
    chat=[{"role": "user", "content": "Describe this image"}],
    images="path/to/your/image.jpg"
)

# 2. PIL Image
from PIL import Image
image = Image.open("your_image.jpg")
response = pipe(
    chat=[{"role": "user", "content": "What do you see?"}],
    images=image
)

Multiple Images

# You can pass multiple images
response = pipe(
    chat=[{"role": "user", "content": "Compare these images"}],
    images=["image1.jpg", "https://example.com/image2.jpg"]
)

Chat with Context

# Chat with context
messages = [
    {"role": "system", "content": "You are an expert at analyzing images in detail."},
    {"role": "user", "content": "What's in this image?"},
    {"role": "assistant", "content": "I see a dog playing in a park."},
    {"role": "user", "content": "What breed is it?"}
]

response = pipe(
    chat=messages,
    images="dog.jpg"
)

Requirements

  • Python 3.7+
  • transformers
  • torch
  • Pillow
  • requests
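
If you prefer to install the dependencies explicitly before installing the package, something like the following should work; the project does not specify exact version pins:

pip install transformers torch Pillow requests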

Model Card

For more detailed information about the model, visit the Hugging Face model page.

Citation

If you use BobVLM in your research, please cite:

@misc{bobvlm2024,
  author = {selfDotOsman},
  title = {BobVLM: A Lightweight Vision Language Model with Efficient Adapter Architecture},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/selfDotOsman/BobVLM-1.5b}}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
