Navyabhat/Llava-Phi2

Tags: Visual Question Answering, text-generation

Model Card for Llava-Phi2

This is a multimodal implementation of the Phi2 model, inspired by LLaVA-Phi.

Model Details

LLM Backbone: Phi2
Vision Tower: clip-vit-large-patch14-336
Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
Finetuning Dataset: Instruct 150k dataset based on COCO
Finetuned Model: Navyabhat/Llava-Phi2
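
As a rough illustration of what the vision tower listed above feeds into the LLM backbone, the sketch below counts the image patches implied by the name clip-vit-large-patch14-336 (patch size 14, input resolution 336). The helper `num_image_patches` is hypothetical and not part of the llava-phi code base; whether a class token is also passed to the backbone is not stated in this card, so the count covers patches only.

```python
# Hypothetical helper (not from the llava-phi repository): counts the patches
# a ViT-style vision tower produces for a square input image.
def num_image_patches(image_size: int, patch_size: int) -> int:
    """Return the number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Values read off the vision tower name: patch14 -> 14, 336 -> 336.
patches = num_image_patches(image_size=336, patch_size=14)
print(patches)  # 576 (24 patches per side)
```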

Model Sources

Original Repository: Llava-Phi
Paper: LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
Demo: Demo Link

How to Get Started with the Model

Use the code below to get started with the model.

Clone the LLaVA-Phi repository and navigate to the llava-phi folder

git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi

Install Package

conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Run the Model

python llava_phi/eval/run_llava_phi.py --model-path="Navyabhat/Llava-Phi2" \
    --image-file="https://huggingface.co/Navyabhat/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
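
To ask several questions without retyping the command, the same script can be driven from Python. The following is a minimal sketch, assuming it is run from an environment where the package is installed and that the cloned repository lives in a folder named `llava-phi`. The helpers `build_command` and `ask` are hypothetical wrappers, not part of llava-phi; only the script path and the three flags are taken from the command above.

```python
import shlex
import subprocess
import sys


def build_command(model_path: str, image_file: str, query: str) -> list[str]:
    """Assemble the argument list for run_llava_phi.py (flags as in the command above)."""
    return [
        sys.executable,
        "llava_phi/eval/run_llava_phi.py",
        f"--model-path={model_path}",
        f"--image-file={image_file}",
        f"--query={query}",
    ]


def ask(model_path: str, image_file: str, query: str, repo_dir: str = "llava-phi") -> str:
    """Run the script inside the cloned repository and return whatever it prints."""
    result = subprocess.run(
        build_command(model_path, image_file, query),
        cwd=repo_dir,  # assumption: path to the cloned llava-phi folder
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Finetuned model and sample image as named in this card.
MODEL_PATH = "Navyabhat/Llava-Phi2"
IMAGE_URL = "https://huggingface.co/Navyabhat/Llava-Phi2/resolve/main/people.jpg?download=true"

# Print the command only; call ask(...) to actually run the model.
print(shlex.join(build_command(MODEL_PATH, IMAGE_URL, "How many people are there in the image?")))
```

Passing the arguments as a list avoids shell quoting problems with the `?` in the image URL and with spaces in the query.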

Acknowledgement

This implementation is based on the wonderful work done by:
LLaVA-Phi
LLaVA
Phi2

Model size: 3.09B params (Safetensors)
Tensor type: F32
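
The parameter count and tensor type above give a rough lower bound on the memory needed just to hold the weights. The sketch below is a back-of-the-envelope estimate; the helper `weight_memory_gib` is hypothetical, and the figure excludes activations, the image features, and any framework overhead.

```python
# Hypothetical helper: memory needed to store the weights alone.
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the weight storage size in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# 3.09B parameters stored as F32, i.e. 4 bytes per parameter.
fp32 = weight_memory_gib(3.09e9, 4)
print(f"{fp32:.1f} GiB")  # 11.5 GiB
```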
