---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---
# Model Card for Llava-Phi2

This is a multimodal implementation of the Phi-2 model, inspired by LLaVA-Phi.
## Model Details

- LLM Backbone: Phi2
- Vision Tower: clip-vit-large-patch14-336
- Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
- Finetuning Dataset: Instruct 150k dataset based on COCO
- Finetuned Model: RaviNaik/Llava-Phi2
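Following the LLaVA recipe, the vision tower's patch features are projected into the LLM's embedding space before being fed to the language model. Below is a minimal numpy sketch of that projection, assuming the standard hidden sizes (1024 for CLIP ViT-L/14-336, 2560 for Phi-2) and a single linear layer; the actual trained projector may differ (LLaVA variants often use a small MLP).

```python
import numpy as np

# Hidden sizes assumed from the published model configs:
# CLIP ViT-L/14-336 emits 1024-dim patch features; Phi-2 uses 2560-dim embeddings.
clip_hidden, phi_hidden = 1024, 2560

rng = np.random.default_rng(0)
# Illustrative (random) projector weights; the real projector is learned
# during the LAION-CC-SBU pretraining stage listed above.
W = rng.standard_normal((clip_hidden, phi_hidden)) * 0.02

# A 336x336 image with 14x14 patches yields (336/14)^2 = 576 patch tokens.
patch_features = rng.standard_normal((576, clip_hidden))

# Project image patches into the LLM embedding space, ready to be
# concatenated with the text token embeddings.
image_tokens = patch_features @ W
print(image_tokens.shape)  # (576, 2560)
```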
### Model Sources

- Original Repository: [Llava-Phi](https://github.com/zhuyiche/llava-phi)
- Paper: LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
- Demo: Demo Link
## How to Get Started with the Model

Use the code below to get started with the model.

- Clone this repository and navigate to the llava-phi folder:

```bash
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```

- Install the package:

```bash
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Run the model:

```bash
python llava_phi/eval/run_llava_phi.py --model-path="RaviNaik/Llava-Phi2" \
    --image-file="https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
```
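If you prefer to launch the evaluation script from Python, the same invocation can be assembled programmatically. A small sketch, assuming the repository is installed as above; the helper function below is hypothetical and not part of the llava-phi codebase:

```python
import subprocess

def build_llava_phi_cmd(model_path: str, image: str, query: str) -> list[str]:
    """Assemble the argument list for llava_phi/eval/run_llava_phi.py (hypothetical helper)."""
    return [
        "python", "llava_phi/eval/run_llava_phi.py",
        f"--model-path={model_path}",
        f"--image-file={image}",
        f"--query={query}",
    ]

cmd = build_llava_phi_cmd(
    "RaviNaik/Llava-Phi2",
    "https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true",
    "How many people are there in the image?",
)
# subprocess.run(cmd, check=True)  # uncomment to launch from inside the llava-phi folder
print(" ".join(cmd[:2]))
```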
## Acknowledgement

This implementation is based on the wonderful work done by:

- LLaVA-Phi
- LLaVA
- Phi2