visheratin
/

MC-LLaVA-3b

Inference Endpoints

Model card Files Files and versions Community

MC-LLaVA-3b / README.md

visheratin's picture

Create README.md

7190045 12 months ago

|

3.49 kB

	---
	datasets:
	- liuhaotian/LLaVA-Pretrain
	- liuhaotian/LLaVA-Instruct-150K
	language:
	- en
	tags:
	- llava
	- phi
	---

	# LLaVA-3b Model Card

	## Model details

	LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) in a LLaVA fashion using vision tower from
	[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). There are a couple of things different from the original LLaVA architecture:

	1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] for images. The idea is that using more tokens
	allows to get more info from the image into the language model.
	2. The model uses the output from the latest layer of the vision encoder instead of intermediate one.

	As Dolphin 2.6 Phi, LLaVA-3b uses ChatML prompt format:

	```
	<\|im_start\|>system
	You are Dolphin, a helpful AI assistant.<\|im_end\|>
	<\|im_start\|>user
	{prompt}<\|im_end\|>
	<\|im_start\|>assistant
	```

	## How to use

	Install dependencies

	```
	!pip install -q open_clip_torch timm einops
	```

	Download modeling files

	```
	from huggingface_hub import hf_hub_download

	hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
	hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
	hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
	hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
	hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
	```

	Create a model

	```
	from modeling_llava import LlavaForConditionalGeneration
	import torch

	model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
	model = model.to("cuda")
	```

	Create processors

	```
	from transformers import AutoTokenizer
	from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

	tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
	image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
	processor = LlavaProcessor(image_processor, tokenizer)
	```

	Set image and text

	```
	from PIL import Image
	import requests

	image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
	raw_image = Image.open(requests.get(image_file, stream=True).raw)

	prompt = """<\|im_start\|>system
	A chat between a curious human and an artificial intelligence assistant.
	The assistant gives helpful, detailed, and polite answers to the human's questions.
	The assistant does not hallucinate and pays very close attention to the details.<\|im_end\|>
	<\|im_start\|>user
	<image>
	Describe the image.<\|im_end\|>
	<\|im_start\|>assistant
	"""
	```

	Process inputs

	```
	inputs = processor(prompt, raw_image, model, return_tensors='pt')

	inputs['input_ids'] = inputs['input_ids'].to(model.device)
	inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
	```

	Generate the data

	```
	output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
	```

	## License
	This model is based on Phi-2 and is governed by Microsoft's microsoft-research-license which prohibits commercial use.

	Where to send questions or comments about the model:
	https://twitter.com/visheratin