|
--- |
|
license: llama3.1 |
|
language: |
|
- en |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- text-generation-inference |
|
--- |
|
|
|
# Dragonfly Model Card |
|
|
|
**Note: Users are permitted to use this model in accordance with the Llama 3.1 Community License Agreement.** |
|
|
|
## Model Details |
|
|
|
Dragonfly is a multimodal visual-language model, trained by instruction tuning on top of Llama 3.1.
|
|
|
- **Developed by:** [Together AI](https://www.together.ai/) |
|
- **Model type:** An autoregressive visual-language model based on the transformer architecture |
|
- **License:** [Llama 3.1 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE) |
|
- **Finetuned from model:** [Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/togethercomputer/Dragonfly |
|
- **Paper:** https://arxiv.org/abs/2406.00977 |
|
|
|
## Uses |
|
|
|
The primary use of Dragonfly is research on large visual-language models.

Its intended users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
### Installation
|
|
|
Create a conda environment and install the necessary packages:
|
```bash |
|
conda env create -f environment.yml |
|
conda activate dragonfly_env |
|
``` |
|
|
|
Install FlashAttention:
|
```bash |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
Finally, install the Dragonfly package in editable mode:
|
```bash |
|
pip install --upgrade -e . |
|
``` |
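
You can optionally verify the setup with a quick sanity check. This one-liner reuses the imports from the inference example below and assumes the package installed under the `dragonfly` name on a machine with a visible CUDA device:

```bash
python -c "import torch; from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM; print('CUDA available:', torch.cuda.is_available())"
```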
|
|
|
### Inference
|
|
|
Once the installation completes successfully, you can follow the steps below.
|
|
|
Question: What is so funny about this image? |
|
|
|
![Monalisa Dog](monalisa_dog.jpg) |
|
|
|
Load the necessary packages:
|
```python |
|
import torch |
|
from PIL import Image |
|
from transformers import AutoProcessor, AutoTokenizer |
|
|
|
from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM |
|
from dragonfly.models.processing_dragonfly import DragonflyProcessor |
|
from pipeline.train.train_utils import random_seed |
|
``` |
|
|
|
Instantiate the tokenizer, processor, and model. |
|
```python |
|
device = torch.device("cuda:0") |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2") |
|
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336") |
|
image_processor = clip_processor.image_processor |
|
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd") |
|
|
|
model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2") |
|
model = model.to(torch.bfloat16) |
|
model = model.to(device) |
|
``` |
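
If you prefer to avoid the intermediate full-precision copy, you can usually pass the dtype at load time. This is a minimal sketch assuming `DragonflyForCausalLM` supports the standard `transformers` `from_pretrained` keyword arguments:

```python
# Sketch (assumption): load the weights directly in bfloat16 instead of
# casting after loading, relying on the standard `torch_dtype` keyword.
model = DragonflyForCausalLM.from_pretrained(
    "togethercomputer/Llama-3.1-8B-Dragonfly-v2",
    torch_dtype=torch.bfloat16,
).to(device)
```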
|
|
|
Now, let's load and process the image.
|
```python |
|
image = Image.open("./monalisa_dog.jpg")  # the Mona Lisa dog image shown above
|
image = image.convert("RGB") |
|
images = [image] |
|
# images = [None] # if you do not want to pass any images |
|
|
|
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" |
|
|
|
inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True) |
|
inputs = inputs.to(device) |
|
``` |
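
If you plan to ask several questions, a small helper keeps the prompt construction in one place. `format_prompt` below is a hypothetical convenience function, not part of the Dragonfly API; it simply reproduces the Llama 3.1 chat template used above:

```python
def format_prompt(question: str) -> str:
    # Wrap a user question in the Llama 3.1 chat format expected by the model.
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

text_prompt = format_prompt("What is so funny about this image?")
```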
|
|
|
Finally, let's generate the response from the model:
|
```python |
|
temperature = 0 |
|
|
|
with torch.inference_mode(): |
|
    generation_output = model.generate(
        **inputs,
        max_new_tokens=1024,
        eos_token_id=tokenizer.encode("<|eot_id|>"),
        do_sample=temperature > 0,  # greedy decoding when temperature is 0
        temperature=temperature,
        use_cache=True,
    )
|
|
|
generation_text = processor.batch_decode(generation_output, skip_special_tokens=False) |
|
``` |
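
Because `skip_special_tokens=False` is used, the decoded string still contains the prompt and the special tokens. A minimal post-processing sketch, assuming the generated sequence echoes the input prompt as in standard decoder-only generation:

```python
# Assumption: the decoded text looks like "<prompt><assistant header><reply><|eot_id|>",
# so keep only the text after the assistant header and drop the end token.
response = generation_text[0].split("<|start_header_id|>assistant<|end_header_id|>\n\n")[-1]
response = response.replace("<|eot_id|>", "").strip()
print(response)
```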
|
|
|
An example response. |
|
```plaintext |
|
The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci. |
|
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with |
|
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer's expectations and familiarity with the

original painting. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a

humorous effect that is likely to elicit laughter<|eot_id|>
|
``` |
|
|
|
## Training Details |
|
|
|
See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977). |
|
|
|
## Evaluation |
|
|
|
See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977). |
|
|
|
## Credits
|
|
|
We would like to acknowledge the following resources that were instrumental in the development of Dragonfly: |
|
|
|
- [Meta Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct): We utilized Llama 3.1 as our foundational language model.
|
- [CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336): Our vision backbone is OpenAI's CLIP model.
|
- Our codebase is built upon the following two codebases: |
|
- [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter) |
|
- [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD) |
|
|
|
## BibTeX
|
|
|
```bibtex |
|
@misc{thapa2024dragonfly, |
|
title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, |
|
author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou}, |
|
year={2024}, |
|
eprint={2406.00977}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV} |
|
} |
|
``` |
|
|
|
## Model Card Authors |
|
Rahul Thapa, Kezhen Chen, Rahul Chalamala |
|
|
|
## Model Card Contact |
|
Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai) |