|
--- |
|
license: llama3.1 |
|
language: |
|
- en |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- text-generation-inference |
|
--- |
|
|
|
# Dragonfly Model Card |
|
|
|
**Note: Users are permitted to use this model in accordance with the Llama 3.1 Community License Agreement.** |
|
|
|
## Model Details |
|
|
|
Dragonfly is a multimodal visual-language model, trained by instruction tuning on top of Llama 3.1.
|
|
|
- **Developed by:** [Together AI](https://www.together.ai/) |
|
- **Model type:** An autoregressive visual-language model based on the transformer architecture |
|
- **License:** [Llama 3.1 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE) |
|
- **Finetuned from model:** [Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/togethercomputer/Dragonfly |
|
- **Paper:** https://arxiv.org/abs/2406.00977 |
|
|
|
## Uses |
|
|
|
The primary use of Dragonfly is research on large visual-language models.

Its intended users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
### Installation
|
|
|
Create a conda environment and install the necessary packages:
|
```bash |
|
conda env create -f environment.yml |
|
conda activate dragonfly_env |
|
``` |
|
|
|
Install FlashAttention:
|
```bash |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
Finally, install the Dragonfly package in editable mode:
|
```bash |
|
pip install --upgrade -e . |
|
``` |
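
You can optionally verify the setup with a quick sanity check. This one-liner reuses the imports from the inference example below and assumes the package installed under the `dragonfly` name on a machine with a visible CUDA device:

```bash
python -c "import torch; from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM; print('CUDA available:', torch.cuda.is_available())"
```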
|
|
|
### Inference
|
|
|
Once the installation completes successfully, you can follow the steps below.
|
|
|
Question: What is so funny about this image? |
|
|
|
![Monalisa Dog](monalisa_dog.jpg) |
|
|
|
Load the necessary packages:
|
```python |
|
import torch |
|
from PIL import Image |
|
from transformers import AutoProcessor, AutoTokenizer |
|
|
|
from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM |
|
from dragonfly.models.processing_dragonfly import DragonflyProcessor |
|
from pipeline.train.train_utils import random_seed |
|
``` |
|
|
|
Instantiate the tokenizer, processor, and model. |
|
```python |
|
device = torch.device("cuda:0") |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2") |
|
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336") |
|
image_processor = clip_processor.image_processor |
|
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd") |
|
|
|
model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2") |
|
model = model.to(torch.bfloat16) |
|
model = model.to(device) |
|
``` |
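
If you prefer to avoid the intermediate full-precision copy, you can usually pass the dtype at load time. This is a minimal sketch assuming `DragonflyForCausalLM` supports the standard `transformers` `from_pretrained` keyword arguments:

```python
# Sketch (assumption): load the weights directly in bfloat16 instead of
# casting after loading, relying on the standard `torch_dtype` keyword.
model = DragonflyForCausalLM.from_pretrained(
    "togethercomputer/Llama-3.1-8B-Dragonfly-v2",
    torch_dtype=torch.bfloat16,
).to(device)
```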
|
|
|
Now, let's load and process the image.
|
```python |
|
image = Image.open("./monalisa_dog.jpg")  # the Mona Lisa dog image shown above
|
image = image.convert("RGB") |
|
images = [image] |
|
# images = [None] # if you do not want to pass any images |
|
|
|
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" |
|
|
|
inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True) |
|
inputs = inputs.to(device) |
|
``` |
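
If you plan to ask several questions, a small helper keeps the prompt construction in one place. `format_prompt` below is a hypothetical convenience function, not part of the Dragonfly API; it simply reproduces the Llama 3.1 chat template used above:

```python
def format_prompt(question: str) -> str:
    # Wrap a user question in the Llama 3.1 chat format expected by the model.
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

text_prompt = format_prompt("What is so funny about this image?")
```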
|
|
|
Finally, let's generate the response from the model:
|
```python |
|
temperature = 0 |
|
|
|
with torch.inference_mode(): |
|
    generation_output = model.generate(
        **inputs,
        max_new_tokens=1024,
        eos_token_id=tokenizer.encode("<|eot_id|>"),
        do_sample=temperature > 0,  # greedy decoding when temperature is 0
        temperature=temperature,
        use_cache=True,
    )
|
|
|
generation_text = processor.batch_decode(generation_output, skip_special_tokens=False) |
|
``` |
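
Because `skip_special_tokens=False` is used, the decoded string still contains the prompt and the special tokens. A minimal post-processing sketch, assuming the generated sequence echoes the input prompt as in standard decoder-only generation:

```python
# Assumption: the decoded text looks like "<prompt><assistant header><reply><|eot_id|>",
# so keep only the text after the assistant header and drop the end token.
response = generation_text[0].split("<|start_header_id|>assistant<|end_header_id|>\n\n")[-1]
response = response.replace("<|eot_id|>", "").strip()
print(response)
```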
|
|
|
An example response. |
|
```plaintext |
|
The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci. |
|
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with |
|
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer's expectations and familiarity with the

original painting. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a

humorous effect that is likely to elicit laughter<|eot_id|>
|
``` |
|
|
|
## Training Details |
|
|
|
See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977). |
|
|
|
## Evaluation |
|
|
|
See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977). |
|
|
|
## Credits
|
|
|
We would like to acknowledge the following resources that were instrumental in the development of Dragonfly: |
|
|
|
- [Meta Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct): We utilized Llama 3.1 as our foundational language model.
|
- [CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336): Our vision backbone is OpenAI's CLIP model.
|
- Our codebase is built upon the following two codebases: |
|
- [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter) |
|
- [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD) |
|
|
|
## BibTeX
|
|
|
```bibtex |
|
@misc{thapa2024dragonfly, |
|
title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, |
|
author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou}, |
|
year={2024}, |
|
eprint={2406.00977}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV} |
|
} |
|
``` |
|
|
|
## Model Card Authors |
|
Rahul Thapa, Kezhen Chen, Rahul Chalamala |
|
|
|
## Model Card Contact |
|
Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai) |