---
language:
- en
---

# Emu2-Chat

[Paper](https://arxiv.org/abs/2312.13286) | [🤗HF Demo](https://huggingface.co/spaces/BAAI/Emu2) | [Demo](https://emu.ssi.plus) | [Project Page](https://baaivision.github.io/emu2/) | [Github](https://github.com/baaivision/Emu)

## Model Weights

| Model name    | Weight                                              |
| ------------- | --------------------------------------------------- |
| **Emu2**      | [🤗 HF link](https://huggingface.co/BAAI/Emu2)      |
| **Emu2-Chat** | [🤗 HF link](https://huggingface.co/BAAI/Emu2-Chat) |
| **Emu2-Gen**  | [🤗 HF link](https://huggingface.co/BAAI/Emu2-Gen)  |

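The multi-GPU example below loads the checkpoint from a local directory. One way to obtain a local copy of any of the weights above is `snapshot_download` from `huggingface_hub`; a minimal sketch (the target directory `./Emu2-Chat` is just an example):

```python
from huggingface_hub import snapshot_download

# Download the full Emu2-Chat checkpoint into a local directory.
# The returned path can be passed to `load_checkpoint_and_dispatch`
# in the multi-GPU example below.
local_path = snapshot_download(repo_id="BAAI/Emu2-Chat", local_dir="./Emu2-Chat")
print(local_path)
```
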
## Inference (Hugging Face Version)

#### Single GPU

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to('cuda').eval()

# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings.
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = '[<IMG_PLH>]Describe the image in details:'
image = Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true', stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
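`batch_decode` returns one decoded string per query in the batch; with the single query above, the response can be inspected directly (a minimal follow-up, assuming the code above ran without error):

```python
# There is one query in the batch, so the response is the first (and only) entry.
print(output_text[0])
```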

Interleaved image and text: the query below describes the first three images and ends with a bare `[<IMG_PLH>]`, so the model is asked to complete the same pattern for the last image.

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to('cuda').eval()

# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings.
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = "[<IMG_PLH>][red, white, 3, bottom left].[<IMG_PLH>][yellow, white, 2, top left].[<IMG_PLH>][green, black, 4, bottom right][<IMG_PLH>]"

images = [
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/red_white_3_bottom_left.jpg?raw=true', stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/yellow_white_2_top_right.jpg?raw=true', stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/green_black_4_bottom_right.jpg?raw=true', stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true', stream=True).raw).convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
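The number of `[<IMG_PLH>]` placeholders must match the number of images passed to `build_input_ids`. A cheap guard before building the inputs can catch mismatches early (a minimal sketch, not part of the original example):

```python
# Fail fast if the query and the image list are out of sync.
n_placeholders = query.count("[<IMG_PLH>]")
assert n_placeholders == len(images), (
    f"query contains {n_placeholders} image placeholders "
    f"but {len(images)} images were provided"
)
```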

#### Multi GPU

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch


tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2-Chat",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(model, max_memory={0: '38GiB', 1: '38GiB'}, no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# input and output logits should be on the same device
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/Emu2-Chat/model',
    device_map=device_map).eval()

# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings.
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = '[<IMG_PLH>]Describe the image in details:'
image = Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true', stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
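To sanity-check how the model was split across the two GPUs, the mapping returned by `infer_auto_device_map` (a plain dict from module names to devices) can simply be printed; a minimal sketch:

```python
# `device_map` maps module names to device indices (or "cpu"/"disk");
# inspecting it helps confirm no large module unexpectedly landed on CPU.
for module_name, device in device_map.items():
    print(f"{module_name} -> {device}")
```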

Interleaved image and text (the same pattern as above, with the model dispatched across GPUs):

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch


tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "BAAI/Emu2-Chat",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(model, max_memory={0: '38GiB', 1: '38GiB'}, no_split_module_classes=['Block', 'LlamaDecoderLayer'])
# input and output logits should be on the same device
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/Emu2-Chat/model',
    device_map=device_map).eval()

# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings.
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = "[<IMG_PLH>][red, white, 3, bottom left].[<IMG_PLH>][yellow, white, 2, top left].[<IMG_PLH>][green, black, 4, bottom right][<IMG_PLH>]"

images = [
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/red_white_3_bottom_left.jpg?raw=true', stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/yellow_white_2_top_right.jpg?raw=true', stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/green_black_4_bottom_right.jpg?raw=true', stream=True).raw).convert('RGB'),
    Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true', stream=True).raw).convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

#### Quantization

See the quantization guidance in the [transformers documentation](https://huggingface.co/docs/transformers/v4.28.0/main_classes/quantization).

```python
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu2-Chat")

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu2-Chat",
    load_in_4bit=True,
    trust_remote_code=True,
    bnb_4bit_compute_dtype=torch.float16).eval()

query = '[<IMG_PLH>]Describe the image in details:'
image = Image.open(requests.get('https://github.com/baaivision/Emu/Emu2/examples/blue_black_1_top_left.jpg?raw=true', stream=True).raw).convert('RGB')

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=[image]
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.float16),  # should be torch.float16
        max_new_tokens=64,
        length_penalty=-1)

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
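To confirm the memory savings from 4-bit loading, the model's footprint can be queried with `get_memory_footprint()` (available on transformers models; the exact figure depends on your transformers/bitsandbytes versions):

```python
# Report the parameter/buffer memory footprint of the quantized model in GiB.
footprint_gib = model.get_memory_footprint() / 1024 ** 3
print(f"Model memory footprint: {footprint_gib:.1f} GiB")
```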

## Citation

If you find Emu2 useful for your research and applications, please consider starring this repository and citing:

```
@article{Emu2,
  title={Generative Multimodal Models are In-Context Learners},
  author={Quan Sun and Yufeng Cui and Xiaosong Zhang and Fan Zhang and Qiying Yu and Zhengxiong Luo and Yueze Wang and Yongming Rao and Jingjing Liu and Tiejun Huang and Xinlong Wang},
  journal={arXiv preprint arXiv:2312.13286},
  year={2023}
}
```