---
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- mistral
- inferentia2
- neuron
- neuronx
license: apache-2.0
---
# Neuronx for [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) - Updated Mistral 7B Model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) Using AWS Neuron SDK version 2.18
This model has been exported to the `neuron` format using specific `input_shapes` and `compiler` parameters detailed in the paragraphs below.
Please refer to the 🤗 `optimum-neuron` [documentation](https://huggingface.co/docs/optimum-neuron/main/en/guides/models#configuring-the-export-of-a-generative-model) for an explanation of these parameters.
Note: to compile mistralai/Mistral-7B-Instruct-v0.2 on Inf2, you need to update the model config's `sliding_window` value (either in the config file or on the model object in memory) from `null` back to the default of 4096.
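A minimal sketch of that change, assuming you patch and save a local copy of the config before export (the local path is hypothetical):

```python
from transformers import AutoConfig

# sliding_window is null in recent revisions of the upstream config
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
config.sliding_window = 4096  # restore the default expected by the Neuron compiler
config.save_pretrained("./mistral-7b-instruct-v0.2-patched")  # hypothetical local path
```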
## Usage with 🤗 `TGI`
Refer to the [neuronx-tgi](https://gallery.ecr.aws/shtian/neuronx-tgi) container image on the Amazon ECR Public Gallery. The model is compiled for 24 Neuron cores, so the example below exposes all 12 Neuron devices (24 cores) of an inf2.48xlarge instance.
```shell
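# Hugging Face access token, used to authenticate downloads from the Hub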
export HF_TOKEN="hf_xxx"
docker run -d -p 8080:80 \
--name mistral-7b-neuronx-tgi \
-v $(pwd)/data:/data \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
-e HF_TOKEN=${HF_TOKEN} \
public.ecr.aws/shtian/neuronx-tgi:latest \
--model-id davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18 \
--max-batch-size 1 \
--max-input-length 16 \
--max-total-tokens 32
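
# Once the server reports ready, send a test request: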
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"Who are you?","parameters":{"max_new_tokens":16}}' \
-H 'Content-Type: application/json'
```
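The same `/generate` endpoint can also be called from Python. A small sketch using the `requests` library, mirroring the `curl` call above:

```python
import requests

# Query the TGI server started above; the payload mirrors the curl example
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "Who are you?", "parameters": {"max_new_tokens": 16}},
    headers={"Content-Type": "application/json"},
)
print(response.json()["generated_text"])
```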
## Usage with 🤗 `optimum-neuron pipeline`
```python
from optimum.neuron import pipeline
p = pipeline('text-generation', 'davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18')
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)
[{'generated_text': "My favorite place on earth is probably Paris, France, and if I were to go there now I would take my partner on a romantic getaway where we could lay on the grass in the park, eat delicious French cheeses and wine, and watch the sunset on the Seine river."}]
```
## Usage with 🤗 `optimum-neuron NeuronModelForCausalLM`
```python
import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM
model = NeuronModelForCausalLM.from_pretrained("davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id  # Mistral's tokenizer defines no pad token

def model_sample(input_prompt):
    # Wrap the prompt in Mistral's [INST] instruction format
    input_prompt = "[INST] " + input_prompt + " [/INST]"
    tokens = tokenizer(input_prompt, return_tensors="pt")
    with torch.inference_mode():
        sample_output = model.generate(
            **tokens,
            do_sample=True,
            min_length=16,
            max_length=32,
            temperature=0.5,
            pad_token_id=tokenizer.eos_token_id,
        )
    outputs = [tokenizer.decode(tok, skip_special_tokens=True) for tok in sample_output]
    # Keep only the completion after the instruction tag
    res = outputs[0].split('[/INST]')[1].strip("</s>").strip()
    return res + "\n"

print(model_sample("how are you today?"))
```
This repository contains tags specific to versions of `neuronx`. When loading with 🤗 `optimum-neuron`, pass the repo revision that matches the `neuronx` version you are using, so that the right serialized checkpoints are loaded.
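For example, to pin a revision (the tag name `2.18` below is hypothetical; check the repository's tags for the one matching your installed `neuronx` version):

```python
from optimum.neuron import NeuronModelForCausalLM

# "2.18" is a hypothetical tag name; list the repo's tags to find the right one
model = NeuronModelForCausalLM.from_pretrained(
    "davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18",
    revision="2.18",
)
```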
## Arguments passed during export
**input_shapes**
```json
{
  "batch_size": 1,
  "sequence_length": 2048
}
```
**compiler_args**
```json
{
  "auto_cast_type": "bf16",
  "num_cores": 24
}
```
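For reference, a sketch of how an equivalent export could be reproduced with `optimum-neuron`, assuming an instance with 24 Neuron cores and the patched `sliding_window` config noted above (the output directory is hypothetical):

```python
from optimum.neuron import NeuronModelForCausalLM

# Export parameters mirror the input_shapes and compiler_args listed above
model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=24,
    auto_cast_type="bf16",
)
model.save_pretrained("./mistral-7b-neuron")  # hypothetical output directory
```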