---
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- mistral
- inferentia2
- neuron
- neuronx
license: apache-2.0
---
# Neuronx for mistralai/Mistral-7B-Instruct-v0.2 - Updated Mistral 7B Model on AWS Inferentia2 Using AWS Neuron SDK version 2.18
This model has been exported to the Neuron format using the specific `input_shapes` and compiler parameters detailed in the sections below. Please refer to the 🤗 optimum-neuron documentation for an explanation of these parameters.
Note: to compile mistralai/Mistral-7B-Instruct-v0.2 on Inf2, you need to update the `sliding_window` value in the model config (either in the config file or on the loaded config object) from `null` to the default of 4096.
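If you compile the model yourself, one way to apply this change is to download the checkpoint to a local folder and patch `config.json` before export. A minimal sketch, assuming a local working directory (the folder name is illustrative):

```python
import json
from huggingface_hub import snapshot_download

# Download the model to a local folder (illustrative path), then patch config.json
local_dir = snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="mistral-7b-instruct-v0.2",
)
config_path = f"{local_dir}/config.json"
with open(config_path) as f:
    config = json.load(f)
config["sliding_window"] = 4096  # replaces the released null value
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```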
## Usage with 🤗 TGI
Refer to the container image in the neuronx-tgi Amazon ECR Public Gallery.
```shell
export HF_TOKEN="hf_xxx"

docker run -d -p 8080:80 \
  --name mistral-7b-neuronx-tgi \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --device=/dev/neuron6 \
  --device=/dev/neuron7 \
  --device=/dev/neuron8 \
  --device=/dev/neuron9 \
  --device=/dev/neuron10 \
  --device=/dev/neuron11 \
  -e HF_TOKEN=${HF_TOKEN} \
  public.ecr.aws/shtian/neuronx-tgi:latest \
  --model-id davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18 \
  --max-batch-size 1 \
  --max-input-length 16 \
  --max-total-tokens 32
```
```shell
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"Who are you?","parameters":{"max_new_tokens":16}}' \
  -H 'Content-Type: application/json'
```
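The same endpoint can also be queried from Python. A minimal sketch using the `InferenceClient` from `huggingface_hub` (the URL matches the port mapping above):

```python
from huggingface_hub import InferenceClient

# Point the client at the locally running TGI container
client = InferenceClient("http://127.0.0.1:8080")

# text_generation mirrors the /generate payload used in the curl example
print(client.text_generation("Who are you?", max_new_tokens=16))
```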
## Usage with 🤗 optimum-neuron pipeline
```python
from optimum.neuron import pipeline

p = pipeline('text-generation', 'davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18')
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)
```

Example output:

```text
[{'generated_text': "My favorite place on earth is probably Paris, France, and if I were to go there now I would take my partner on a romantic getaway where we could lay on the grass in the park, eat delicious French cheeses and wine, and watch the sunset on the Seine river.'"}]
```
## Usage with 🤗 optimum-neuron NeuronModelForCausalLM
```python
import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id

def model_sample(input_prompt):
    # Wrap the prompt in the Mistral instruction format
    input_prompt = "[INST] " + input_prompt + " [/INST]"
    tokens = tokenizer(input_prompt, return_tensors="pt")

    with torch.inference_mode():
        sample_output = model.generate(
            **tokens,
            do_sample=True,
            min_length=16,
            max_length=32,
            temperature=0.5,
            pad_token_id=tokenizer.eos_token_id,
        )

    outputs = [tokenizer.decode(tok, skip_special_tokens=True) for tok in sample_output]
    # Keep only the answer after the [/INST] marker and strip any leftover EOS text
    res = outputs[0].split('[/INST]')[1].strip("</s>").strip()
    return res + "\n"

print(model_sample("how are you today?"))
```
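Instead of hand-building the `[INST] ... [/INST]` wrapper, the tokenizer's built-in chat template produces the same instruction format. A minimal sketch, reusing the `tokenizer` from above:

```python
# Build the prompt with the tokenizer's chat template instead of manual [INST] tags
messages = [{"role": "user", "content": "how are you today?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
# add_special_tokens=False avoids duplicating the BOS token the template already adds
tokens = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
```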
This repository contains tags specific to versions of neuronx. When using it with 🤗 optimum-neuron, use the repo revision specific to the version of neuronx you are using, in order to load the right serialized checkpoints.
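A minimal sketch of pinning a revision; the tag name below is hypothetical and should be replaced with one of the tags actually published in this repository:

```python
from optimum.neuron import NeuronModelForCausalLM

# "neuronx-2.18" is a hypothetical tag name; check the repository's tags
model = NeuronModelForCausalLM.from_pretrained(
    "davidshtian/Mistral-7B-Instruct-v0.2-neuron-1x2048-24-cores-2.18",
    revision="neuronx-2.18",
)
```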
## Arguments passed during export
**input_shapes**

```json
{
  "batch_size": 1,
  "sequence_length": 2048
}
```
**compiler_args**

```json
{
  "auto_cast_type": "bf16",
  "num_cores": 24
}
```
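For reference, a minimal sketch of reproducing the export with these arguments via 🤗 optimum-neuron (apply the `sliding_window` fix noted above first; the output path is illustrative):

```python
from optimum.neuron import NeuronModelForCausalLM

# Re-export with the same input shapes and compiler arguments as this repository
model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=24,
    auto_cast_type="bf16",
)
model.save_pretrained("mistral-7b-instruct-v0.2-neuron")  # illustrative output path
```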