---
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- mistral
- inferentia2
- neuron
- neuronx
license: apache-2.0
---
# Neuronx for [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) - Updated Mistral 7B Model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) Using AWS Neuron SDK version 2.18~

This model has been exported to the `neuron` format using the specific `input_shapes` and `compiler` parameters detailed in the sections below.

Please refer to the 🤗 `optimum-neuron` [documentation](https://huggingface.co/docs/optimum-neuron/main/en/guides/models#configuring-the-export-of-a-generative-model) for an explanation of these parameters.

Note: To compile mistralai/Mistral-7B-Instruct-v0.2 on Inf2, you need to change `sliding_window` in the model config (either in the config file or on the loaded model's config) from `null` to the default value of 4096.

## Usage with 🤗 `TGI`

Refer to the container image in the [neuronx-tgi](https://gallery.ecr.aws/shtian/neuronx-tgi) repository on the Amazon ECR Public Gallery.

```shell
export HF_TOKEN="hf_xxx"

docker run -d -p 8080:80 \
    --name mistral-7b-neuronx-tgi \
    -v $(pwd)/data:/data \
    --device=/dev/neuron0 \
    --device=/dev/neuron1 \
    --device=/dev/neuron2 \
    --device=/dev/neuron3 \
    --device=/dev/neuron4 \
    --device=/dev/neuron5 \
    --device=/dev/neuron6 \
    --device=/dev/neuron7 \
    --device=/dev/neuron8 \
    --device=/dev/neuron9 \
    --device=/dev/neuron10 \
    --device=/dev/neuron11 \
    -e HF_TOKEN=${HF_TOKEN} \
    public.ecr.aws/shtian/neuronx-tgi:latest \
    --model-id davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18 \
    --max-batch-size 4 \
    --max-input-length 16 \
    --max-total-tokens 32
```

The server does not appear to support sending a list of prompts in a single request (see this [GitHub issue](https://github.com/huggingface/text-generation-inference/issues/1008)), so the client below sends the prompts as concurrent requests instead.

```python
import concurrent.futures

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

batch_text = ["1+1=", "2+2=", "3+3=", "4+4="]
bs = 4  # matches --max-batch-size on the server


def format_text_list(text_list):
    # Wrap each prompt in the Mistral instruction markers.
    return ['[INST] ' + text + ' [/INST]' for text in text_list]


def gen_text(text):
    return client.text_generation(text, max_new_tokens=16)


# One worker per prompt, so the server can batch the concurrent requests.
with concurrent.futures.ThreadPoolExecutor(max_workers=bs) as executor:
    out = list(executor.map(gen_text, format_text_list(batch_text)))

print(out)
```

## Usage with 🤗 `optimum-neuron pipeline`

```python
from optimum.neuron import pipeline

p = pipeline('text-generation', 'davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18')
p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50)

# Example output (sampling is enabled, so the text will vary):
# [{'generated_text': "My favorite place on earth is probably Paris, France, and if I were to go there now I would take my partner on a romantic getaway where we could lay on the grass in the park, eat delicious French cheeses and wine, and watch the sunset on the Seine river.'"}]
```

## Usage with 🤗 `optimum-neuron NeuronModelForCausalLM`

```python
import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("davidshtian/Mistral-7B-Instruct-v0.2-neuron-4x2048-24-cores-2.18")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token_id = tokenizer.eos_token_id


def model_sample(input_prompt):
    # Wrap the prompt in the Mistral instruction markers.
    input_prompt = "[INST] " + input_prompt + " [/INST]"

    tokens = tokenizer(input_prompt, return_tensors="pt")

    with torch.inference_mode():
        sample_output = model.generate(
            **tokens,
            do_sample=True,
            min_length=16,
            max_length=32,
            temperature=0.5,
            pad_token_id=tokenizer.eos_token_id
        )
        outputs = [tokenizer.decode(tok, skip_special_tokens=True) for tok in sample_output]

    # Keep only the text generated after the instruction block.
    res = outputs[0].split('[/INST]')[1].strip()
    return res + "\n"


print(model_sample("how are you today?"))
```

This repository
contains tags specific to versions of `neuronx`. When using this model with 🤗 `optimum-neuron`, load the repo revision that matches the version of `neuronx` you are running, so that the right serialized checkpoints are used.

## Arguments passed during export

**input_shapes**

```json
{
  "batch_size": 4,
  "sequence_length": 2048
}
```

**compiler_args**

```json
{
  "auto_cast_type": "bf16",
  "num_cores": 24
}
```
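
As a rough illustration of how the arguments above map onto an export call, the sketch below merges them into keyword arguments for `NeuronModelForCausalLM.from_pretrained(..., export=True)`, the pattern shown in the `optimum-neuron` documentation. The helper names `build_export_kwargs` and `export_model` are hypothetical and not part of this repository, and this is not necessarily the exact command used to produce the checkpoints here.

```python
# Values copied from the "Arguments passed during export" section.
input_shapes = {"batch_size": 4, "sequence_length": 2048}
compiler_args = {"auto_cast_type": "bf16", "num_cores": 24}


def build_export_kwargs(input_shapes, compiler_args):
    """Merge input shapes and compiler args into export keyword arguments.

    Hypothetical helper for illustration only.
    """
    overlap = set(input_shapes) & set(compiler_args)
    if overlap:
        raise ValueError(f"duplicate export arguments: {sorted(overlap)}")
    return {"export": True, **input_shapes, **compiler_args}


def export_model(model_id, output_dir):
    """Compile `model_id` for Neuron and save the result to `output_dir`.

    Hypothetical helper; it needs optimum-neuron and an Inf2 instance
    with enough Neuron cores (24 for the arguments above).
    """
    # Imported lazily so build_export_kwargs works without optimum-neuron.
    from optimum.neuron import NeuronModelForCausalLM

    kwargs = build_export_kwargs(input_shapes, compiler_args)
    model = NeuronModelForCausalLM.from_pretrained(model_id, **kwargs)
    model.save_pretrained(output_dir)
    return model


print(build_export_kwargs(input_shapes, compiler_args))
```

On a suitable instance, a call such as `export_model("mistralai/Mistral-7B-Instruct-v0.2", "./mistral-neuron")` would start the compilation (the output directory is a placeholder). The `sliding_window` change described in the note at the top of this card would still need to be applied first.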