---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
inference: false
tags:
  - pytorch
  - inferentia2
  - neuron
---

Neuronx model for upstage/SOLAR-10.7B-v1.0

This repository contains AWS Inferentia2 and neuronx compatible checkpoints for upstage/SOLAR-10.7B-v1.0. You can find detailed information about the base model on its Model Card.

This model card also includes instructions for compiling other SOLAR models with different settings, in case this combination isn't what you are looking for.

This model has been exported to the neuron format using specific input_shapes and compiler parameters detailed in the paragraphs below.

It has been compiled to run on an inf2.24xlarge instance on AWS. Note that while the inf2.24xlarge has 12 cores, this compilation uses 8; this model and configuration seem to require that the number of cores be a power of 2.

This model has been compiled using version 2.16 of the Neuron SDK. Make sure your environment has version 2.16 installed. Better yet, compile it yourself using the latest SDK.

Please refer to the 🤗 optimum-neuron documentation for an explanation of these parameters.

Set up the environment

First, use the DLAMI image from Hugging Face. It has most of the utilities and drivers preinstalled, but as of 1/13/24 it had not yet been updated to 2.16.

(As of the 20240123 version, the Hugging Face DLAMI image includes the updated 2.16 binaries.)

However, you will need version 2.16 to use these binaries; 2.16 shows a significant performance increase over 2.15 for Llama-based models.

The commands below will update your 2.15 libraries to 2.16.

<update commands removed>
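
Separately, you can confirm from Python which Neuron compiler version your environment actually has. The snippet below is a minimal sketch, assuming the neuronx-cc package is the one you want to inspect; note that the neuronx-cc package version is not the same number as the SDK release (e.g. "2.16"), so map it using the Neuron SDK release notes.

from importlib.metadata import version, PackageNotFoundError

# Print the installed Neuron compiler version, if any.
# The package version does not match the SDK release number; see the Neuron SDK
# release notes to map the installed neuronx-cc version to an SDK release.
try:
    print("neuronx-cc:", version("neuronx-cc"))
except PackageNotFoundError:
    print("neuronx-cc is not installed in this environment")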

Running inference from this repository

from optimum.neuron import pipeline
p = pipeline('text-generation', 'aws-neuron/SOLAR-10.7B-v1.0-neuron')
p("Hi, my name is ",
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    max_length=200,
)

sample output:

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
2024-Jan-13 04:48:45.0857 15117:15313 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Jan-13 04:48:45.0857 15117:15313 [6] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
[{'generated_text': 'Hi, my name is ***** ***** I am calling from ***** ***** and I am calling to see if you have any questions about your ***** ***** account.\nHi, my name is ***** ***** I am calling from ***** ***** and I am calling to see if you have any questions about your ***** ***** account.\nHi, my name is ***** ***** I am calling from ***** ***** and I am calling to see if you have any questions about your ***** ***** account.\nHi, my name is ***** ***** I am calling from ***** ***** and I am calling to see if you have any questions about your ***** ***** account.\nHi, my name is ***** ***** I am calling from ***** ***** and I am calling to see if'}]

Compiling for different instances or settings

If this repository doesn't have the exact version or settings you are looking for, you can compile your own.

from optimum.neuron import NeuronModelForCausalLM
# num_cores should be set based on the instance: an inf2.24xlarge has 6 Neuron processors (two cores each), so 12 cores total
input_shapes = {"batch_size": 1, "sequence_length": 4096}
compiler_args = {"num_cores": 8, "auto_cast_type": 'fp16'}
model = NeuronModelForCausalLM.from_pretrained("upstage/SOLAR-10.7B-v1.0", export=True, **compiler_args, **input_shapes)
model.save_pretrained("SOLAR-10.7B-v1.0-neuron-24xlarge-2.16-8core-4096")

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
tokenizer.save_pretrained("SOLAR-10.7B-v1.0-neuron-24xlarge-2.16-8core-4096")
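
Once saved, the compiled checkpoints can be loaded back from the local directory in the same way as the checkpoints in this repository. A minimal sketch, assuming the directory name used in the save_pretrained calls above:

from optimum.neuron import pipeline

# Load the locally exported model and tokenizer from the directory created above
p = pipeline('text-generation', 'SOLAR-10.7B-v1.0-neuron-24xlarge-2.16-8core-4096')
p("Hi, my name is ", do_sample=True, max_length=200)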

This repository contains tags specific to versions of neuronx. When using this model with 🤗 optimum-neuron, pass the repo revision that matches the version of neuronx you are running so that the right serialized checkpoints are loaded.
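
For example, a minimal sketch of pinning a revision when loading (the revision name below is purely illustrative, not an actual tag in this repository; check the repository's tags for the one matching your neuronx version):

from optimum.neuron import NeuronModelForCausalLM

# "2.16" is a hypothetical tag name used only to illustrate the revision argument
model = NeuronModelForCausalLM.from_pretrained(
    "aws-neuron/SOLAR-10.7B-v1.0-neuron",
    revision="2.16",
)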

Arguments passed during export

input_shapes

{
  "batch_size": 1,
  "sequence_length": 4096,
}

compiler_args

{
  "auto_cast_type": "fp16",
  "num_cores": 8,
}