---
license: apache-2.0
---

**Model Name: Qwen2 orca_mini_v7_7b-AWQ**

orca_mini_v7_7b-AWQ is the AWQ-quantized version of the orca_mini_v7_7b model.

"Obsessed with GenAI's potential? So am I! Let's create together 🚀 https://www.linkedin.com/in/pankajam"
### Example Usage

Here is the ChatML prompt format:

```
<|im_start|>system
You are Orca Mini, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hello Orca Mini, what can you do for me?<|im_end|>
<|im_start|>assistant
```

Below is a code example showing how to use this model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_slug = "pankajmathur/orca_mini_v7_7b-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_slug)
tokenizer = AutoTokenizer.from_pretrained(model_slug)
messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"},
]
# apply_chat_template builds the ChatML prompt shown above;
# add_generation_prompt=True appends the trailing "<|im_start|>assistant" turn
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(gen_input, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Processing Long Texts

(Based upon the Qwen2-7B-Instruct suggestions at https://huggingface.co/Qwen/Qwen2-7B-Instruct)

To handle extensive inputs exceeding 32,768 tokens, we utilize [YARN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

1. **Install vLLM**: You can install vLLM by running the following command.

    ```bash
    pip install "vllm>=0.4.3"
    ```

    Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).

2. **Configure Model Settings**: After downloading the model weights, modify the `config.json` file by including the snippet below:

    ```json
    {
        "architectures": [
            "Qwen2ForCausalLM"
        ],
        // ...
        "vocab_size": 152064,

        // adding the following snippet
        "rope_scaling": {
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
            "type": "yarn"
        }
    }
    ```

    This snippet enables YARN to support longer contexts.

3. **Model Deployment**: Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-like server using the command:

    ```bash
    python -u -m vllm.entrypoints.openai.api_server --model pankajmathur/orca_mini_v7_7b-AWQ --quantization awq
    ```

    Then you can access the Chat API with:

    ```bash
    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "pankajmathur/orca_mini_v7_7b-AWQ",
        "messages": [
            {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
            {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
        ]
        }'
    ```

**Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
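As an alternative to the `curl` call above, here is a minimal sketch that queries the same server with the official `openai` Python client. This assumes the vLLM server from step 3 is running on `localhost:8000`; vLLM does not validate the API key by default, so `"EMPTY"` below is just a placeholder.

```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
# "EMPTY" is a placeholder; vLLM ignores the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pankajmathur/orca_mini_v7_7b-AWQ",
    messages=[
        {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
        {"role": "user", "content": "Hello Orca Mini, what can you do for me?"},
    ],
)
print(response.choices[0].message.content)
```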