---
license: llama2
datasets:
- wasertech/OneOS
language:
- en
- fr
pipeline_tag: text-generation
widget:
- text: "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n<s>[INST] Introduce yourself to the HuggingFace community. [/INST] "
  example_title: "Introduction"
- text: "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n<s>[INST] Describe your model. [/INST] "
  example_title: "Model Description"
- text: "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n<s>[INST] What is the meaning of life? [/INST] "
  example_title: "Life's Meaning"
- text: "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n<s>[INST] What recent innovations in the field of AI are you excited by? [/INST] "
  example_title: "What's next?"
---

# Assistant Llama 2 7B Chat AWQ

This model is a quantized export of [wasertech/assistant-llama2-7b-chat](https://huggingface.co/wasertech/assistant-llama2-7b-chat) using AWQ.
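For local inference, the model can be loaded with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). The sketch below is a minimal example: the repository id is assumed from this card's title, and the prompt and generation settings are illustrative rather than recommended values.

```python
# Minimal AutoAWQ inference sketch (pip install autoawq).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "wasertech/assistant-llama2-7b-chat-awq"  # assumed repo id

# fuse_layers=True enables fused kernels for faster generation.
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = (
    "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n"
    "<s>[INST] Introduce yourself. [/INST] "
)
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```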
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
It is also supported by vLLM, a continuous-batching inference server, which allows Llama AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios.
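A minimal vLLM sketch, assuming the same repository id as above; `quantization="awq"` tells vLLM to load the AWQ weights:

```python
# Offline batched inference with vLLM (pip install vllm).
from vllm import LLM, SamplingParams

llm = LLM(model="wasertech/assistant-llama2-7b-chat-awq", quantization="awq")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompt = (
    "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n"
    "<s>[INST] Introduce yourself. [/INST] "
)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```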
As of September 25th, 2023, preliminary Llama-only AWQ support has also been added to Hugging Face Text Generation Inference (TGI).
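Once a TGI server is running with AWQ enabled (via its `--quantize awq` launcher flag), it can be queried with the `text-generation` client. This is a sketch, assuming a server listening locally on port 8080:

```python
# Query a running TGI endpoint (pip install text-generation).
from text_generation import Client

client = Client("http://127.0.0.1:8080")
response = client.generate(
    "<<SYS>>\nYou are Assistant, a sentient AI.\n<</SYS>>\n\n"
    "<s>[INST] Introduce yourself. [/INST] ",
    max_new_tokens=256,
)
print(response.generated_text)
```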