---
language:
- en
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
- bigcode/starcoderdata
model_name: Tinyllama 1.1B Intermediate Step 1431K 3T
model_creator: TinyLlama
model_type: tinyllama
prompt_template: '{prompt}'
quantized_by: Znerual
---

# Tinyllama 1.1B Intermediate Step 1431K 3T - AWQ

## Description

### About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality than the most commonly used GPTQ settings.

AWQ models are currently supported on Linux and Windows, with NVIDIA GPUs only. macOS users: please use GGUF models instead.

AWQ is supported by:

- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - version 0.2.2 or later for support for all model types.
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [Transformers](https://huggingface.co/docs/transformers) version 4.35.0 and later, from any code or client that supports Transformers
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code

<!-- description end -->

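As a rough illustration of what 4-bit weights mean for a model of this size, the sketch below estimates weight storage at 16 and 4 bits per parameter. It is back-of-the-envelope arithmetic only: the parameter count is an assumption taken from the "1.1B" in the model name, and the estimate ignores the per-group scales and zero points that AWQ stores alongside the weights, any layers left unquantized, activations, and the KV cache.

```python
# Rough weights-only storage estimate; not a measurement of this repository's files.
PARAMS = 1.1e9  # assumption: nominal parameter count from the model name


def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Return storage in GB (10**9 bytes) for weights at the given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gigabytes(PARAMS, 16)
int4_gb = weight_gigabytes(PARAMS, 4)
print(f"fp16: {fp16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
```

The real on-disk and in-memory sizes will be somewhat larger than the 4-bit figure because of the quantization metadata and unquantized tensors mentioned above.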
<!-- repositories-available start -->

<!-- README_AWQ.md-provided-files end -->

<!-- README_AWQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click installers unless you're sure you know how to do a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `Znerual/TinyLlama-1.1B-intermediate-step-1431k-3T-AWQ`.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `TinyLlama-1.1B-intermediate-step-1431k-3T-AWQ`
7. Select **Loader: AutoAWQ**.
8. Click **Load**. Once loading finishes, the model is ready for use.
9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
10. Once you're ready, click the **Text Generation** tab and enter a prompt to get started!
<!-- README_AWQ.md-text-generation-webui end -->

<!-- README_AWQ.md-use-from-vllm start -->
## Multi-user inference server: vLLM

Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).

- Please ensure you are using vLLM version 0.2 or later.
- When using vLLM as a server, pass the `--quantization awq` parameter.

For example:

```shell
python3 -m vllm.entrypoints.api_server --model Znerual/TinyLlama-1.1B-intermediate-step-1431k-3T-AWQ --quantization awq --dtype auto
```

- When using vLLM from Python code, again set `quantization="awq"`.

For example:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# The model card metadata declares prompt_template '{prompt}',
# so each prompt is passed through unchanged.
# This is a plain string (not an f-string) so that .format() can fill it in below.
prompt_template = "{prompt}"

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="Znerual/TinyLlama-1.1B-intermediate-step-1431k-3T-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

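Once the API server from the shell example is running, it can also be queried over HTTP. The following is a minimal sketch using only the standard library. It assumes the server listens on `localhost:8000` and exposes vLLM's demo `/generate` endpoint, which takes a JSON body with `prompt` plus sampling parameters and returns a JSON object with a `text` list. These details are assumptions about the demo server and may differ between vLLM versions, so check them against the version you have installed. The helper names `build_generate_request` and `generate` are illustrative only.

```python
import json
import urllib.request

# Assumptions: server started as in the shell example, default port 8000,
# demo endpoint at /generate.
API_URL = "http://localhost:8000/generate"


def build_generate_request(prompt: str, max_tokens: int = 128,
                           temperature: float = 0.8, top_p: float = 0.95,
                           url: str = API_URL) -> urllib.request.Request:
    """Build a POST request whose JSON body carries the prompt and sampling parameters."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> list:
    """Send the request and return the list of texts from the JSON response."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))["text"]
```

With the server running, calling `generate("Tell me about AI")` sends a single request. Nothing is sent when the module is merely imported.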
<!-- README_AWQ.md-use-from-vllm end -->

<!-- README_AWQ.md-use-from-tgi start -->
## Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`

Example Docker parameters:

```shell
--model-id Znerual/TinyLlama-1.1B-intermediate-step-1431k-3T-AWQ --port 3000 --quantize awq --max-input-length 1902 --max-total-tokens 2048 --max-batch-prefill-tokens 2048
```

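The two length limits in the parameters above interact: `--max-input-length` caps the prompt, while `--max-total-tokens` caps prompt plus generated tokens. A small sketch of the resulting budget, using the values from the example; the helper name `max_new_tokens_allowed` is made up for illustration and is not part of TGI.

```python
MAX_INPUT_LENGTH = 1902   # --max-input-length from the parameters above
MAX_TOTAL_TOKENS = 2048   # --max-total-tokens from the parameters above


def max_new_tokens_allowed(prompt_tokens: int) -> int:
    """Return how many tokens can still be generated for a prompt of the given length."""
    if prompt_tokens > MAX_INPUT_LENGTH:
        raise ValueError(f"prompt has {prompt_tokens} tokens, limit is {MAX_INPUT_LENGTH}")
    return MAX_TOTAL_TOKENS - prompt_tokens


print(max_new_tokens_allowed(1902))  # longest allowed prompt -> 146
print(max_new_tokens_allowed(100))   # short prompt -> 1948
```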
Example Python code for interfacing with TGI (requires [huggingface-hub](https://github.com/huggingface/huggingface_hub) 0.17.0 or later):

```shell
pip3 install huggingface-hub
```

```python
from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
# The model card metadata declares prompt_template '{prompt}',
# so the prompt is sent unchanged.
prompt_template = f"{prompt}"

client = InferenceClient(endpoint_url)
response = client.text_generation(
    prompt_template,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)

print("Model output: ", response)
```
<!-- README_AWQ.md-use-from-tgi end -->

<!-- README_AWQ.md-use-from-python start -->
## Inference from Python code using Transformers

### Install the necessary packages

- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.35.0 or later.
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.6 or later.

```shell
pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"
```

Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and wish to continue using PyTorch 2.0.1, instead run this command:

```shell
pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
```

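The wheel filename above encodes which interpreter and platform it targets: the last three dash-separated fields are the Python, ABI, and platform tags, here CPython 3.10 on Linux x86_64. The sketch below splits out those tags and does a rough comparison with the running interpreter. It is a simplified check, and the helper names are illustrative; pip's own tag matching is more thorough.

```python
import platform
import sys

WHEEL = "autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl"


def wheel_tags(filename: str) -> tuple:
    """Split a wheel filename into its (python, abi, platform) compatibility tags."""
    stem = filename[: -len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def matches_this_interpreter(filename: str) -> bool:
    """Rough check: CPython version and OS/architecture match the wheel's tags."""
    python_tag, _abi_tag, platform_tag = wheel_tags(filename)
    running_python = f"cp{sys.version_info.major}{sys.version_info.minor}"
    running_platform = f"{sys.platform}_{platform.machine()}".lower()
    return (
        sys.implementation.name == "cpython"
        and python_tag == running_python
        and platform_tag == running_platform
    )


print(wheel_tags(WHEEL))
print(matches_this_interpreter(WHEEL))
```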
If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```

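After installing, one way to confirm that the minimum versions listed in this section are met is to read the installed package metadata. This is a minimal sketch using only the standard library. The version parsing is deliberately simple (leading integers only, local and pre-release suffixes ignored), and the helper names are illustrative.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated in this section.
MINIMUMS = {"transformers": (4, 35, 0), "autoawq": (0, 1, 6)}


def parse_version(text: str) -> tuple:
    """Turn '4.35.2' or '0.1.6+cu118' into a tuple of leading integers, padded to 3 parts."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_minimums() -> dict:
    """Map each package to True/False (meets the minimum) or None (not installed)."""
    results = {}
    for name, minimum in MINIMUMS.items():
        try:
            results[name] = parse_version(version(name)) >= minimum
        except PackageNotFoundError:
            results[name] = None
    return results


print(check_minimums())
```

A value of `None` means the package was not found in the current environment.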
### Transformers example code (requires Transformers 4.35.0 and later)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "Znerual/TinyLlama-1.1B-intermediate-step-1431k-3T-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
# The model card metadata declares prompt_template '{prompt}',
# so the prompt is used unchanged.
prompt_template = f"{prompt}"

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
```

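As the comment in the example notes, `model.generate` without a streamer returns the prompt tokens followed by the newly generated ones. Below is a minimal sketch of keeping only the new tokens, shown on plain Python lists so it runs without a GPU. The token IDs are made up for illustration and the helper name is hypothetical. With the tensors from the example, the equivalent slice would be `generation_output[0][tokens.shape[-1]:]`, applied before calling `tokenizer.decode`.

```python
def strip_prompt_tokens(output_ids: list, prompt_length: int) -> list:
    """Drop the first `prompt_length` IDs, which echo the prompt, and keep the rest."""
    return output_ids[prompt_length:]


# Hypothetical token IDs: three prompt tokens followed by four generated tokens.
prompt_ids = [1, 24948, 592]
output_ids = prompt_ids + [1048, 29902, 13, 2]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)  # [1048, 29902, 13, 2]
```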
<!-- README_AWQ.md-use-from-python end -->

<!-- README_AWQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with:

- [text-generation-webui](https://github.com/oobabooga/text-generation-webui) using `Loader: AutoAWQ`.
- [vLLM](https://github.com/vllm-project/vllm) version 0.2.0 and later.
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.1.0 and later.
- [Transformers](https://huggingface.co/docs/transformers) version 4.35.0 and later.
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) version 0.1.1 and later.

# Original model card: Tinyllama 1.1B

https://github.com/jzhang38/TinyLlama

The TinyLlama project aims to **pretrain** a **1.1B Llama model on 3 trillion tokens**. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training started on 2023-09-01.

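
To put those figures in perspective, here is a quick back-of-the-envelope calculation of the throughput they imply. It uses only the numbers quoted above and assumes all 16 GPUs run continuously for the full 90 days, which is an idealization rather than a reported measurement.

```python
TOKENS = 3e12   # 3 trillion training tokens
DAYS = 90       # planned training duration
GPUS = 16       # A100-40G GPUs

seconds = DAYS * 24 * 60 * 60
total_tokens_per_second = TOKENS / seconds
per_gpu_tokens_per_second = total_tokens_per_second / GPUS

print(f"{total_tokens_per_second:,.0f} tokens/s overall")
print(f"{per_gpu_tokens_per_second:,.0f} tokens/s per GPU")
```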
<div align="center">
  <img src="./TinyLlama_logo.png" width="300"/>
</div>

We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.

#### This Collection
This collection contains all checkpoints after the 1T fix. The branch name indicates the step and the number of tokens seen.

#### Eval

| Model | Pretrain Tokens | HellaSwag | Obqa | WinoGrande | ARC_c | ARC_e | boolq | piqa | avg |
|---------------------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Pythia-1.0B                                 | 300B  | 47.16 | 31.40 | 53.43 | 27.05 | 48.99 | 60.83 | 69.21 | 48.30 |
| TinyLlama-1.1B-intermediate-step-50K-104b   | 103B  | 43.50 | 29.80 | 53.28 | 24.32 | 44.91 | 59.66 | 67.30 | 46.11 |
| TinyLlama-1.1B-intermediate-step-240k-503b  | 503B  | 49.56 | 31.40 | 55.80 | 26.54 | 48.32 | 56.91 | 69.42 | 48.28 |
| TinyLlama-1.1B-intermediate-step-480k-1007B | 1007B | 52.54 | 33.40 | 55.96 | 27.82 | 52.36 | 59.54 | 69.91 | 50.22 |
| TinyLlama-1.1B-intermediate-step-715k-1.5T  | 1.5T  | 53.68 | 35.20 | 58.33 | 29.18 | 51.89 | 59.08 | 71.65 | 51.29 |
| TinyLlama-1.1B-intermediate-step-955k-2T    | 2T    | 54.63 | 33.40 | 56.83 | 28.07 | 54.67 | 63.21 | 70.67 | 51.64 |
| TinyLlama-1.1B-intermediate-step-1195k-2.5T | 2.5T  | 58.96 | 34.40 | 58.72 | 31.91 | 56.78 | 63.21 | 73.07 | 53.86 |
| TinyLlama-1.1B-intermediate-step-1431k-3T   | 3T    | 59.20 | 36.00 | 59.12 | 30.12 | 55.25 | 57.83 | 73.29 | 52.99 |

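
The `avg` column appears to be the unweighted mean of the seven benchmark scores in each row. The sketch below recomputes it for three rows copied from the table. For the 3T checkpoint the mean of the listed scores comes out at 52.97, slightly below the 52.99 shown, so treat the last digit of that entry with some caution.

```python
from statistics import mean

# Scores copied from the table: HellaSwag, Obqa, WinoGrande, ARC_c, ARC_e, boolq, piqa.
ROWS = {
    "Pythia-1.0B": [47.16, 31.40, 53.43, 27.05, 48.99, 60.83, 69.21],
    "TinyLlama-1.1B-intermediate-step-1195k-2.5T": [58.96, 34.40, 58.72, 31.91, 56.78, 63.21, 73.07],
    "TinyLlama-1.1B-intermediate-step-1431k-3T": [59.20, 36.00, 59.12, 30.12, 55.25, 57.83, 73.29],
}

# Unweighted mean per row, rounded to two decimals like the table.
averages = {name: round(mean(scores), 2) for name, scores in ROWS.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```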