qaihm-bot committed
Commit 6c35f88
1 Parent(s): 9a7f9da

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +41 -8
README.md CHANGED
@@ -15,7 +15,7 @@ tags:
 # Llama-v2-7B-Chat: Optimized for Mobile Deployment
 ## State-of-the-art large language model useful on a variety of language understanding and generation tasks
 
- Llama 2 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to 4-bit weights and 16-bit activations, making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is Llama-PromptProcessor-Quantized's latency and the average time per additional token is Llama-TokenGenerator-KVCache-Quantized's latency.
+ Llama 2 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to w4a16 (4-bit weights and 16-bit activations), with part of the model quantized to w8a16 (8-bit weights and 16-bit activations), making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is Llama-PromptProcessor-Quantized's latency and the average time per additional token is Llama-TokenGenerator-KVCache-Quantized's latency.
 
 This model is an implementation of Llama-v2-7B-Chat found [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
 This repository provides scripts to run Llama-v2-7B-Chat on Qualcomm® devices.
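To see how the two latencies in that paragraph combine, consider a single chat turn: the prompt processor runs once, then the token generator runs once per additional output token. Below is a back-of-the-envelope sketch using the Samsung Galaxy S23 Ultra estimates from the device table further down; the helper function is illustrative, not part of this repository.

```python
# Rough end-to-end latency for one response: time to first token
# (one prompt-processor pass) plus one token-generator pass per
# additional output token. The two constants are the Galaxy S23 Ultra
# estimates quoted in the device table below, not guarantees.
TTFT_MS = 1917.811     # Llama2-PromptProcessor-Quantized, 1024-token prompt
PER_TOKEN_MS = 90.268  # Llama2-TokenGenerator-KVCache-Quantized, per token

def estimated_response_ms(num_output_tokens: int) -> float:
    return TTFT_MS + PER_TOKEN_MS * max(num_output_tokens - 1, 0)

print(f"{estimated_response_ms(30) / 1000:.1f} s")  # ~4.5 s for a 30-token reply
```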
@@ -28,17 +28,18 @@ More details on model performance across various devices can be found
 - **Model Type:** Text generation
 - **Model Stats:**
 - Number of parameters: 7B
- - Model size: 3.6GB
+ - Precision: w4a16 + w8a16 (few layers)
 - Model-1 (Prompt Processor): Llama-PromptProcessor-Quantized
 - Max context length: 1024
+ - Prompt processor model size: 3.6 GB
 - Prompt processor input: 1024 tokens
 - Prompt processor output: 1024 output tokens + KVCache for token generator
 - Model-2 (Token Generator): Llama-TokenGenerator-KVCache-Quantized
+ - Token generator model size: 3.6 GB
 - Token generator input: 1 input token + past KVCache
 - Token generator output: 1 output token + KVCache for next iteration
 - Decoding length: 1024 (1 output token + 1023 from KVCache)
 - Use: Initiate conversation with prompt-processor and then token generator for subsequent iterations.
- - QNN-SDK: 2.19
 
 ## Deploying Llama 2 on-device
 
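The stats above describe a split-decode pipeline: the prompt processor consumes the whole prompt in one pass and hands its KVCache to the token generator, which then produces one token per pass. A minimal sketch of that loop, assuming greedy decoding; `prompt_processor` and `token_generator` are hypothetical callables standing in for the two exported models:

```python
def generate(prompt_tokens, prompt_processor, token_generator, max_new_tokens):
    # Model-1: one pass over up to 1024 prompt tokens; returns logits
    # plus the KVCache the token generator needs.
    logits, kv_cache = prompt_processor(prompt_tokens)
    token = logits.argmax()  # greedy pick of the first output token

    output = [token]
    for _ in range(max_new_tokens - 1):
        # Model-2: 1 input token + past KVCache in; 1 output token +
        # updated KVCache out (decoding length 1024 = 1 + 1023 cached).
        logits, kv_cache = token_generator(token, kv_cache)
        token = logits.argmax()
        output.append(token)
    return output
```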
@@ -61,14 +62,46 @@ Here, we divide the model into 4 parts in order to
 
 In order to export Llama 2, please ensure
 1. Host machine has >40GB memory (RAM+swap-space)
- 2. If you don't have enough memory, export.py will dump instructions to increase swap space accordingly
+ 2. If you don't have enough memory, export.py will dump instructions to increase swap space accordingly.
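Before kicking off the export, it can help to confirm the memory budget up front. A hypothetical pre-flight check (the ~40 GB threshold comes from the requirement above; the psutil-based check is an illustration, not what export.py actually runs):

```python
import psutil  # third-party: pip install psutil

# The export needs RAM + swap above ~40 GB (see requirement above).
total_gb = (psutil.virtual_memory().total + psutil.swap_memory().total) / 1024**3
print(f"RAM + swap: {total_gb:.1f} GB -> {'OK' if total_gb > 40 else 'increase swap'}")
```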
+
+ ## Sample output prompts generated on-device
+ 1. --prompt "what is gravity?" --max-output-tokens 30
+ ~~~
+ -------- Response Summary --------
+ Prompt: what is gravity?
+ Response: Hello! I'm here to help you answer your question. Gravity is a fundamental force of nature that affects the behavior of objects with mass
+ ~~~
+
+ 2. --prompt "what is 2+3?" --max-output-tokens 30
+ ~~~
+ -------- Response Summary --------
+ Prompt: what is 2+3?
+ Response: Of course! I'm happy to help! The answer to 2+3 is 5.
+ ~~~
+
+ 3. --prompt "could you please write code for fibonacci series in python?" --max-output-tokens 100
+ ~~~
+ -------- Response Summary --------
+ Prompt: could you please write code for fibonacci series in python?
+ Response: Of course! Here is an example of how you could implement the Fibonacci sequence in Python:
+ ```
+ def fibonacci(n):
+     if n <= 1:
+         return n
+     else:
+         return fibonacci(n-1) + fibonacci(n-2)
+ ```
+ You can test the function by calling it with different values of `n`, like this:
+ ```
+ print(fibonacci(5))
+ ~~~
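The flags shown above correspond to the model's demo entry point (the third response ends mid-snippet because generation stops at the 100-token cap). Assuming the usual qai_hub_models layout, with a `demo` module alongside the `export` module invoked later in this README (the exact module name is an assumption), the first sample would be reproduced with:

```
python -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo --prompt "what is gravity?" --max-output-tokens 30
```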
 
 | Device | Chipset | Target Runtime | Inference Time (ms) | Peak Memory Range (MB) | Precision | Primary Compute Unit | Target Model |
 | ---|---|---|---|---|---|---|---|
- | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 104.953 ms | 316 - 4785 MB | UINT16 | NPU | Llama-TokenGenerator-KVCache-Quantized |
- | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 1917.811 ms | 0 - 1028 MB | UINT16 | NPU | Llama-PromptProcessor-Quantized |
+ | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 90.268 ms | 64 - 4351 MB | UINT16 | NPU | Llama2-TokenGenerator-KVCache-Quantized |
+ | Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 1917.811 ms | 0 - 1028 MB | UINT16 | NPU | Llama2-PromptProcessor-Quantized |
 
@@ -128,14 +161,14 @@ python -m qai_hub_models.models.llama_v2_7b_chat_quantized.export
 ```
 
 ```
- Profile Job summary of Llama-TokenGenerator-KVCache-Quantized
+ Profile Job summary of Llama2-TokenGenerator-KVCache-Quantized
 --------------------------------------------------
 Device: Snapdragon X Elite CRD (11)
 Estimated Inference Time: 118.14 ms
 Estimated Peak Memory Range: 64.97-64.97 MB
 Compute Units: NPU (34842) | Total (34842)
 
- Profile Job summary of Llama-PromptProcessor-Quantized
+ Profile Job summary of Llama2-PromptProcessor-Quantized
 --------------------------------------------------
 Device: Snapdragon X Elite CRD (11)
 Estimated Inference Time: 2302.57 ms
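Read as throughput, the Snapdragon X Elite estimates above work out roughly as follows (this arithmetic is ours, not part of the profile job output):

```python
token_gen_ms = 118.14     # one token-generator pass (one output token)
prompt_proc_ms = 2302.57  # one pass over a 1024-token prompt

print(f"{1000 / token_gen_ms:.1f} tokens/s decode")             # ~8.5
print(f"{1024 / (prompt_proc_ms / 1000):.0f} prompt tokens/s")  # ~445
```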