reach-vb and alvarobartt committed
Commit 4c64cc4
1 Parent(s): efb4b3b

Update README.md (#6)


- Update README.md (2b69026a6b73f7b265c23b06e57f22fa587998ba)


Co-authored-by: Alvaro Bartolome <alvarobartt@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +134 -1
README.md CHANGED
@@ -116,7 +116,140 @@ The AutoAWQ script has been adapted from [AutoAWQ/examples/generate.py](https://

### 🤗 Text Generation Inference (TGI)

- Coming soon!

To run the `text-generation-launcher` with Llama 3.1 405B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see the [installation notes](https://docs.docker.com/engine/install/)) and the `huggingface_hub` Python package, since you need to log in to the Hugging Face Hub.

```bash
pip install -q --upgrade huggingface_hub
huggingface-cli login
```

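Alternatively, if an interactive prompt is not convenient (e.g. on a remote node), the token can be passed to `huggingface-cli login` directly; here `HF_TOKEN` is assumed to be an environment variable holding a valid Hugging Face access token:

```bash
# Non-interactive login using a token stored in the HF_TOKEN environment variable
huggingface-cli login --token $HF_TOKEN
```
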
Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows:

```bash
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
    -v hf_cache:/data \
    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
    -e NUM_SHARD=8 \
    -e QUANTIZE=awq \
    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
    -e MAX_INPUT_LENGTH=4000 \
    -e MAX_TOTAL_TOKENS=4096 \
    ghcr.io/huggingface/text-generation-inference:2.2.0
```

> [!NOTE]
> TGI exposes several endpoints; to see all of the available endpoints, check the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/#/).

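For instance, besides the OpenAI-compatible `/v1/chat/completions` route used below, TGI also serves its native `/generate` route. A minimal sketch (note that `/generate` takes a raw prompt, so the Llama 3.1 chat template is not applied for you):

```bash
curl 0.0.0.0:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 128}
    }'
```
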
To send a request to the deployed TGI endpoint, which is compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:

```bash
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
```

Or programmatically via the `huggingface_hub` Python client as follows:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))

chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
```

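Since `chat_completion` follows the OpenAI response schema, the generated text can then be read from the first choice:

```python
# Print just the generated assistant message
print(chat_completion.choices[0].message.content)
```
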
Alternatively, the OpenAI Python client can also be used (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
```

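Both clients also support token streaming. A minimal sketch with the `openai` client (same endpoint and placeholder API key as above), assuming the TGI container started earlier is still running:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))

# Stream the response token by token instead of waiting for the full completion
stream = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; the final chunk may have no content
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
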
### vLLM

To run vLLM with Llama 3.1 405B Instruct AWQ in INT4, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and run the latest vLLM Docker container as follows:

```bash
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v hf_cache:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 8 \
    --max-model-len 4096
```

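Once the server reports it is ready, a quick way to sanity-check the deployment is to list the served models via the `/v1/models` route of the vLLM OpenAI-compatible server:

```bash
# Should list hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 as the served model
curl 0.0.0.0:8000/v1/models
```
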
To send a request to the deployed vLLM endpoint, which is compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:

```bash
curl 0.0.0.0:8000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
```

Or programmatically via the `openai` Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))

chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
```

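As with the TGI examples, the generated reply (and the token usage reported by vLLM) can then be read from the response object:

```python
# Print the generated reply and the reported token usage
print(chat_completion.choices[0].message.content)
print(chat_completion.usage)
```
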
## Quantization Reproduction