JRosenkranz committed 924f16b (parent 0699e13): updated readme with samples

README.md
---
license: llama2
---

To try this out in a production-like environment, use the pre-built Docker image:

```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
    --name my-tgis-server \
    -v /path/to/all/models:/models \
    -e MODEL_NAME=/models/model_weights/llama/13B-F \
    -e SPECULATOR_PATH=/models/speculator_weights/llama/13B-F \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE_STR=float16 \
    docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7

docker logs my-tgis-server -f
docker exec -it my-tgis-server python /path-to-example-code/sample_client.py
```
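A common failure mode with the `docker run` invocation above is a typo in the mounted weight paths, which only surfaces once the server tries to load the model. A small sanity check on the host can catch this earlier; a minimal sketch, assuming the placeholder paths from the example (`missing_dirs` is a hypothetical helper, not part of TGIS):

```python
from pathlib import Path

# Placeholder paths from the docker run example above;
# substitute your real weight locations before running.
weight_dirs = [
    "/path/to/all/models/model_weights/llama/13B-F",
    "/path/to/all/models/speculator_weights/llama/13B-F",
]

def missing_dirs(paths):
    """Return the subset of paths that do not exist as directories."""
    return [p for p in paths if not Path(p).is_dir()]

for p in missing_dirs(weight_dirs):
    print(f"missing: {p}")
```

If anything prints, fix the host paths (or the `-v` mount) before starting the container.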
To try this out with the fms-native compiled model, please execute the following:

#### batch_size=1 (compile + cudagraphs)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --compile \
    --compile_mode=reduce-overhead
```
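The only difference between the "compile + cudagraphs" and plain "compile" variants is the compile mode. In PyTorch, `mode="reduce-overhead"` asks `torch.compile` to use CUDA graphs to cut per-call kernel-launch overhead, which tends to pay off at small batch sizes. A minimal, model-agnostic sketch of the two modes (`step` is a stand-in function, not the fms-extras code):

```python
import torch

def step(x):
    # Stand-in for a decode step; any tensor function works here.
    return torch.relu(x) * 2

# Default compilation (the plain "compile" variants below):
compiled = torch.compile(step)

# CUDA-graph capture to reduce launch overhead
# (the "compile + cudagraphs" variant above):
compiled_cg = torch.compile(step, mode="reduce-overhead")
```

Compilation is lazy in both cases: the graph is actually built on the first call.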
40 |
+
|
41 |
+
#### batch_size=1 (compile)
|
42 |
+
|
43 |
+
```bash
|
44 |
+
git clone https://github.com/foundation-model-stack/fms-extras
|
45 |
+
(cd fms-extras && pip install -e .)
|
46 |
+
pip install transformers==4.35.0 sentencepiece numpy
|
47 |
+
python fms-extras/scripts/paged_speculative_inference.py \
|
48 |
+
--variant=13b \
|
49 |
+
--model_path=/path/to/model_weights/llama/13B-F \
|
50 |
+
--model_source=hf \
|
51 |
+
--tokenizer=/path/to/llama/13B-F \
|
52 |
+
--speculator_path=/path/to/speculator_weights/llama/13B-F \
|
53 |
+
--speculator_source=hf \
|
54 |
+
--compile \
|
55 |
+
```
#### batch_size=4 (compile)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --batch_input \
    --compile
```
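All three invocations run speculative decoding: the small speculator drafts a few tokens ahead, the base model verifies them in a single forward pass, and the longest agreeing prefix is kept. A toy sketch of that accept/verify idea (illustrative only; `speculative_step` and `verify_token_fn` are hypothetical names, not the fms-extras API):

```python
def speculative_step(draft_tokens, verify_token_fn):
    """Accept drafted tokens while the verifier (base model) agrees.

    draft_tokens: tokens proposed by the small speculator.
    verify_token_fn: returns the base model's token at a given position.
    Returns the accepted prefix, with one corrected token on mismatch.
    """
    accepted = []
    for pos, tok in enumerate(draft_tokens):
        expected = verify_token_fn(pos)
        if tok == expected:
            accepted.append(tok)       # speculator guessed correctly
        else:
            accepted.append(expected)  # take the base model's token and stop
            break
    return accepted
```

When most drafts are accepted, each base-model forward pass yields several tokens instead of one, which is where the speedup comes from.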