---
license: llama2
---
## Description
This model is intended to be used as an accelerator for Llama 13B (chat).
The underlying implementation of the paged-attention KV cache and the speculator can be found in https://github.com/foundation-model-stack/fms-extras
A production implementation built on `fms-extras` can be found in https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding
## Samples
### Production Server Sample
*To try this out in a production-like environment, use the pre-built Docker image:*
#### Setup
```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
--name my-tgis-server \
-p 8033:8033 \
-v /path/to/all/models:/models \
-e MODEL_NAME=/models/model_weights/llama/13B-F \
-e SPECULATOR_PATH=/models/speculator_weights/llama/13B-F \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE_STR=float16 \
docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f
# get the client sample (Note: The first prompt will take longer as there is a warmup time)
conda create -n tgis-env python=3.11
conda activate tgis-env
git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```
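Instead of watching the logs by hand, the wait step can be scripted. The following is a minimal sketch, not part of the image or the repositories above: `wait_for_log` is a hypothetical helper name, the 5-second polling interval and 600-second default timeout are arbitrary choices, and the readiness message is the one quoted in the setup comments.

```bash
# Hypothetical helper (not shipped with the image): poll a container's logs
# until a fixed string appears, or give up after a timeout in seconds.
wait_for_log() {
  local container="$1"
  local pattern="$2"
  local timeout="${3:-600}"   # assumed default, adjust to your hardware
  local waited=0
  # grep -F treats the pattern as a literal string; -q only sets the exit code
  until docker logs "$container" 2>&1 | grep -qF -- "$pattern"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for: $pattern" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}

# Example (run manually once the container has been started):
#   wait_for_log my-tgis-server "gRPC server started on port 8033" && echo ready
```

Polling `docker logs` rather than piping `docker logs -f` into `grep` avoids leaving a follower process behind after the message has been found.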
#### Run Sample
```bash
python sample_client.py
```
### Minimal Sample
*To try this out with the fms-native compiled model, please execute the following:*
#### Install
```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```
#### Run Sample
##### batch_size=1 (compile + cudagraphs)
```bash
python fms-extras/scripts/paged_speculative_inference.py \
--variant=13b \
--model_path=/path/to/model_weights/llama/13B-F \
--model_source=hf \
--tokenizer=/path/to/llama/13B-F \
--speculator_path=/path/to/speculator_weights/llama/13B-F \
--speculator_source=hf \
--compile \
--compile_mode=reduce-overhead
```
##### batch_size=1 (compile)
```bash
python fms-extras/scripts/paged_speculative_inference.py \
--variant=13b \
--model_path=/path/to/model_weights/llama/13B-F \
--model_source=hf \
--tokenizer=/path/to/llama/13B-F \
--speculator_path=/path/to/speculator_weights/llama/13B-F \
--speculator_source=hf \
--compile
```
##### batch_size=4 (compile)
```bash
python fms-extras/scripts/paged_speculative_inference.py \
--variant=13b \
--model_path=/path/to/model_weights/llama/13B-F \
--model_source=hf \
--tokenizer=/path/to/llama/13B-F \
--speculator_path=/path/to/speculator_weights/llama/13B-F \
--speculator_source=hf \
--batch_input \
--compile
```
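The three invocations above differ only in their trailing flags, so the shared arguments can be factored into a small wrapper. This is a sketch under stated assumptions: `run_sample`, `MODEL_DIR`, `TOKENIZER_DIR`, `SPECULATOR_DIR`, and `DRY_RUN` are hypothetical names introduced here for illustration and are not part of `fms-extras`; the script path and flags are the ones used in the samples above.

```bash
# Placeholder locations (same placeholder paths as in the samples above).
MODEL_DIR="${MODEL_DIR:-/path/to/model_weights/llama/13B-F}"
TOKENIZER_DIR="${TOKENIZER_DIR:-/path/to/llama/13B-F}"
SPECULATOR_DIR="${SPECULATOR_DIR:-/path/to/speculator_weights/llama/13B-F}"

# Hypothetical wrapper: prepend the shared flags, then append whatever
# variant-specific flags the caller passes (e.g. --compile, --batch_input).
run_sample() {
  set -- python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    "--model_path=$MODEL_DIR" \
    --model_source=hf \
    "--tokenizer=$TOKENIZER_DIR" \
    "--speculator_path=$SPECULATOR_DIR" \
    --speculator_source=hf \
    "$@"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"   # print the command instead of running it
  else
    "$@"
  fi
}
```

With this in place, the three variants become `run_sample --compile --compile_mode=reduce-overhead`, `run_sample --compile`, and `run_sample --batch_input --compile`. Setting `DRY_RUN=1` prints the assembled command line for inspection before anything is executed.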