---
license: llama2
---

To try this out running in a production-like environment, please use the pre-built Docker image:

```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7

docker run -d --rm --gpus all \
    --name my-tgis-server \
    -v /path/to/all/models:/models \
    -e MODEL_NAME=/models/model_weights/llama/13B-F \
    -e SPECULATOR_PATH=/models/speculator_weights/llama/13B-F \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE_STR=float16 \
    docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7

# follow the server logs
docker logs my-tgis-server -f

# run a sample client inside the container
docker exec -it my-tgis-server python /path-to-example-code/sample_client.py
```

To try this out with the fms-native compiled model, please execute the following:

#### batch_size=1 (compile + cudagraphs)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy

python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --compile \
    --compile_mode=reduce-overhead
```

#### batch_size=1 (compile)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy

python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --compile
```

#### batch_size=4 (compile)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy

python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --batch_input \
    --compile
```
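Both the Docker and script invocations above expect base-model and speculator weights in sibling directories under one root. The sketch below is only an illustration of that layout: `MODELS_ROOT` stands in for the `/path/to/all/models` placeholder, and the folder names are taken from the commands above, not fixed requirements.

```shell
# Hypothetical sketch of the weights layout assumed above.
# Override MODELS_ROOT to point at your actual storage; it defaults
# to a throwaway temp dir so the sketch runs as-is.
MODELS_ROOT=${MODELS_ROOT:-$(mktemp -d)}

mkdir -p "$MODELS_ROOT/model_weights/llama/13B-F"       # base model weights
mkdir -p "$MODELS_ROOT/speculator_weights/llama/13B-F"  # speculator weights

# In the Docker example, this root is mounted at /models, so
# MODEL_NAME and SPECULATOR_PATH resolve inside the container;
# the script examples take the host-side paths directly.
ls "$MODELS_ROOT"
```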