{
"cells": [
{
"cell_type": "markdown",
"id": "28e3460d-59b1-4d4c-b62e-510987fb2f28",
"metadata": {},
"source": [
"# Introduction\n",
"This notebook is to show how to launch the TGI Benchmark tool. \n",
"\n",
"## Warning\n",
"Please note that the TGI Benchmark tool is designed to work in a terminal, not a jupyter notebook. This means you will need to copy/paste the command in a jupyter terminal tab. I am putting them here for convenience."
]
},
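{
 "cell_type": "markdown",
 "id": "3f6b1a2c-8d4e-4b90-a1c5-7e2f9d0b6a18",
 "metadata": {},
 "source": [
  "The benchmark tool bypasses the router and talks directly to the model shard over its gRPC socket (see `--master-shard-uds-path` in the help output below), so a TGI server must already be running for the model under test. A minimal sketch, assuming TGI is installed and the launcher defaults suit your hardware:\n",
  "```bash\n",
  "# Start a TGI server in a separate terminal first; additional flags\n",
  "# (e.g. quantization settings) may be needed depending on the model\n",
  "text-generation-launcher --model-id astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit\n",
  "```"
 ]
},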
{
"cell_type": "markdown",
"id": "c0de3cc9-c6cd-45b3-9dd0-84b3cb2fc8b2",
"metadata": {},
"source": [
"Here we can see the different settings for TGI Benchmark. \n",
"\n",
"Here are some of the most important ones:\n",
"\n",
"- `--tokenizer-name` This is required so the tool knows what tokenizer to use\n",
"- `--batch-size` This is important for load testing. We should use more and more values to see what happens to throughput and latency\n",
"- `--sequence-length` AKA input tokens, it is important to match your use-case needs\n",
"- `--decode-length` AKA output tokens, it is important to match your use-case needs\n",
"- `--runs` Use fewer when you are exploring, but the default 10 is good for your final estimate"
]
},
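{
 "cell_type": "markdown",
 "id": "9c4e7f21-5a3b-4d86-b0e9-2f8d1c6a4b73",
 "metadata": {},
 "source": [
  "Before committing to a full sweep, a quick low-cost run helps confirm everything works. A minimal sketch using only the flags documented in the help output below (the small values are illustrative, not recommendations):\n",
  "```bash\n",
  "# Quick exploratory run: few runs, short prompts, a single batch size\n",
  "text-generation-benchmark \\\n",
  "--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \\\n",
  "--sequence-length 100 \\\n",
  "--decode-length 20 \\\n",
  "--runs 3 \\\n",
  "--batch-size 1\n",
  "```"
 ]
},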
{
"cell_type": "code",
"execution_count": 1,
"id": "694df6d6-a521-4dab-977b-2828d4250781",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Text Generation Benchmarking tool\n",
"\n",
"\u001b[1m\u001b[4mUsage:\u001b[0m \u001b[1mtext-generation-benchmark\u001b[0m [OPTIONS] \u001b[1m--tokenizer-name\u001b[0m <TOKENIZER_NAME>\n",
"\n",
"\u001b[1m\u001b[4mOptions:\u001b[0m\n",
" \u001b[1m-t\u001b[0m, \u001b[1m--tokenizer-name\u001b[0m <TOKENIZER_NAME>\n",
" The name of the tokenizer (as in model_id on the huggingface hub, or local path) [env: TOKENIZER_NAME=]\n",
" \u001b[1m--revision\u001b[0m <REVISION>\n",
" The revision to use for the tokenizer if on the hub [env: REVISION=] [default: main]\n",
" \u001b[1m-b\u001b[0m, \u001b[1m--batch-size\u001b[0m <BATCH_SIZE>\n",
" The various batch sizes to benchmark for, the idea is to get enough batching to start seeing increased latency, this usually means you're moving from memory bound (usual as BS=1) to compute bound, and this is a sweet spot for the maximum batch size for the model under test\n",
" \u001b[1m-s\u001b[0m, \u001b[1m--sequence-length\u001b[0m <SEQUENCE_LENGTH>\n",
" This is the initial prompt sent to the text-generation-server length in token. Longer prompt will slow down the benchmark. Usually the latency grows somewhat linearly with this for the prefill step [env: SEQUENCE_LENGTH=] [default: 10]\n",
" \u001b[1m-d\u001b[0m, \u001b[1m--decode-length\u001b[0m <DECODE_LENGTH>\n",
" This is how many tokens will be generated by the server and averaged out to give the `decode` latency. This is the *critical* number you want to optimize for LLM spend most of their time doing decoding [env: DECODE_LENGTH=] [default: 8]\n",
" \u001b[1m-r\u001b[0m, \u001b[1m--runs\u001b[0m <RUNS>\n",
" How many runs should we average from [env: RUNS=] [default: 10]\n",
" \u001b[1m-w\u001b[0m, \u001b[1m--warmups\u001b[0m <WARMUPS>\n",
" Number of warmup cycles [env: WARMUPS=] [default: 1]\n",
" \u001b[1m-m\u001b[0m, \u001b[1m--master-shard-uds-path\u001b[0m <MASTER_SHARD_UDS_PATH>\n",
" The location of the grpc socket. This benchmark tool bypasses the router completely and directly talks to the gRPC processes [env: MASTER_SHARD_UDS_PATH=] [default: /tmp/text-generation-server-0]\n",
" \u001b[1m--temperature\u001b[0m <TEMPERATURE>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TEMPERATURE=]\n",
" \u001b[1m--top-k\u001b[0m <TOP_K>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_K=]\n",
" \u001b[1m--top-p\u001b[0m <TOP_P>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_P=]\n",
" \u001b[1m--typical-p\u001b[0m <TYPICAL_P>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TYPICAL_P=]\n",
" \u001b[1m--repetition-penalty\u001b[0m <REPETITION_PENALTY>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: REPETITION_PENALTY=]\n",
" \u001b[1m--frequency-penalty\u001b[0m <FREQUENCY_PENALTY>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: FREQUENCY_PENALTY=]\n",
" \u001b[1m--watermark\u001b[0m\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: WATERMARK=]\n",
" \u001b[1m--do-sample\u001b[0m\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: DO_SAMPLE=]\n",
" \u001b[1m--top-n-tokens\u001b[0m <TOP_N_TOKENS>\n",
" Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_N_TOKENS=]\n",
" \u001b[1m-h\u001b[0m, \u001b[1m--help\u001b[0m\n",
" Print help (see more with '--help')\n",
" \u001b[1m-V\u001b[0m, \u001b[1m--version\u001b[0m\n",
" Print version\n"
]
}
],
"source": [
"!text-generation-benchmark -h"
]
},
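{
 "cell_type": "markdown",
 "id": "6d2a8b5f-1e97-4c30-9a4d-8b3f0c7e2d51",
 "metadata": {},
 "source": [
  "Note the `[env: ...]` hints in the help output above: most options can also be supplied via environment variables. A minimal sketch of what should be an equivalent invocation (values mirror the example below):\n",
  "```bash\n",
  "# Assumed equivalent to passing the flags directly, per the [env: ...] hints\n",
  "export TOKENIZER_NAME=astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit\n",
  "export SEQUENCE_LENGTH=3000\n",
  "export DECODE_LENGTH=300\n",
  "text-generation-benchmark --batch-size 1 --batch-size 2\n",
  "```"
 ]
},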
{
"cell_type": "markdown",
"id": "42d9561b-1aea-4c8c-9fe8-e36af43482fe",
"metadata": {},
"source": [
"Here is an example command. Notice that I add the batch sizes of interest repeatedly to make sure all of them are used by the benchmark tool.\n",
"```bash\n",
"\n",
"text-generation-benchmark \\\n",
"--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \\\n",
"--sequence-length 3000 \\\n",
"--decode-length 300 \\\n",
"--batch-size 1 \\\n",
"--batch-size 2 \\\n",
"--batch-size 3 \\\n",
"--batch-size 4 \\\n",
"--batch-size 5\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}