Update README.md
README.md
CHANGED
@@ -15,10 +15,18 @@ For more details about SwiftKV and how to use it:
 
 ## Performance Metrics
 
-
+To evaluate SwiftKV’s performance, we focus on the following key metrics:
+* Combined throughput: The total number of input and output tokens processed per second. This determines:
+For batch processing, the time required to complete jobs.
+For interactive use, the volume of concurrent requests a system can handle.
+* TTFT: The latency between a user request and receiving the first token in the response.
+* TPOT: The latency between subsequent tokens after the first token.
+
+Combined input and output throughput for Llama 3.1 405B across a range of input lengths. Blue is baseline FP8 and pink is SwiftKV FP8.
 <img src="figure-4.png" alt="performance plot of llama-405B w. swiftkv" width="400">
-Legend: blue - baseline FP8, pink - SwiftKV FP8<br>
 
+TTFT (top) and TPOT (bottom) for input lengths 2000 (left), 8000 (middle), and 32000 (right) for the Llama 3.1 405B FP8 model. For each experiment, a range of different request arrival rates is simulated. Each request generates 256 output tokens.
+<img src="figure-6.png" alt="performance plot of llama-405B w. swiftkv" width="700">
 
 
 ## Eval Metrics
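The three metrics named in this README change (combined throughput, TTFT, TPOT) can be made concrete with a small sketch. This is not the benchmark code behind the figures. The `RequestTrace` record and its field names are hypothetical, and it assumes each request logs its arrival time and the wall-clock time at which every output token was received.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request log; the field names are illustrative only."""
    arrival: float              # time the request was sent, in seconds
    token_times: List[float]    # time each output token was received, in seconds
    num_input_tokens: int       # prompt length in tokens


def ttft(trace: RequestTrace) -> float:
    """Time to first token: latency from the request to the first output token."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def combined_throughput(traces: List[RequestTrace]) -> float:
    """Input plus output tokens processed per second over the whole run."""
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(t.num_input_tokens + len(t.token_times) for t in traces)
    return total_tokens / (end - start)
```

Averaging the gaps after the first token is one common reading of TPOT. A benchmark could equally report a median or a percentile, and the README text does not say which was used.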