Snowflake/Llama-3.1-SwiftKV-8B-Instruct-FP8

compressed-tensors

jeffra committed on Dec 5, 2024

Commit cda426c (verified) · 1 Parent(s): 7399ed7

Update README.md

Files changed (1)

README.md +2 -2

@@ -22,8 +22,8 @@ To evaluate SwiftKV’s performance, we focus on the following key metrics (see
 * TTFT: The latency between a user request and receiving the first token in the response.
 * TPOT: The latency between subsequent tokens after the first token.
-Combined input and output throughput for Llama 3.1 405B across a range of input lengths. Blue is baseline FP8 and Ping is SwiftKV FP8.
-<img src="figure-4.png" alt="performance plot of llama-405B w. swiftkv" width="400">
+Combined input and output throughput for Llama 3.1 70B (left) and Llama 3.1 405B (right) across a range of input lengths (bottom).
+<img src="figure-4-full.png" alt="performance plot of llama-405B w. swiftkv" width="800">
 TTFT (top) and TPOT (bottom) for input lengths 2000 (left), 8000 (middle), and 32000 (right) for Llama 3.1 405B fp8 model. For each experiment, a range of different request arrival rates is simulated. Each request generates 256 output tokens.
 <img src="figure-6.png" alt="performance plot of llama-405B w. swiftkv" width="700">