Any tutorials for running the model and checking the PPL (perplexity)?

#1, opened by LeiWang1999

Thanks!

FriendliAI org

This model checkpoint can only be used with Friendli Container. You can find the guide for pulling and running Friendli Container at https://docs.friendli.ai/guides/container/running_friendli_container.

To calculate the PPL, you need to send inference requests to the serving endpoint created by Friendli Container, using options like include_output_logprobs and forced_output_tokens.
forced_output_tokens makes the serving engine generate exactly your target tokens so that it computes their logprobs.
(https://docs.friendli.ai/openapi/create-completions)
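
For illustration, here is a minimal Python sketch of that flow. The endpoint URL, request field names, and response path below are assumptions modeled on the create-completions reference linked above, not verified against it; consult the API docs for the exact schema.

```python
import math
import requests

# Assumed endpoint; point this at wherever your Friendli Container is serving.
URL = "http://localhost:8000/v1/completions"

# Target token IDs whose logprobs we want (example values, not from a real tokenizer).
target_tokens = [29991, 2]

# Field names mirror the options mentioned above; the exact request schema
# is documented in the create-completions reference linked above.
payload = {
    "prompt": "Hello world",                 # context fed to the model
    "forced_output_tokens": target_tokens,   # engine generates exactly these tokens
    "include_output_logprobs": True,         # return a logprob for each generated token
    "max_tokens": len(target_tokens),
}

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
logprobs = resp.json()["choices"][0]["logprobs"]  # assumed response path

# Perplexity over the forced tokens: exponentiated mean negative log-likelihood.
ppl = math.exp(-sum(logprobs) / len(logprobs))
print(f"PPL over {len(logprobs)} tokens: {ppl:.3f}")
```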

Note that the Friendli engine executes the actual (autoregressive) generation process. That process comprises multiple steps, where each step computes the logprobs of a single next token.
This is different from, and slower than, feeding an entire sequence and computing the logprobs of all of its tokens in a single forward pass.
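
For reference, however the logprobs are produced, they combine into perplexity via the standard definition:

$$
\mathrm{PPL}(x_1, \ldots, x_N) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
$$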

My trial: the latest image, downloaded yesterday, says it doesn't recognize dtype fp8.
How do I actually load / run this?
I'm actually interested in running the 70B model, but there weren't any posts there.
I have 3x RTX 6000 Ada and CUDA 12.4, etc., so I should be good to go?

I'm looking to do high-throughput batch processing of biomedical text.

Thanks.

DUH. RTFM, as they used to say. Never mind, found it.

Actually, I haven't been able to get your fp8 example to work. Too bad.
