Does Hugging Face have special handling in deploying models? Why does the model perform so well compared to the test results published by others?

#3
by ansir - opened

Hugging Face H4 org

Hi @ansir are you referring to the leaderboard? Our deployments are decoupled from evaluation, with the latter run via EleutherAI's evaluation harness - are there some specific results that concern you?
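For anyone wanting to reproduce the leaderboard numbers themselves, here is a minimal sketch using the harness's Python API; the entry point has changed across versions, and the task, dtype, and batch size here are illustrative assumptions, not the leaderboard's exact settings:

```python
# Minimal sketch of scoring a model with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). Task choice, dtype, and batch size are illustrative
# assumptions, not the leaderboard's actual configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # use the Hugging Face transformers backend
    model_args="pretrained=tiiuae/falcon-40b-instruct,dtype=bfloat16",
    tasks=["hellaswag"],
    batch_size=4,
)
print(results["results"])
```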

What's the Space configuration that's being used? Generation is faster here than elsewhere.

@lewtun In other people's published examples, Falcon was deployed and tested on an A6000, and the results were not very satisfactory. I'm curious: does the HF team use more powerful machines for model deployment, or are there other optimization techniques involved?

I think @ansir means generation speed. Falcon-40B on the Hugging Face Space can achieve 18 TPS (tokens per second), but many other users (https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ and us) have observed that it is very slow, only about 0.7 TPS.
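For a like-for-like comparison, here is a rough sketch of how one might measure tokens/second with plain transformers; the model id, prompt, and generation settings are assumptions, not the Space's actual stack:

```python
# Rough throughput benchmark (tokens/second) with plain transformers.
# Model id, prompt, and generation settings are illustrative assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision; fp32 would not fit in 48 GB
    device_map="auto",           # spread layers across all available GPUs
    trust_remote_code=True,      # Falcon shipped custom modeling code at the time
)

inputs = tokenizer(
    "Explain GPU memory bandwidth in one paragraph.", return_tensors="pt"
).to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/second")
```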

I see this Space was using 2x A100 GPUs, so it's expected to be fast.

Meanwhile, with a single RTX A6000 (48 GB VRAM) I only get 1-2 tokens/second.
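If the goal is just to fit Falcon-40B on a single 48 GB card, one option is 4-bit quantization via bitsandbytes; the sketch below is an assumption about a workable single-GPU setup, not how the Space is deployed, and quantization trades away some speed and quality:

```python
# Sketch of loading Falcon-40B in 4-bit via bitsandbytes so it fits on a
# single 48 GB GPU. This is an assumed workaround, not the Space's setup.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```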
