Benchmark results compared to the original FP8 / INT4 quants, etc.?
Can you evaluate this FP4 quant? How does it compare to others out there?
Do we have a perplexity (PPL) score for the FP4 data type? Thank you very much.
Yes, we can evaluate this FP4 model; see this post for MMLU and cost per token. For more benchmark results, please stay tuned.
@zhiyucheng Thank you. I noticed the phrase "99.8% of FP8 on MMLU", but I didn't see that result collected on any official leaderboard. Usually in lm-evaluation-harness runs, we report a weighted score over the major tasks, not only general-knowledge testing.
For R1, reasoning is a very important aspect we would like to see measured.
It would be great to have the lm-evaluation-harness version used, snapshots of the evaluation runs, and the FP4 conversion scripts and methods.
I have always cared about FP4, not only because FP4 multiplication may be dramatically different from FP8 (only 16 representable values), but also because of the potential speedup on the current architecture. How would it look with WGMMA on Hopper tensor cores? A lot of questions.
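Since a 4-bit float has only 16 code points, the entire value set can be enumerated directly. A minimal sketch, assuming the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1, as in the OCP microscaling element format); the function name is mine, not from any library:

```python
def fp4_e2m1_value(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11          # 2-bit exponent field
    man = code & 1                    # 1-bit mantissa field
    if exp == 0:                      # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

# All 16 codes; the positive half is 0, 0.5, 1, 1.5, 2, 3, 4, 6
values = [fp4_e2m1_value(c) for c in range(16)]
```

With only 15 distinct values (±0 coincide), an FP4 multiply could in principle be handled by tiny lookup tables rather than a full multiplier, which is part of why the hardware story differs so much from FP8.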
Hello @llama-ultra-producer ,
Thank you for your interest and for all the questions!
If you're looking for more details on FP4 quantization, please take a look at our GitHub page: https://github.com/NVIDIA/TensorRT-Model-Optimizer
We are open source, so feel free to dig in if you're interested, and don't hesitate to try things out.
Please stay tuned for more detailed benchmark results.