Qwen/Qwen2-72B-Instruct-GPTQ-Int4

Text Generation

Inference Endpoints

text-generation-inference

4-bit precision


feihu.hf committed 27 days ago

Commit 09af798 (1 parent: 09b3178)

update README.md

Files changed (1)

README.md +6 -0

@@ -130,6 +130,12 @@ Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).
 **Note**: Currently, vLLM supports only static YaRN, which means the scaling factor stays constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when long-context processing is required.
+## Benchmark and Speed
+To compare the generation performance of bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our [Benchmark of Quantized Models](https://qwen.readthedocs.io/en/latest/benchmark/quantization_benchmark.html). This benchmark shows how different quantization techniques affect model performance.
+For inference speed and memory consumption when deploying these models with either ``transformers`` or ``vLLM``, see our [Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
 ## Citation
 If you find our work helpful, feel free to cite us.