QwQ-32B-Preview AWQ 4-Bit Quantized - AIMO Early Sharing Prize Winner Solution Model

Introduction

This model is slightly modified from QwQ-32B-Preview-AWQ, the AWQ-quantized version of the QwQ-32B-Preview model by Qwen; the quantization was prepared by Kirill Rybkin. It is the model used in the Early Sharing Prize notebook for the AI Mathematical Olympiad - Progress Prize 2, and this repository provides that AWQ 4-bit quantized version. Quantization significantly reduces memory usage and computational requirements, making the model suitable for deployment on hardware with limited resources.

Note: This quantized model requires approximately 20 GB of VRAM to run effectively.

QwQ-32B-Preview is an experimental research model aimed at advancing AI reasoning capabilities, particularly in mathematics and coding tasks. While it shows promising analytical abilities, it has several important limitations:

  • Language Mixing and Code Switching: The model may unexpectedly switch between languages or mix them, affecting the clarity of responses.
  • Recursive Reasoning Loops: There's a possibility of the model entering circular reasoning patterns, leading to lengthy responses without conclusive answers.
  • Safety and Ethical Considerations: Enhanced safety measures are needed to ensure reliable and secure performance. Users should exercise caution when deploying the model.
  • Performance Limitations: While excelling in math and coding, the model may underperform in areas like common sense reasoning and nuanced language understanding.

Steps to deploy the solution to Inference Endpoints (dedicated)

Use this route if you want to try the approach from my Kaggle notebook but are not comfortable with coding.

Once you have started a container on the dedicated endpoint, set the following values in the Parameters field so that the model's performance stays close to that of the Kaggle notebook.

With OpenAI API selected:

  • Top P = 1
  • Temperature = 1
  • Max Tokens = 32768

Without OpenAI API:

  • Top K = 50
  • Top P = 1
  • Temperature = 1
  • Max Tokens = 32768
  • Do Sample = True

It is better to use the OpenAI API, since it provides better backend optimization and handles any errors that occur.

Other values should be left at default values.
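
For example, once the endpoint is running with the OpenAI API option, it can be queried from Python with the OpenAI client. This is a minimal sketch, not part of the original notebook: the endpoint URL and token are placeholders, and "tgi" is the placeholder model name that Text Generation Inference backed endpoints typically accept.

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1/",  # placeholder endpoint URL
    api_key="hf_xxx",  # placeholder Hugging Face token
)

response = client.chat.completions.create(
    model="tgi",  # placeholder name accepted by TGI-backed endpoints
    messages=[
        {"role": "system", "content": "You are a helpful assistant developed by Alibaba. Please reason step by step."},
        {"role": "user", "content": "What is 100 * 120 / 150?"},
    ],
    top_p=1,
    temperature=1,
    max_tokens=32768,
)
print(response.choices[0].message.content)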

Note: The container in which the model runs must allow a maximum token count slightly higher than the max_token value you set in the parameters (since some tokens are reserved for other purposes). So, if you want to use a max_token of 32768 as a parameter, set the container value to 33000. It is best to do this when you create the endpoint: select "Container Configuration" and set the "Max Number of Tokens (per Query)" field to 33000. You can change this after the container has been created by updating the container, but this is buggy and the container sometimes consistently fails to update. Also make sure, when creating the endpoint, that you have selected a GPU with sufficient VRAM.


Requirements

Ensure you are using the latest version of Hugging Face Transformers, as the code for Qwen2.5 is integrated there. Using a version earlier than 4.37.0 may result in the following error:

KeyError: 'qwen2'
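
A quick way to check that your installed version is recent enough (a minimal sketch):

# pip install -U "transformers>=4.37.0"
import transformers
print(transformers.__version__)  # should be 4.37.0 or later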

Quickstart

Here's how to load the tokenizer and model, and generate content using the quantized model:


#pip install autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MBMMurad/QwQ-32B-preview-AWQ-AIMO-earlysharing"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Three airline companies operate flights from Dodola island. Each company has a different schedule of departures. The first company departs every 100 days, the second every 120 days and the third every 150 days. What is the greatest positive integer $d$ for which it is true that there will be $d$ consecutive days without a flight from Dodola island, regardless of the departure times of the various airlines?"
messages = [
    {"role": "system", "content": "You are a helpful assistant developed by Alibaba. Please reason step by step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

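# Generate up to 512 new tokens from the chat-formatted prompt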
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
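# Keep only the newly generated tokens by stripping the prompt tokens from each output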
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

To get a better response, with results closer to those from the Kaggle notebook, adjust the generation call:

model.generate( **model_inputs, max_new_tokens=512 )

Enable sampling in this call by setting the following parameters (see the combined example after this list):

  • do_sample = True
  • temperature = 1
  • top_k = 50
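
For example, the full call then looks like this (a sketch; the values mirror the list above):

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_k=50
)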

Setting max_new_tokens to 32768 (4096*8) improves performance further, but inference will take much longer. Faster inference engines (e.g. vLLM, TGI) can speed this up.
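
As an illustration (not part of the original notebook), here is a minimal sketch of running the same model with vLLM; the quantization and max_model_len arguments are assumptions and may need tuning for your GPU:

# pip install vllm
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "MBMMurad/QwQ-32B-preview-AWQ-AIMO-earlysharing"
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant developed by Alibaba. Please reason step by step."},
    {"role": "user", "content": "What is the greatest common divisor of 100, 120 and 150?"},
]
# Build the chat-formatted prompt exactly as in the Quickstart above
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# max_model_len leaves headroom above max_tokens, mirroring the container note above (assumption)
llm = LLM(model=model_name, quantization="awq", max_model_len=33000)
sampling_params = SamplingParams(temperature=1.0, top_p=1.0, top_k=50, max_tokens=32768)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)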

For optimal performance, it is suggested to use the Kaggle notebook mentioned above.

Original Model

For more details about the original QwQ-32B-Preview model, please refer to the following resource:

https://huggingface.co/Qwen/QwQ-32B-Preview


Citation

If you find the original model helpful, please consider citing the original authors as well as the Kaggle notebook on which this model is based:

@misc{qwq-32b-preview,
    title = {QwQ: Reflect Deeply on the Boundaries of the Unknown},
    url = {https://qwenlm.github.io/blog/qwq-32b-preview/},
    author = {Qwen Team},
    month = {November},
    year = {2024}
}

@article{qwen2,
      title={Qwen2 Technical Report}, 
      author={An Yang and Baosong Yang and others},
      journal={arXiv preprint arXiv:2407.10671},
      year={2024}
}

@misc{Murad2024earlysharingprize,
    author       = {Md Boktiar Mahbub Murad},
    title        = {QwQ-32B-preview Optimized Inference Early Sharing Prize Winner},
    howpublished = {\url{https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference}},
    month        = {December},
    year         = {2024},
    note         = {More ain't always better}
}