MBMMurad/QwQ-32B-preview-AWQ-AIMO-earlysharing

	---
	license: apache-2.0
	base_model:
	- Qwen/QwQ-32B-Preview
	language:
	- en
	pipeline_tag: text-generation
	library_name: transformers
	---
	# QwQ-32B-Preview AWQ 4-Bit Quantized Version

	## Introduction

	This repository provides the AWQ 4-bit quantized version of the QwQ-32B-Preview model, originally developed by the Qwen Team. The quantized model significantly reduces memory usage and computational requirements, making it suitable for deployment on hardware with limited resources.

	Note: This quantized model requires approximately 20 GB of VRAM to run effectively.
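The 20 GB figure can be sanity-checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: about 32.5 billion parameters, and a quantization group size of 128 with one 16-bit scale and one 4-bit zero point per group. Neither figure is stated in this card, and the function name `awq_weight_memory_gb` is hypothetical. It counts weights only, so the gap up to roughly 20 GB would be taken by unquantized layers, activations, and the KV cache.

```python
def awq_weight_memory_gb(
    num_params: float = 32.5e9,   # assumed parameter count, not stated in this card
    weight_bits: int = 4,         # 4-bit quantized weights
    group_size: int = 128,        # assumed quantization group size
    scale_bits: int = 16,         # assumed: one fp16 scale per group
    zero_bits: int = 4,           # assumed: one 4-bit zero point per group
) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    # Per-weight cost = the weight itself plus its share of the group metadata.
    bits_per_weight = weight_bits + (scale_bits + zero_bits) / group_size
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    print(f"Estimated weight memory: {awq_weight_memory_gb():.2f} GB")
```

Under these assumptions the weights alone come to just under 17 GB, which is consistent with a total of about 20 GB once runtime overhead is included.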

	QwQ-32B-Preview is an experimental research model aimed at advancing AI reasoning capabilities, particularly in mathematics and coding tasks. While it shows promising analytical abilities, it has several important limitations:

	- Language Mixing and Code Switching: The model may unexpectedly switch between languages or mix them, affecting the clarity of responses.
	- Recursive Reasoning Loops: There's a possibility of the model entering circular reasoning patterns, leading to lengthy responses without conclusive answers.
	- Safety and Ethical Considerations: Enhanced safety measures are needed to ensure reliable and secure performance. Users should exercise caution when deploying the model.
	- Performance Limitations: While excelling in math and coding, the model may underperform in areas like common sense reasoning and nuanced language understanding.
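The first two limitations can be screened for after generation with simple text heuristics. The following is a minimal sketch, not part of the model or of Transformers: the helper names, the 8-word window, and the repeat threshold are all illustrative assumptions, and the language check only looks for CJK ideographs in a response that is expected to be English.

```python
import re
from collections import Counter

# CJK Unified Ideographs block; a crude signal of language mixing
# when the response is expected to be in English.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def contains_cjk(text: str) -> bool:
    """Return True if the text contains at least one CJK ideograph."""
    return bool(_CJK.search(text))


def max_ngram_repeats(text: str, n: int = 8) -> int:
    """Return how often the most frequent n-word sequence occurs in the text."""
    words = text.split()
    if len(words) < n:
        return 0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(counts.values())


def looks_like_loop(text: str, n: int = 8, min_repeats: int = 3) -> bool:
    """Heuristic: the same n-word sequence repeated many times suggests a loop."""
    return max_ngram_repeats(text, n) >= min_repeats
```

A flagged response could be regenerated or truncated; the thresholds would need tuning on real outputs.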

	---

	## Requirements

	Use a recent version of Hugging Face Transformers (4.37.0 or later), because the code for Qwen2.5 is integrated there. With an earlier version, loading the model may fail with the following error:

	```plaintext
	KeyError: 'qwen2'
	```
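One way to catch this before loading the model is to compare the installed version against the minimum. The sketch below uses only the standard library; the helper names are hypothetical, and the parser is a simplification that ignores pre-release suffixes such as `.dev0`.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.37.0"  # minimum version named in this section


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release of a version string, padded to 3 parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # "4.37" compares like "4.37.0"
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_transformers() -> None:
    """Print whether the installed transformers package is recent enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
        return
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MIN_TRANSFORMERS}"
    print(f"transformers {installed}: {status}")


if __name__ == "__main__":
    check_transformers()
```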

	---

	## Quickstart

	Here's how to load the tokenizer and model, and generate content using the quantized model:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "KirillR/QwQ-32B-Preview-AWQ"

	# Load the quantized weights; device_map="auto" places them on the available GPU(s).
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	prompt = "How many 'r's are in 'strawberry'?"
	messages = [
	    {"role": "system", "content": "You are a helpful assistant developed by Alibaba. Please think step-by-step."},
	    {"role": "user", "content": prompt},
	]

	# Render the conversation with the model's chat template and append the assistant turn marker.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	generated_ids = model.generate(
	    **model_inputs,
	    max_new_tokens=1024,
	)

	# generate() returns prompt + completion; keep only the newly generated tokens.
	generated_ids = [
	    output_ids[len(input_ids):]
	    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]

	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

	print(response)
	```
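The slicing step in the Quickstart is easy to misread, so here it is in isolation. For a decoder-only model, `generate` returns each prompt followed by its completion, and the list comprehension drops the first `len(input_ids)` tokens of every sequence. The sketch below reproduces that logic with plain Python lists; the token ids are made up and `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(input_ids, output_ids):
    """Drop the prompt tokens from each generated sequence.

    input_ids:  one list of prompt token ids per batch item
    output_ids: one list per batch item, each holding prompt + completion
    """
    return [
        out[len(inp):]  # keep only what comes after the prompt
        for inp, out in zip(input_ids, output_ids)
    ]


if __name__ == "__main__":
    prompts = [[101, 7, 8, 9]]               # made-up prompt token ids
    outputs = [[101, 7, 8, 9, 42, 43, 2]]    # prompt followed by three new tokens
    print(strip_prompt(prompts, outputs))
```

Without this step, decoding `generated_ids` would echo the system and user messages back at the start of the response.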

	---

	## Original Model

	For more details about the original QwQ-32B-Preview model, please refer to the following resource:

	https://huggingface.co/Qwen/QwQ-32B-Preview


	---

	## Citation

	If you find the original model helpful, please consider citing the original authors:

	```bibtex
	@misc{qwq-32b-preview,
	    title = {QwQ: Reflect Deeply on the Boundaries of the Unknown},
	    url = {https://qwenlm.github.io/blog/qwq-32b-preview/},
	    author = {Qwen Team},
	    month = {November},
	    year = {2024}
	}

	@article{qwen2,
	    title = {Qwen2 Technical Report},
	    author = {An Yang and Baosong Yang and others},
	    journal = {arXiv preprint arXiv:2407.10671},
	    year = {2024}
	}
	```