---
license: apache-2.0
language:
- ar
- hi
- id
pipeline_tag: text-generation
tags:
- multilingual
widget:
- text: 'في مدرستي السابقة'
  example_title: Arabic prompt
- text: 'आप समुद्री लुटेरों के बारे में क्या जानते हैं?'
  example_title: Hindi prompt
- text: 'Kucing saya suka'
  example_title: Indonesian prompt
---

# mGPT-quantized

This is an 8-bit quantized version of [mGPT](https://huggingface.co/ai-forever/mGPT), a 1.3B-parameter model released by AI-Forever / Sberbank AI in April 2022.

Within the GPT family, it is similar in parameter count to GPT2-XL, but it was trained on 60+ languages.

AI-Forever also released a 13B-parameter model. I made an 8-bit quantized version of that one as well, with weights available at [monsoon-nlp/mGPT-13B-quantized](https://huggingface.co/monsoon-nlp/mGPT-13B-quantized).

My goal is to evaluate this model on Arabic, Hindi, and Indonesian tasks, where there are fewer autoregressive language models in this size range.

For English, use a GPT model or LLaMa2-7B instead.

In August 2023, [AI-Forever](https://huggingface.co/ai-forever) added dedicated 1.3B-parameter models for about a third of mGPT's languages. If your language is Mongolian, for example, use `mGPT-1.3B-mongol` and not this model.
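
For convenience, here is a minimal inference sketch. It makes two assumptions not stated elsewhere on this card: that the repo id for these weights is `monsoon-nlp/mGPT-quantized`, and that the original mGPT tokenizer is reused, since the quantization step below only saves model weights. The prompt is the Indonesian widget example ("Kucing saya suka", "My cat likes"):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# the tokenizer comes from the original model; this repo stores only weights
tokenizer = AutoTokenizer.from_pretrained("ai-forever/mGPT")

# the weights were serialized in 8-bit, so the saved quantization config
# is picked up automatically (requires bitsandbytes and a CUDA GPU)
model = GPT2LMHeadModel.from_pretrained(
    "monsoon-nlp/mGPT-quantized",  # assumed repo id for this card
    device_map="auto",
)

inputs = tokenizer("Kucing saya suka", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```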

## How was the model created?

Quantization of mGPT 1.3B was done with the `bitsandbytes` library:

```python
import torch
from transformers import BitsAndBytesConfig, GPT2LMHeadModel

# 8-bit (LLM.int8()) quantization; the nf4 and double-quantization options
# only exist with a bnb_4bit_ prefix and do not apply in 8-bit mode
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

qmodel = GPT2LMHeadModel.from_pretrained(
    "ai-forever/mGPT",
    torch_dtype=torch.bfloat16,  # dtype for the modules left unquantized
    quantization_config=quantization_config,  # don't also pass load_in_8bit here
    device_map="auto",
)

qmodel.save_pretrained("model_name")
```
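
As a quick sanity check (my suggestion, not part of the original recipe), `get_memory_footprint()` reports the model's in-memory size. For 1.3B parameters, int8 should land around 1.5-2 GB, versus roughly 5 GB in fp32:

```python
# int8 linear layers plus some higher-precision modules: expect ~1.5-2 GB,
# compared with about 5 GB for 1.3B parameters in fp32
print(f"{qmodel.get_memory_footprint() / 1e9:.2f} GB")
```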

## Future steps

- mGPT could be further quantized to 4-bit, but `model.save_pretrained()` currently throws a `NotImplementedError` (see the sketch after this list).
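
For reference, the 4-bit load itself works; it is only the serialization step that fails. A sketch of the 4-bit configuration, where the `nf4` and double-quantization options actually apply (note the `bnb_4bit_` prefix):

```python
import torch
from transformers import BitsAndBytesConfig, GPT2LMHeadModel

# NF4 4-bit quantization: loading works, but calling save_pretrained()
# on the result currently raises NotImplementedError
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model_4bit = GPT2LMHeadModel.from_pretrained(
    "ai-forever/mGPT",
    quantization_config=quantization_config,
    device_map="auto",
)
```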