# Mistral-Small-Instruct CTranslate2 Model

This repository contains a CTranslate2 version of the [Mistral-Small-Instruct model](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409). The conversion process involved AWQ quantization followed by CTranslate2 format conversion.
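
The CTranslate2 conversion step itself is not shown below; for reference, a conversion along the lines of the sketch here would typically be done with CTranslate2's standard Transformers converter. This is a sketch only: the paths are placeholders, and the exact options used for this repository's AWQ checkpoint may differ.

```python
import ctranslate2

# Convert an AWQ-quantized Hugging Face checkpoint to the CTranslate2 format.
# "path/to/awq-model" and "path/to/ct2-output" are placeholders, not the actual paths used.
converter = ctranslate2.converters.TransformersConverter(
    "path/to/awq-model",
    copy_files=["tokenizer.json", "tokenizer_config.json"],  # adjust to the tokenizer files present in the checkpoint
)
converter.convert("path/to/ct2-output")
```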

## Quantization Parameters

The following AWQ parameters were used:
```
zero_point=true
q_group_size=128
w_bit=4
version=gemv
```
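
In AutoAWQ's Python API these settings are normally passed as a `quant_config` dictionary; the equivalent mapping would look roughly like this (a sketch, not the original conversion script):

```python
# Sketch of the equivalent AutoAWQ quant_config; AutoAWQ's examples write the version as "GEMV".
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}
```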

## Quantization Process

The quantization was performed using the [AutoAWQ library](https://casper-hansen.github.io/AutoAWQ/examples/). AutoAWQ supports two quantization approaches (both are sketched below):

1. **Without calibration data**:
   - Quick process (a few minutes)
   - Uses the standard quantization schema
   - Suitable for general use cases

2. **With calibration data**:
   - Longer process (3-4 hours on an RTX 4090)
   - Preserves full precision for task-specific weights
   - Slightly better performance on targeted tasks
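
A minimal sketch of the first approach (no custom calibration data), following the AutoAWQ examples linked above; the model and output paths are placeholders, not the exact script used for this repository:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-Instruct-2409"
quant_path = "Mistral-Small-Instruct-2409-AWQ"  # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}

# Load the full-precision model and tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize using AutoAWQ's built-in default calibration set (no custom calibration data).
model.quantize(tokenizer, quant_config=quant_config)

# Save the AWQ checkpoint that is later converted to CTranslate2.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```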

## Calibration Details

This model was quantized with calibration data. Specifically, the [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) dataset was used, which is well suited to general question answering and instruction following.

Key parameters:
- `max_calib_seq_len`: 8192 (enables long-form responses)
- `text_token_length`: 2048 (minimum input token length during quantization)

While these parameters don't fundamentally alter the model's architecture, they fine-tune its behavior for specific input-output length patterns and topic domains.
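
A sketch of how the calibrated run might look with these settings. The dataset handling, and in particular the interpretation of `text_token_length` as a minimum-length filter on calibration samples, is an assumption rather than the exact script used:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

model_path = "mistralai/Mistral-Small-Instruct-2409"
quant_path = "Mistral-Small-Instruct-2409-AWQ-calibrated"  # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}

# Build calibration texts from cosmopedia-100k. The text_token_length=2048 setting is
# interpreted here as a minimum-length filter on the samples (approximated with a
# whitespace split); this is an assumption about the original workflow.
dataset = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
calib_texts = [row["text"] for row in dataset if len(row["text"].split()) >= 2048][:128]

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# max_calib_seq_len=8192 allows long calibration sequences, matching the parameter above.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_texts,
    max_calib_seq_len=8192,
)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```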

## Requirements

- `torch` 2.2.2
- `ctranslate2` 4.4.0
- NOTE: The soon-to-be-released `ctranslate2` 4.5.0 will support `torch` versions greater than 2.2.2. These instructions will be updated when that occurs.
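
A quick sketch for confirming that the installed versions match the requirements above:

```python
import torch
import ctranslate2

# Expect torch 2.2.2 and ctranslate2 4.4.0 until ctranslate2 4.5.0 is released.
print("torch:", torch.__version__)
print("ctranslate2:", ctranslate2.__version__)
```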

## Sample Script

```python
import os
import gc

import ctranslate2
import torch
from transformers import AutoTokenizer

system_message = "You are a helpful person who answers questions."
user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"

model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Small-Instruct-2409-AWQ-ct2-awq"  # uses ~13.8 GB


def build_prompt_mistral_small():
    # Build the Mistral instruction prompt by hand.
    prompt = f"""<s>
[INST] {system_message}

{user_message}[/INST]"""

    return prompt


def main():
    model_name = os.path.basename(model_dir)

    print(f"\033[32mLoading the model: {model_name}...\033[0m")

    intra_threads = max(os.cpu_count() - 4, 4)

    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",
        # compute_type="int8_bfloat16",  # NOTE: do NOT set compute_type when using AWQ/CTranslate2 models
        intra_threads=intra_threads
    )

    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)

    prompt = build_prompt_mistral_small()

    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

    beam_size = 1

    print(f"\nRun 1 (Beam Size: {beam_size}):")

    results_batch = generator.generate_batch(
        [tokens],
        include_prompt_in_result=False,
        max_batch_size=4096,
        batch_type="tokens",
        beam_size=beam_size,
        num_hypotheses=1,
        max_length=512,
        sampling_temperature=0.0,
    )

    output = tokenizer.decode(results_batch[0].sequences_ids[0])

    print("\nGenerated response:")
    print(output)

    # Free GPU memory once generation is finished.
    del generator
    del tokenizer
    torch.cuda.empty_cache()
    gc.collect()


if __name__ == "__main__":
    main()
```
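
As an alternative to hand-building the prompt string, the tokenizer's chat template can be used. This is a sketch only: it assumes the model directory still contains the original `tokenizer_config.json`, and it folds the system text into the user turn in case the model's template does not accept a separate system role.

```python
# Hypothetical alternative to build_prompt_mistral_small(), using the script's
# existing `tokenizer`, `system_message`, and `user_message` variables.
messages = [{"role": "user", "content": f"{system_message}\n\n{user_message}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```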
