ctranslate2-4you/Mistral-Small-Instruct-2409-ct2-AWQ

ctranslate2-4you committed on Oct 22, 2024

Commit 4ce6750 • 1 Parent(s): d6ba0d1

Update README.md

Files changed (1)

README.md +116 -1

@@ -1,4 +1,119 @@
 ---
 base_model:
 - mistralai/Mistral-Small-Instruct-2409
----
+---
+# Mistral-Small-Instruct CTranslate2 Model
+This repository contains a CTranslate2 version of the [Mistral-Small-Instruct model](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409). The conversion process involved AWQ quantization followed by CTranslate2 format conversion.
+## Quantization Parameters
+The following AWQ parameters were used:
+- ```zero_point=true```
+- ```q_group_size=128```
+- ```w_bit=4```
+- ```version=gemv```
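To make the parameters above more concrete, here is a minimal pure-Python sketch of group-wise zero-point quantization with `w_bit=4` and `q_group_size=128`. It illustrates the general idea only and is not AutoAWQ's actual implementation: the function names are hypothetical, the activation-aware scaling that gives AWQ its name is left out, and `version=gemv` (which, as far as I understand, selects the kernel and packing layout) is not modeled.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit integers
    w_min, w_max = min(weights), max(weights)
    scale = max((w_max - w_min) / q_max, 1e-5)  # guard against constant groups
    # Integer offset that maps w_min close to 0, clamped to the integer range.
    zero_point = min(max(round(-w_min / scale), 0), q_max)
    q = [min(max(round(w / scale) + zero_point, 0), q_max) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split one weight row into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Illustrative data: 256 values give two groups of 128.
row = [math.sin(i * 0.37) for i in range(256)]
groups = quantize_row(row)
```

Each group stores 128 four-bit integers plus one scale and one zero point, which is where most of the memory saving comes from.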
+## Quantization Process
+The quantization was performed using the [AutoAWQ library](https://casper-hansen.github.io/AutoAWQ/examples/). AutoAWQ supports two quantization approaches:
+1. **Without calibration data**:
+   - Quick process (a few minutes)
+   - Uses standard quantization schema
+   - Suitable for general use cases
+2. **With calibration data**:
+   - Longer process (3-4 hours on RTX 4090)
+   - Uses activation statistics from the calibration data to protect the weights that matter most for the targeted tasks
+   - Slightly better performance for targeted tasks
+## Calibration Details
+This model was quantized with calibration data. Specifically, the [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) dataset was used, which suits general question answering and instruction following.
+Key parameters:
+- `max_calib_seq_len`: 8192 (enables long-form responses)
+- `text_token_length`: 2048 (minimum input token length during quantization)
+While these parameters don't fundamentally alter the model's architecture, they fine-tune its behavior for specific input-output length patterns and topic domains.
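The calibration setup described above could look roughly like the following sketch. It assumes a recent AutoAWQ release whose `quantize` method accepts `calib_data` and `max_calib_seq_len`, and it assumes that `text_token_length` is applied as a filter on the dataset column of the same name. The helper names and the `model_path` / `quant_path` arguments are hypothetical, and this is not necessarily the exact script used to produce this model.

```python
def select_calibration_texts(rows, min_token_length=2048):
    """Keep only samples whose recorded token length reaches the threshold."""
    return [row["text"] for row in rows if row["text_token_length"] >= min_token_length]


def quantize_with_calibration(model_path, quant_path):
    """Quantize a model with AWQ using long cosmopedia-100k samples (sketch)."""
    # Imported lazily so the helper above works without these packages installed.
    from awq import AutoAWQForCausalLM
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Same values as listed under "Quantization Parameters".
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    dataset = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
    calib_texts = select_calibration_texts(dataset, min_token_length=2048)

    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_texts,
        max_calib_seq_len=8192,
    )
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

The resulting AWQ checkpoint still has to be converted to the CTranslate2 format before it can be loaded with `ctranslate2.Generator`.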
+## Requirements
+- ```torch 2.2.2```
+- ```ctranslate2 4.4.0```
+- NOTE: The soon-to-be-released ```ctranslate2 4.5.0``` will support ```torch``` versions newer than 2.2.2. These instructions will be updated when that occurs.
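Since the requirements pin exact versions, a quick check before loading the model can save a confusing failure later. The sketch below uses only the standard library. The helper names are hypothetical, and comparing only the first three version numbers is a simplifying assumption.

```python
import re
from importlib import metadata

# Versions listed under "Requirements".
REQUIRED = {"torch": "2.2.2", "ctranslate2": "4.4.0"}


def version_tuple(version):
    """Turn a string such as '2.2.2+cu121' into (2, 2, 2), ignoring build suffixes."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def check_requirements(required=REQUIRED):
    """Return a list of human-readable problems; an empty list means all versions match."""
    problems = []
    for package, wanted in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (expected {wanted})")
            continue
        if version_tuple(installed) != version_tuple(wanted):
            problems.append(f"{package} {installed} found, expected {wanted}")
    return problems
```

Calling `check_requirements()` at the top of a script and printing any returned messages is one way to use it.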
+## Sample Script
+```python
+import os
+import ctranslate2
+import gc
+import torch
+from transformers import AutoTokenizer
+system_message = "You are a helpful person who answers questions."
+user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"
+model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Small-Instruct-2409-AWQ-ct2-awq" # uses ~13.8 GB
+def build_prompt_mistral_small():
+    prompt = f"""<s>
+[INST] {system_message}
+{user_message}[/INST]"""
+    return prompt
+def main():
+    model_name = os.path.basename(model_dir)
+    print(f"\033[32mLoading the model: {model_name}...\033[0m")
+    # os.cpu_count() can return None, so fall back to a default before subtracting.
+    intra_threads = max((os.cpu_count() or 8) - 4, 4)
+    generator = ctranslate2.Generator(
+        model_dir,
+        device="cuda",
+        # NOTE: do not pass compute_type (e.g. "int8_bfloat16") when using AWQ/CTranslate2 models
+        intra_threads=intra_threads
+    )
+    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)
+    prompt = build_prompt_mistral_small()
+    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
+    beam_size = 1  # greedy decoding; defined here so the message below can report it
+    print(f"\nRun 1 (Beam Size: {beam_size}):")
+    results_batch = generator.generate_batch(
+        [tokens],
+        include_prompt_in_result=False,
+        max_batch_size=4096,
+        batch_type="tokens",
+        beam_size=beam_size,
+        num_hypotheses=1,
+        max_length=512,
+        sampling_temperature=0.0,
+    )
+    output = tokenizer.decode(results_batch[0].sequences_ids[0])
+    print("\nGenerated response:")
+    print(output)
+    del generator
+    del tokenizer
+    torch.cuda.empty_cache()
+    gc.collect()
+if __name__ == "__main__":
+    main()
+```
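One detail of the sample script worth a second look is the prompt: it writes `<s>` by hand, while Hugging Face tokenizers for Mistral models usually add the beginning-of-sequence token themselves during `encode`, so the token may end up duplicated. The sketch below shows a variant that leaves `<s>` to the tokenizer. It assumes the usual Mistral instruct layout, with the system message prepended to the user message and separated by a blank line. Both helper names are hypothetical.

```python
def build_mistral_prompt(system_message, user_message):
    """Build a single-turn prompt without a literal <s>; the tokenizer is assumed to add it."""
    return f"[INST] {system_message}\n\n{user_message}[/INST]"


def to_ctranslate2_tokens(tokenizer, prompt):
    """Convert a prompt string into the token strings that generate_batch expects."""
    ids = tokenizer.encode(prompt)  # assumed to prepend the BOS token by default
    return tokenizer.convert_ids_to_tokens(ids)
```

Printing the first few entries returned by `to_ctranslate2_tokens` is an easy way to confirm that `<s>` appears exactly once with the tokenizer actually in use.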