ctranslate2-4you committed on
Commit
4ce6750
1 Parent(s): d6ba0d1

Update README.md

Files changed (1)
  1. README.md +116 -1
README.md CHANGED

---
base_model:
- mistralai/Mistral-Small-Instruct-2409
---

# Mistral-Small-Instruct CTranslate2 Model

This repository contains a CTranslate2 version of [Mistral-Small-Instruct-2409](https://huggingface.co/mistralai/Mistral-Small-Instruct-2409). The original model was quantized with AWQ and then converted to the CTranslate2 format.
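
As a rough illustration of the second step, an already-AWQ-quantized Transformers checkpoint can be converted with CTranslate2's `TransformersConverter`. This is a minimal sketch, not the exact command used for this repository; the paths and the choice of `copy_files` are assumptions.

```
import ctranslate2

# Path to the AWQ-quantized Hugging Face checkpoint (placeholder).
awq_checkpoint = "Mistral-Small-Instruct-2409-AWQ"

# Convert to the CTranslate2 format. No additional quantization is requested
# here because the AWQ weights are already 4-bit.
converter = ctranslate2.converters.TransformersConverter(
    awq_checkpoint,
    copy_files=["tokenizer.json", "tokenizer_config.json"],  # assumed tokenizer files
)
converter.convert("Mistral-Small-Instruct-2409-AWQ-ct2", force=True)
```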

## Quantization Parameters

The following AWQ parameters were used:

```
zero_point=true
q_group_size=128
w_bit=4
version=gemv
```

## Quantization Process

The quantization was performed with the [AutoAWQ library](https://casper-hansen.github.io/AutoAWQ/examples/). AutoAWQ supports two quantization approaches:

1. **Without calibration data**
   - Quick (a few minutes)
   - Uses the standard quantization scheme
   - Suitable for general use cases

2. **With calibration data** (the approach used here; see the sketch after the Calibration Details section)
   - Slower (3-4 hours on an RTX 4090)
   - Protects the weights that matter most for the calibration data
   - Slightly better performance on the targeted tasks

## Calibration Details

This model was quantized with calibration data: the [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) dataset, which suits general question answering and instruction following.

Key parameters:
- `max_calib_seq_len`: 8192 (lets the calibration cover long-form responses)
- `text_token_length`: 2048 (minimum number of tokens a calibration sample must contain)

These parameters don't fundamentally alter the model's architecture, but they tune its behavior toward the input/output length patterns and topic domains represented in the calibration data.
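
For reference, a calibrated quantization run with these settings might look roughly like the sketch below. This is a minimal, assumed workflow based on the AutoAWQ examples, not the exact script used for this model: the output path is a placeholder, the `text_token_length` filter is implemented here as a custom pre-filter (it is not a standard AutoAWQ argument), and `"GEMV"` is the uppercase spelling AutoAWQ's examples use for the `gemv` kernel.

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

model_path = "mistralai/Mistral-Small-Instruct-2409"
quant_path = "Mistral-Small-Instruct-2409-AWQ"  # placeholder output directory

# Matches the parameters listed above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build calibration samples from cosmopedia-100k, keeping only texts of at least
# 2048 tokens (assumed interpretation of the text_token_length parameter above).
dataset = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
calib_data = []
for sample in dataset:
    if len(tokenizer(sample["text"]).input_ids) >= 2048:
        calib_data.append(sample["text"])
    if len(calib_data) >= 128:  # AutoAWQ's default number of calibration samples
        break

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_data,
    max_calib_seq_len=8192,
)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```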

## Requirements

- `torch` 2.2.2
- `ctranslate2` 4.4.0

NOTE: The upcoming `ctranslate2` 4.5.0 release will support `torch` versions newer than 2.2.2. These instructions will be updated when that release is available.
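
A quick way to confirm that the installed versions match the pins above (a convenience check, not part of the original instructions):

```
import torch
import ctranslate2

# Expect 2.2.2 and 4.4.0 per the requirements above.
print("torch:", torch.__version__)
print("ctranslate2:", ctranslate2.__version__)
```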

## Sample Script

```
import os
import gc

import ctranslate2
import torch
from transformers import AutoTokenizer

system_message = "You are a helpful person who answers questions."
user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"

model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Small-Instruct-2409-AWQ-ct2-awq"  # uses ~13.8 GB

beam_size = 1


def build_prompt_mistral_small():
    # Mistral instruction format: system and user messages inside a single [INST] block.
    prompt = f"""<s>
[INST] {system_message}

{user_message}[/INST]"""
    return prompt


def main():
    model_name = os.path.basename(model_dir)
    print(f"\033[32mLoading the model: {model_name}...\033[0m")

    # Leave a few CPU cores free for the rest of the system.
    intra_threads = max(os.cpu_count() - 4, 4)

    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",
        # compute_type="int8_bfloat16",  # NOTE: do not set compute_type for AWQ/CTranslate2 models
        intra_threads=intra_threads,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)

    prompt = build_prompt_mistral_small()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

    print(f"\nRun 1 (Beam Size: {beam_size}):")

    results_batch = generator.generate_batch(
        [tokens],
        include_prompt_in_result=False,
        max_batch_size=4096,
        batch_type="tokens",
        beam_size=beam_size,
        num_hypotheses=1,
        max_length=512,
        sampling_temperature=0.0,
    )

    output = tokenizer.decode(results_batch[0].sequences_ids[0])

    print("\nGenerated response:")
    print(output)

    # Release GPU memory before exiting.
    del generator
    del tokenizer
    torch.cuda.empty_cache()
    gc.collect()


if __name__ == "__main__":
    main()
```
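
If token-by-token output is preferred over waiting for the full batch result, the CTranslate2 generator can also stream tokens. The sketch below is an assumed variation of the script above (it reuses the `generator`, `tokenizer`, and `tokens` from `main()`); the argument and attribute names follow my reading of the `generate_tokens` API and may differ slightly across CTranslate2 versions.

```
# Streaming variant: iterate over tokens as they are generated.
step_results = generator.generate_tokens(
    tokens,
    max_length=512,
    sampling_temperature=0.0,
)

for step in step_results:
    # SentencePiece pieces mark word boundaries with a leading "▁".
    print(step.token.replace("▁", " "), end="", flush=True)
    if step.is_last:
        break
print()
```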