Text Generation
Transformers
Safetensors
Thai
English
qwen2
text-generation-inference
sft
trl
4-bit precision
bitsandbytes
LoRA
Fine-Tuning with LoRA
LLM
GenAI
NT GenAI
ntgenai
lahnmah
NT Thai GPT
ntthaigpt
medical
medtech
HealthGPT
หลานม่า
NT Academy
conversational
Inference Endpoints
Update README.md
README.md
CHANGED
@@ -113,3 +113,70 @@ model_path = 'amornpan/openthaigpt-MedChatModelv11'
```python
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
```
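The steps below assume the imports and `model_path` defined earlier in the README. For completeness, a minimal sketch of those assumed imports (standard Hugging Face / PyTorch packages; adjust to your environment):

```python
# Assumed setup from the earlier sections of this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
```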

## 3. Prepare Your Input (Custom Prompt)

Create a custom medical prompt that you want the model to respond to:

```python
# Thai prompt: "Please describe the characteristics of early-stage oral cancer in the oral cavity."
custom_prompt = "โปรดอธิบายลักษณะช่องปากที่เป็นมะเร็งในระยะเริ่มต้น"
PROMPT = f'[INST] <<SYS>>You are a question answering assistant. Answer the question as truthfully and helpfully as possible. คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด<</SYS>>{custom_prompt}[/INST]'

# Tokenize the input prompt
inputs = tokenizer(PROMPT, return_tensors="pt", padding=True, truncation=True)
```

## 4. Configure the Model for Efficient Loading (4-bit Quantization)

The model uses 4-bit precision for efficient inference. Here's how to set up the configuration:

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
```

## 5. Load the Model with Quantization Support

Now, load the model with the 4-bit quantization settings:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    trust_remote_code=True
)
```

## 6. Move the Model and Inputs to the GPU (if available)

For faster inference, move the model and input tensors to a GPU, if available:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```
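Note that, depending on your transformers and bitsandbytes versions, a 4-bit quantized model may already be placed on the GPU at load time and may not support `.to()`. An alternative sketch that lets `accelerate` handle placement instead of steps 5–6 (the `device_map="auto"` argument is an assumption and requires the `accelerate` package):

```python
# Alternative to steps 5-6: let accelerate place the quantized model automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",  # requires `accelerate`
)
# Move the tokenized inputs to wherever the model was placed.
inputs = {k: v.to(model.device) for k, v in inputs.items()}
```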

## 7. Generate a Response from the Model

Now, generate the medical response by running the model:

```python
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True)
```
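If you want more control over the sampling behaviour, the usual generation parameters can be passed as well; the values below are illustrative assumptions rather than recommended settings:

```python
# Illustrative sampling settings; tune temperature/top_p for your use case.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
```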

## 8. Decode the Generated Text

Finally, decode and print the response from the model:

```python
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
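Because `outputs[0]` contains the prompt tokens followed by the answer, you may want to print only the newly generated part. A small sketch, assuming `inputs` is the dictionary produced by the tokenizer call above:

```python
# Decode only the tokens generated after the prompt.
prompt_length = inputs["input_ids"].shape[1]
answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(answer)
```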

## More Information

Contact: `amornpan@gmail.com`