base_model: unsloth/Llama-3.2-1B-bnb-4bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
Mental Health Chatbot using 1B finetuned Llama 3.2 Model
Inference
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "ImranzamanML/1B_finetuned_llama3.2",
max_seq_length = 5020,
dtype = None,
load_in_4bit = True)
Using this text to feed into model for getting the response
text="I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"
Here is some keys to note:
The model = FastLanguageModel.for_inference(model)
configures the model specifically for inference, optimizing its performance for generating responses.
The input text is tokenized using the tokenizer
, it convert the text into a format that model can process. We are using data_prompt
to format the input text, while the response placeholder is left empty to get response from model. The return_tensors = "pt"
parameter specifies that the output should be in PyTorch tensors, which are then moved to the GPU using .to("cuda")
for faster processing.
The model.generate
method generating response based on the tokenized inputs. The parameters max_new_tokens = 5020
and use_cache = True
ensure that the model can produce long and coherent responses efficiently by utilizing cached computation from previous layers.
model = FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
data_prompt.format(
#instructions
text,
#answer
"",
)
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer=tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)