ImranzamanML's picture
Update README.md
f462c8a verified
metadata
base_model: unsloth/Llama-3.2-1B-bnb-4bit
language:
  - en
license: apache-2.0
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - trl

Mental Health Chatbot using 1B finetuned Llama 3.2 Model

Inference

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "ImranzamanML/1B_finetuned_llama3.2",
max_seq_length = 5020,
dtype = None,
load_in_4bit = True)

Using this text to feed into model for getting the response

text="I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"

Note: Lets use the fine-tuned model for inference in order to generate responses based on mental health-related prompts !

Here is some keys to note:

    The model = FastLanguageModel.for_inference(model) configures the model specifically for inference, optimizing its performance for generating responses.

    The input text is tokenized using the tokenizer, it convert the text into a format that model can process. We are using data_prompt to format the input text, while the response placeholder is left empty to get response from model. The return_tensors = "pt" parameter specifies that the output should be in PyTorch tensors, which are then moved to the GPU using .to("cuda") for faster processing.

    The model.generate method generating response based on the tokenized inputs. The parameters max_new_tokens = 5020 and use_cache = True ensure that the model can produce long and coherent responses efficiently by utilizing cached computation from previous layers.

model = FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    data_prompt.format(
        #instructions
        text,
        #answer
        "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer=tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)