Chinkara 7B (Improved)

Chinkara is a Large Language Model trained on timdettmers/openassistant-guanaco dataset based on Meta's brand new LLaMa-2 with 7 billion parameters using QLoRa Technique, optimized for small consumer size GPUs.

Information

For more information about the model please visit prp-e/chinkara on Github.

Inference Guide

NOTE: This part is for the time you want to load and infere the model on your local machine. You still need 8GB of VRAM on your GPU. The recommended GPU is at least a 2080!

Installing libraries

pip install  -U bitsandbytes
pip install  -U git+https://github.com/huggingface/transformers.git
pip install  -U git+https://github.com/huggingface/peft.git
pip install  -U git+https://github.com/huggingface/accelerate.git
pip install  -U datasets
pip install  -U einops

Loading the model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Trelis/Llama-2-7b-chat-hf-sharded-bf16" 
adapters_name = 'MaralGPT/chinkara-7b-improved' 

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory= {i: '24000MB' for i in range(torch.cuda.device_count())},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4'
    ),
)
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Setting the model up

from peft import LoraConfig, get_peft_model

model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Prompt and inference

prompt = "What is the answer to life, universe and everything?" 

prompt = f"###Human: {prompt} ###Assistant:"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=50, temperature=0.5, repetition_penalty=1.0)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)

Training procedure

The following bitsandbytes quantization config was used during training:

load_in_8bit: False
load_in_4bit: True
llm_int8_threshold: 6.0
llm_int8_skip_modules: None
llm_int8_enable_fp32_cpu_offload: False
llm_int8_has_fp16_weight: False
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: False
bnb_4bit_compute_dtype: float16

Framework versions

PEFT 0.5.0.dev0