metadata

library_name: peft
base_model: TheBloke/Llama-2-7b-Chat-GPTQ
pipeline_tag: text-generation
inference: false
license: openrail
language:
  - en
datasets:
  - flytech/python-codes-25k
co2_eq_emissions:
  emissions: 1190
  source: >-
    Quantifying the Carbon Emissions of Machine Learning
    https://mlco2.github.io/impact#compute
  training_type: finetuning
  hardware_used: 1 P100 16GB GPU
tags:
  - text2code
  - LoRA
  - GPTQ
  - Llama-2-7B-Chat
  - text2python
  - instruction2code

Llama-2-7b-Chat-GPTQ fine-tuned on PYTHON-CODES-25K

Generate Python code that accomplishes the task instructed.

LoRA Adapter Head

Description

Parameter-Efficient Finetuning (PEFT) of a 4-bit quantized Llama-2-7b-Chat model (TheBloke/Llama-2-7b-Chat-GPTQ) on the flytech/python-codes-25k dataset.

Language(s) (NLP): English
License: openrail
Quantization: GPTQ 4bit
PEFT: LoRA
Finetuned from model TheBloke/Llama-2-7b-Chat-GPTQ
Dataset: flytech/python-codes-25k

Intended uses & limitations

This model explores the efficacy of quantization combined with PEFT. It was implemented as a personal project.

How to use

The quantized model was finetuned with PEFT, which yields a trained LoRA adapter.
Merging a LoRA adapter into a GPTQ-quantized model is not yet supported.
So instead of loading a single finetuned model, we need to load the base
model and attach the finetuned adapter on top.

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

instruction = "Help me set up my daily to-do list!"

config = PeftConfig.from_pretrained("SwastikM/Llama-2-7B-Chat-text2code")      # PEFT config
# device_map="auto" (requires accelerate) places the weights on the available GPU,
# so the model sits on the same device as the inputs moved with .to('cuda') below
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GPTQ", device_map="auto")  # Load the base model
model = PeftModel.from_pretrained(model, "SwastikM/Llama-2-7B-Chat-text2code") # Attach the trained adapter to the base model
tokenizer = AutoTokenizer.from_pretrained("SwastikM/Llama-2-7B-Chat-text2code")

inputs = tokenizer(instruction, return_tensors="pt").input_ids.to('cuda')
outputs = model.generate(inputs, max_new_tokens=500, do_sample=False, num_beams=1)
code = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(code)
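
For decoder-only models, generate returns the prompt tokens followed by the newly generated ones, so the decoded text begins with the instruction itself. A minimal sketch of keeping only the generated part is shown below. Note that strip_prompt is a hypothetical helper name, not part of transformers or peft, and it assumes the prompt occupies the first prompt_len positions of the output sequence.

```python
def strip_prompt(output_ids, prompt_len):
    """Return only the tokens generated after the prompt.

    Works on anything that supports len() and slicing
    (a Python list or a 1-D tensor).
    """
    if prompt_len < 0 or prompt_len > len(output_ids):
        raise ValueError("prompt_len must be between 0 and len(output_ids)")
    return output_ids[prompt_len:]


# Toy token ids standing in for real tokenizer output (illustrative values only).
prompt_ids = [1, 15043, 29991]
full_output = prompt_ids + [4013, 338, 775, 2]
new_tokens = strip_prompt(full_output, len(prompt_ids))
print(new_tokens)
```

With the usage code above, this would be strip_prompt(outputs[0], inputs.shape[1]) applied before tokenizer.decode.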

Size Comparison

The table compares the VRAM required to load and to train the FP16 base model against the 4-bit GPTQ-quantized model with PEFT. The base model values are taken from the Hugging Face Model Memory Calculator.

Model	Total Size	Training Using Adam
Base Model	12.37 GB	49.48 GB
4bitQuantized+PEFT	3.90 GB	11 GB
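
The base model row is consistent with a common rule of thumb for full finetuning with Adam: weights, gradients, and two optimizer moment buffers, or roughly four times the model size. The sketch below checks that arithmetic against the table and computes the savings. The 4x factor is an assumption inferred from the table values, and the function names are illustrative.

```python
ADAM_MULTIPLIER = 4  # assumed: weights + gradients + 2 Adam moment buffers


def adam_training_estimate(model_size_gb):
    """Rough VRAM estimate for full finetuning with Adam, in GB."""
    return model_size_gb * ADAM_MULTIPLIER


def reduction_factor(baseline_gb, reduced_gb):
    """How many times smaller the reduced footprint is than the baseline."""
    return baseline_gb / reduced_gb


# Values copied from the table above.
base_load, base_train = 12.37, 49.48
quant_load, quant_train = 3.90, 11.0

estimate = adam_training_estimate(base_load)
print(f"4x estimate for base model training: {estimate:.2f} GB")
print(f"Loading reduction:  {reduction_factor(base_load, quant_load):.2f}x")
print(f"Training reduction: {reduction_factor(base_train, quant_train):.2f}x")
```

By these numbers, the quantized model with PEFT needs about 3.2x less VRAM to load and about 4.5x less to train.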

Training Details

Training Data

Dataset: flytech/python-codes-25k

Trained on the instruction column of 20,000 randomly shuffled examples.

Training Procedure

A training loop built on Hugging Face Accelerate.

Training Hyperparameters

Optimizer: AdamW
lr: 2e-5
decay: linear
batch_size: 4
gradient_accumulation_steps: 8
global_step: 625
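
These settings are consistent with the size of the training data. As a quick check, assuming a single GPU (as listed under hardware) and one optimizer update per accumulation cycle, the number of examples seen can be computed as follows. The variable names are illustrative.

```python
batch_size = 4
gradient_accumulation_steps = 8
global_step = 625
num_gpus = 1  # assumption: single-GPU training

# Examples consumed per optimizer update.
effective_batch_size = batch_size * gradient_accumulation_steps * num_gpus

# Total examples processed over the whole run.
examples_seen = effective_batch_size * global_step

print(effective_batch_size)  # 32
print(examples_seen)         # 20000
```

The 20,000 examples seen match the 20,000 training samples, which suggests a single pass over the data.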

LoraConfig

r: 8
lora_alpha: 32
target_modules: ["k_proj","o_proj","q_proj","v_proj"]
lora_dropout: 0.05
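
To give a sense of how small the adapter is, the sketch below counts the trainable parameters this configuration adds. It assumes the standard Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square attention projections), which are not stated in this card. Each adapted projection gets two low-rank matrices, A (r x in) and B (out x r), and the update is scaled by lora_alpha / r.

```python
r = 8
lora_alpha = 32
target_modules = ["k_proj", "o_proj", "q_proj", "v_proj"]

# Assumed Llama-2-7B architecture values (not taken from this card).
hidden_size = 4096
num_layers = 32


def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


per_module = lora_params(hidden_size, hidden_size, r)
trainable = per_module * len(target_modules) * num_layers
scaling = lora_alpha / r

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter has about 8.4 million trainable parameters, a small fraction of the roughly 7 billion in the base model.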

Hardware

GPU: P100

Additional Information

Github: Repository
Intro to quantization: Blog
Emergent Feature: Academic
GPTQ Paper: GPTQ
BITSANDBYTES and further LLM.int8()

Acknowledgment

Thanks to @Merve Noyan for the precise intro. Thanks to the @HuggingFace Team for the notebook on GPTQ.

Model Card Authors

Swastik Maiti