---
library_name: peft
base_model: TheBloke/Llama-2-7b-Chat-GPTQ
pipeline_tag: text-generation
inference: false
license: openrail
language:
- en
datasets:
- flytech/python-codes-25k
co2_eq_emissions:
  emissions: 1190
  source: >-
    Quantifying the Carbon Emissions of Machine Learning
    https://mlco2.github.io/impact#compute
  training_type: finetuning
  hardware_used: 1 P100 16GB GPU
tags:
- text2code
- LoRA
- GPTQ
- Llama-2-7B-Chat
- text2python
- instruction2code
---

# Llama-2-7b-Chat-GPTQ fine-tuned on PYTHON-CODES-25K

Generates Python code that accomplishes the given instruction.

## LoRA Adapter Head

### Description

Parameter-Efficient Fine-Tuning (PEFT) of the 4-bit GPTQ-quantized Llama-2-7b-Chat from TheBloke/Llama-2-7b-Chat-GPTQ on the flytech/python-codes-25k dataset.

- **Language(s) (NLP):** English
- **License:** openrail
- **Quantization:** GPTQ 4-bit
- **PEFT:** LoRA
- **Finetuned from model:** [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ)
- **Dataset:** [flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k)

## Intended uses & limitations

Demonstrates the efficacy of quantization and PEFT. Implemented as a personal project.

### How to use

The quantized model is fine-tuned with PEFT, so only the trained adapter is published. Merging a LoRA adapter into a GPTQ-quantized model is not yet supported, so instead of loading a single fine-tuned model we load the base model and apply the fine-tuned adapter on top.

```python
instruction = """Help me set up my daily to-do list!"""
```

```python
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the GPTQ-quantized base model, then attach the fine-tuned LoRA adapter on top
config = PeftConfig.from_pretrained("SwastikM/Llama-2-7B-Chat-text2code")
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GPTQ", device_map="auto")
model = PeftModel.from_pretrained(model, "SwastikM/Llama-2-7B-Chat-text2code")
tokenizer = AutoTokenizer.from_pretrained("SwastikM/Llama-2-7B-Chat-text2code")

inputs = tokenizer(instruction, return_tensors="pt").input_ids.to('cuda')
outputs = model.generate(inputs, max_new_tokens=500, do_sample=False, num_beams=1)

code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)
```

### Size Comparison

The table compares the VRAM requirements for loading and training the FP16 base model versus the 4-bit GPTQ-quantized model with PEFT. The base-model values are taken from Hugging Face's [Model Memory Calculator](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).

| Model                  | Total Size | Training Using Adam |
| ---------------------- | ---------- | ------------------- |
| Base Model (FP16)      | 12.37 GB   | 49.48 GB            |
| 4-bit Quantized + PEFT | 3.90 GB    | 11 GB               |

## Training Details

### Training Data

**Dataset:** [flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k)

Trained on the `instruction` column of 20,000 randomly shuffled examples.

### Training Procedure

Custom training loop with Hugging Face Accelerate.
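The original training script is not included in this card. Below is a minimal sketch, under stated assumptions, of how the GPTQ base model, the LoRA adapter, and an Accelerate training loop could be wired together with the hyperparameters listed below. The use of the dataset's `output` column, the sequence length, the warmup steps, and the save path are illustrative assumptions, not details taken from the original run.

```python
# Minimal sketch (not the original training script): LoRA fine-tuning of a GPTQ base
# model with a Hugging Face Accelerate training loop. Dataset columns other than
# `instruction`, the sequence length, and the warmup steps are assumptions.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, GPTQConfig,
                          get_linear_schedule_with_warmup)

base_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

# Disable the exllama kernels, which do not support backpropagation through GPTQ weights
model = AutoModelForCausalLM.from_pretrained(
    base_id, device_map="auto",
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["k_proj", "o_proj", "q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

def tokenize(batch):
    # Using `output` as the completion column is an assumption; adjust to the dataset schema
    text = [i + "\n" + o for i, o in zip(batch["instruction"], batch["output"])]
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

dataset = load_dataset("flytech/python-codes-25k", split="train").shuffle(seed=42).select(range(20_000))
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
dataset.set_format("torch")
loader = DataLoader(dataset, batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=625)

accelerator = Accelerator(gradient_accumulation_steps=8)  # 20,000 / (4 * 8) = 625 global steps
model, optimizer, loader, scheduler = accelerator.prepare(model, optimizer, loader, scheduler)

model.train()
for batch in loader:
    with accelerator.accumulate(model):
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
        loss = model(input_ids=batch["input_ids"],
                     attention_mask=batch["attention_mask"],
                     labels=labels).loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

accelerator.unwrap_model(model).save_pretrained("llama2-text2code-lora")  # saves only the adapter
```

Only the LoRA adapter weights are saved at the end, which is what the inference snippet above reattaches to the quantized base model with `PeftModel.from_pretrained`.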
#### Training Hyperparameters

- **Optimizer:** AdamW
- **lr:** 2e-5
- **decay:** linear
- **batch_size:** 4
- **gradient_accumulation_steps:** 8
- **global_step:** 625

LoraConfig

- ***r:*** 8
- ***lora_alpha:*** 32
- ***target_modules:*** ["k_proj", "o_proj", "q_proj", "v_proj"]
- ***lora_dropout:*** 0.05

#### Hardware

- **GPU:** P100

## Additional Information

- ***GitHub:*** [Repository]()
- ***Intro to quantization:*** [Blog](https://huggingface.co/blog/merve/quantization)
- ***Emergent Features:*** [Academic](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features)
- ***GPTQ Paper:*** [GPTQ](https://arxiv.org/pdf/2210.17323)
- ***bitsandbytes and further:*** [LLM.int8()](https://arxiv.org/pdf/2208.07339)

## Acknowledgment

Thanks to [Merve Noyan](https://huggingface.co/blog/merve/quantization) for the concise introduction to quantization. Thanks to the [Hugging Face team](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing#scrollTo=vT0XjNc2jYKy) for the notebook on GPTQ.

## Model Card Authors

Swastik Maiti