---
license: mit
library_name: peft
tags:
- trl
- dpo
- generated_from_trainer
- distilabel
- argilla
base_model: microsoft/phi-2
model-index:
- name: phi2-lora-quantized-distilabel-intel-orca-dpo-pairs
  results: []
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
language:
- en
pipeline_tag: text-generation
---

# phi2-lora-quantized-distilabel-intel-orca-dpo-pairs

This model is a LoRA adapter for [microsoft/phi-2](https://huggingface.co/microsoft/phi-2), fine-tuned with DPO on the [distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) dataset.
The full training notebook is available [on Google Colab](https://colab.research.google.com/drive/1PGMj7jlkJaCiSNNihA2NtpILsRgkRXrJ?usp=sharing).

It achieves the following results on the evaluation set:
- Loss: 0.4537
- Rewards/chosen: -0.0837
- Rewards/rejected: -1.2628
- Rewards/accuracies: 0.8301
- Rewards/margins: 1.1791
- Logps/rejected: -224.8409
- Logps/chosen: -203.2228
- Logits/rejected: 0.4773
- Logits/chosen: 0.3062

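
As a quick sanity check on how to read these metrics: in TRL's DPO logging, `Rewards/margins` is the mean of the chosen reward minus the rejected reward, and `Rewards/accuracies` is the fraction of pairs where the chosen reward is higher. The sketch below only re-derives the margin from the two reported means; the function name `reward_margin` is illustrative and not part of any library.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -0.0837
rewards_rejected = -1.2628
reported_margin = 1.1791


def reward_margin(chosen: float, rejected: float) -> float:
    """Return the DPO reward margin: chosen reward minus rejected reward."""
    return chosen - rejected


margin = reward_margin(rewards_chosen, rewards_rejected)
print(f"computed margin: {margin:.4f}, reported margin: {reported_margin:.4f}")
```

A positive margin means the adapter assigns a higher implicit reward to the chosen answers than to the rejected ones, on average.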
## Model description

The adapter was fine-tuned on a Google Colab A100 GPU using DPO and the [distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) dataset. To serve LoRA adapters for LLMs at scale, I recommend looking at [predibase/lorax](https://github.com/predibase/lorax).

You can try the model with the code below. It loads the LoRA adapter on top of the base model and applies a bitsandbytes 4-bit quantization config only when CUDA is available.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import PeftModel

# Template used for fine-tuning:
# template = """\
# Instruct: {instruction}\n
# Output: {response}"""

base_model_id = "microsoft/phi-2"
adapter_id = "davidberenstein1957/phi2-lora-quantized-distilabel-intel-orca-dpo-pairs"

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using {torch.cuda.get_device_name(0)}")
    # 4-bit NF4 quantization through bitsandbytes (CUDA only)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype='float16',
        bnb_4bit_use_double_quant=False,
    )
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    bnb_config = None
else:
    device = torch.device("cpu")
    bnb_config = None
    print("No GPU available, using CPU instead.")

# Half precision is not reliably supported on CPU, so fall back to float32 there
dtype = torch.float32 if device.type == "cpu" else torch.float16

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=dtype, quantization_config=bnb_config
)
model = PeftModel.from_pretrained(model, adapter_id)
if bnb_config is None:
    # A quantized model is already placed on the GPU; only move unquantized models
    model = model.to(device)

prompt = "Instruct: What is the capital of France?\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)

# max_new_tokens is an arbitrary cap; adjust it to your use case
outputs = model.generate(**inputs, max_new_tokens=50)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

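
To reuse the same prompt layout for other questions, a pair of small helpers can be handy. This is a minimal sketch, not part of the original notebook: `build_prompt` and `extract_response` are hypothetical names, the single newline before `Output:` follows the inference prompt above, and the `<|endoftext|>` marker is assumed to be the end-of-text token emitted by the phi-2 tokenizer.

```python
def build_prompt(instruction: str) -> str:
    """Format an instruction in the `Instruct: ... Output:` layout used above."""
    return f"Instruct: {instruction.strip()}\nOutput:"


def extract_response(generated_text: str) -> str:
    """Return the text after the first `Output:` marker.

    Returns an empty string when the marker is missing.
    """
    _, _, response = generated_text.partition("Output:")
    # Assumption: generation ends with the <|endoftext|> token; drop it and
    # anything that follows.
    return response.split("<|endoftext|>")[0].strip()


prompt = build_prompt("What is the capital of France?")
print(repr(prompt))
```

With the model loaded as above, `extract_response(tokenizer.batch_decode(outputs)[0])` would then return only the generated answer.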
## Intended uses & limitations

This is a LoRA adapter for phi-2, not a full fine-tune of the model. Additionally, I did not spend time tuning the hyperparameters.

## Training and evaluation data

The adapter was fine-tuned on a Google Colab A100 GPU using DPO and the [distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) dataset. The full training notebook is available [on Google Colab](https://colab.research.google.com/drive/1PGMj7jlkJaCiSNNihA2NtpILsRgkRXrJ?usp=sharing). The adapter and trainer configurations are shown below.

```python
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.5,
    r=32,
    target_modules=['k_proj', 'q_proj', 'v_proj', 'fc1', 'fc2'],
    bias="none",
    task_type="CAUSAL_LM",
)
```

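
To get a feel for what this configuration implies, the sketch below computes the standard LoRA scaling factor (`lora_alpha / r`) and a rough count of trainable adapter weights. The phi-2 dimensions (hidden size 2560, MLP size 10240, 32 layers) are assumptions that are not stated in this card, and the helper names are illustrative.

```python
# Values from the LoraConfig above.
LORA_ALPHA = 16
LORA_R = 32

# Assumed phi-2 dimensions (not stated in this card).
HIDDEN_SIZE = 2560
INTERMEDIATE_SIZE = 10240
NUM_LAYERS = 32

# (in_features, out_features) of each targeted linear layer, under the
# assumptions above.
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "v_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "fc1": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "fc2": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_scaling(alpha: int, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update."""
    return alpha / r


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Weights in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"scaling factor: {lora_scaling(LORA_ALPHA, LORA_R)}")
print(f"adapter weights per layer: {per_layer:,}")
print(f"adapter weights in total:  {total:,}")
```

With `lora_alpha` smaller than `r`, the low-rank update is scaled down by half before it is added to the frozen weights.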
```python
from transformers import TrainingArguments

# `model_name` must be defined beforehand (it is not shown in this card)
training_arguments = TrainingArguments(
    output_dir=f"./{model_name}",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=2,
    log_level="debug",
    save_steps=20,
    logging_steps=20,
    learning_rate=1e-5,
    eval_steps=20,
    num_train_epochs=1,  # Modified for tutorial purposes
    max_steps=100,
    warmup_steps=20,
    lr_scheduler_type="linear",
)
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 20
- num_epochs: 1

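
Two of these values follow from the others, which the sketch below works through: the total train batch size is the per-device batch size times the gradient accumulation steps (times the number of devices, assumed to be 1 here), and a linear schedule with warmup ramps the learning rate up for 20 steps and then decays it to zero. The total of 320 optimizer steps is an assumption taken from the training results table in this card, and the function names are illustrative.

```python
BASE_LR = 1e-5
WARMUP_STEPS = 20
TOTAL_STEPS = 320  # Assumption: last step listed in the training results table


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(f"total train batch size: {effective_batch_size(2, 16)}")
for step in (0, 10, 20, 170, 320):
    lr = linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The peak learning rate of 1e-05 is reached at step 20, right after warmup ends.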
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6853        | 0.06  | 20   | 0.6701          | 0.0133         | -0.0368          | 0.6905             | 0.0501          | -212.5803      | -202.2522    | 0.3853          | 0.2532        |
| 0.6312        | 0.12  | 40   | 0.5884          | 0.0422         | -0.2208          | 0.8138             | 0.2630          | -214.4207      | -201.9638    | 0.4254          | 0.2816        |
| 0.547         | 0.19  | 60   | 0.5146          | 0.0172         | -0.5786          | 0.8278             | 0.5958          | -217.9983      | -202.2132    | 0.4699          | 0.3110        |
| 0.4388        | 0.25  | 80   | 0.4893          | -0.0808        | -1.0789          | 0.8293             | 0.9981          | -223.0014      | -203.1934    | 0.5158          | 0.3396        |
| 0.4871        | 0.31  | 100  | 0.4818          | -0.1298        | -1.2346          | 0.8297             | 1.1048          | -224.5586      | -203.6837    | 0.5133          | 0.3340        |
| 0.4863        | 0.37  | 120  | 0.4723          | -0.1230        | -1.1718          | 0.8301             | 1.0488          | -223.9305      | -203.6159    | 0.4910          | 0.3167        |
| 0.4578        | 0.44  | 140  | 0.4666          | -0.1257        | -1.1772          | 0.8301             | 1.0515          | -223.9844      | -203.6428    | 0.4795          | 0.3078        |
| 0.4587        | 0.5   | 160  | 0.4625          | -0.0746        | -1.1272          | 0.8301             | 1.0526          | -223.4841      | -203.1310    | 0.4857          | 0.3139        |
| 0.4688        | 0.56  | 180  | 0.4595          | -0.0584        | -1.1194          | 0.8297             | 1.0610          | -223.4062      | -202.9692    | 0.4890          | 0.3171        |
| 0.4189        | 0.62  | 200  | 0.4579          | -0.0666        | -1.1647          | 0.8297             | 1.0982          | -223.8598      | -203.0511    | 0.4858          | 0.3138        |
| 0.4392        | 0.68  | 220  | 0.4564          | -0.0697        | -1.1915          | 0.8301             | 1.1219          | -224.1278      | -203.0823    | 0.4824          | 0.3110        |
| 0.4659        | 0.75  | 240  | 0.4554          | -0.0826        | -1.2245          | 0.8301             | 1.1419          | -224.4574      | -203.2112    | 0.4761          | 0.3052        |
| 0.4075        | 0.81  | 260  | 0.4544          | -0.0823        | -1.2328          | 0.8301             | 1.1504          | -224.5403      | -203.2089    | 0.4749          | 0.3044        |
| 0.4015        | 0.87  | 280  | 0.4543          | -0.0833        | -1.2590          | 0.8301             | 1.1757          | -224.8026      | -203.2188    | 0.4779          | 0.3067        |
| 0.4365        | 0.93  | 300  | 0.4539          | -0.0846        | -1.2658          | 0.8301             | 1.1812          | -224.8702      | -203.2313    | 0.4780          | 0.3067        |
| 0.4589        | 1.0   | 320  | 0.4537          | -0.0837        | -1.2628          | 0.8301             | 1.1791          | -224.8409      | -203.2228    | 0.4773          | 0.3062        |

### Framework versions

- PEFT 0.7.1
- Transformers 4.37.1
- PyTorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.1