---
datasets:
- Muennighoff/natural-instructions
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- peft
- llama
---
|
# LoRA LLaMA Natural Instructions |
|
|
|
![LLaMA Natural Instructions](./llama-natural-instructions-removebg-preview.png)
|
|
|
This model is a fine-tuned version of [llama-7b](https://huggingface.co/decapoda-research/llama-7b-hf) from [Meta](https://huggingface.co/facebook), |
|
on the [Natural Instructions](https://huggingface.co/datasets/Muennighoff/natural-instructions) dataset from [AllenAI](https://huggingface.co/allenai), |
|
using the [LoRA](https://arxiv.org/pdf/2106.09685.pdf) training technique. |
|
|
|
⚠️ **This model is for research purposes only (see the [license](https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/LICENSE)).**
|
|
|
## WandB Report |
|
|
|
Click on the badge below to see the full report on Weights & Biases. |
|
|
|
[![WandB](https://img.shields.io/badge/Weights_&_Biases-FFCC33?style=for-the-badge&logo=WeightsAndBiases&logoColor=black)](https://api.wandb.ai/links/chainyo-mleng/ia2mloow) |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install loralib bitsandbytes datasets git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git sentencepiece |
|
``` |
|
|
|
### Format of the input |
|
|
|
The input should be a string of text with the following format: |
|
|
|
```python
from typing import Union

prompt_template = {
    "prompt": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "response": "### Response:"
}


def generate_prompt(
    definition: str,
    inputs: str,
    targets: Union[None, str] = None,
) -> str:
    """Generate a prompt from instruction and input."""
    res = prompt_template["prompt"].format(
        instruction=definition, input=inputs
    )

    # Append the expected output only when building training examples.
    if targets:
        res = f"{res}{targets}"

    return res


def get_response(output: str) -> str:
    """Get the response from the output."""
    # Keep only the text that follows the response marker.
    return output.split(prompt_template["response"])[1].strip()
```
|
|
|
Feel free to use these utility functions to generate the prompt and to extract the response from the model output. |
|
|
|
- `definition` is the instruction describing the task. It is generally a single sentence explaining the expected output and the reasoning steps to follow.
- `inputs` is the input to the task. It can be a single sentence or a paragraph, and it is the context the model uses to generate its response.
- `targets` is the expected output of the task. It is used for training the model. _It is not required for inference._
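
As a quick illustration of the template and the three arguments described above, the sketch below rebuilds the two helpers in compact form and runs them on a made-up task. The instruction, the input, and the simulated model output are hypothetical and only show the string shapes involved.

```python
from typing import Optional

PROMPT = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
RESPONSE_MARKER = "### Response:"


def generate_prompt(definition: str, inputs: str, targets: Optional[str] = None) -> str:
    """Fill the template; append the target only for training examples."""
    res = PROMPT.format(instruction=definition, input=inputs)
    return f"{res}{targets}" if targets else res


def get_response(output: str) -> str:
    """Return the text that follows the response marker."""
    return output.split(RESPONSE_MARKER)[1].strip()


# Hypothetical task, for illustration only.
definition = "Classify the sentiment of the sentence as positive or negative."
sentence = "I loved this movie."

inference_prompt = generate_prompt(definition, sentence)
training_prompt = generate_prompt(definition, sentence, targets="positive")

# Simulate a decoded model output: the prompt followed by the generated text.
simulated_output = inference_prompt + "positive"
print(get_response(simulated_output))
```

The inference prompt ends right after the `### Response:` line, while the training prompt carries the target after it.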
|
|
|
### Inference |
|
|
|
You can either load the base model and apply the LoRA adapters to it, or load the full model, which bundles the base weights and the adapters.
|
|
|
#### The tokenizer |
|
|
|
```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("wordcab/llama-natural-instructions-7b")
# Pad on the left so that each prompt ends at the last position of its row.
tokenizer.padding_side = "left"
# Use token id 0 as the padding token.
tokenizer.pad_token_id = 0
```
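
The `padding_side = "left"` setting matters when prompts of different lengths are batched together: a decoder-only model continues from the last position of each row, so left padding keeps every prompt's final token at the end of its row. The toy sketch below uses plain Python lists and a hypothetical `pad_batch` helper (not a `transformers` function) to show the difference. The token ids are made up.

```python
from typing import List

PAD_ID = 0  # same pad token id as the one set on the tokenizer


def pad_batch(batch: List[List[int]], side: str = "left") -> List[List[int]]:
    """Pad every sequence to the longest length in the batch on the given side."""
    width = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        filler = [PAD_ID] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


# Hypothetical token ids for two prompts of different lengths.
batch = [[11, 12, 13, 14], [21, 22]]
left = pad_batch(batch, side="left")
right = pad_batch(batch, side="right")
print(left)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(right)  # [[11, 12, 13, 14], [21, 22, 0, 0]]
```

With right padding, the shorter prompt would end in pad tokens, and generation would start after them instead of after the prompt.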
|
|
|
#### Load the model with the adapters |
|
|
|
```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# Load the base model in 8-bit precision.
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Apply the LoRA adapters on top of the base model.
model = PeftModel.from_pretrained(
    model,
    "wordcab/llama-natural-instructions-7b",
    torch_dtype=torch.float16,
    device_map={"": 0},
)
```
|
|
|
#### Load the full model |
|
|
|
⚠️ Work in progress... |
|
|
|
```python
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "wordcab/llama-natural-instructions-7b",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
|
|
|
#### Evaluation mode |
|
|
|
Put the model in evaluation mode before running inference. If you are using PyTorch 2.0 or higher, you can also call `torch.compile` on the model.
|
|
|
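```python
import torch

# Switch to evaluation mode (disables dropout).
model.eval()
# torch.compile is only available from PyTorch 2.0 onwards.
if torch.__version__ >= "2":
    model = torch.compile(model)
```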
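
The version check above compares strings, which orders them character by character and would misorder a two-digit major version. A minimal sketch of a numeric comparison follows, assuming a hypothetical `major_version` helper that is not part of PyTorch.

```python
def major_version(version: str) -> int:
    """Return the leading major number of a version string such as '2.0.1+cu118'."""
    return int(version.split(".")[0])


# Lexical string comparison misorders a two-digit major version...
print("10.0" >= "2")                      # False
# ...whereas comparing the parsed major number does not.
print(major_version("10.0") >= 2)         # True
print(major_version("1.13.1") >= 2)       # False
print(major_version("2.0.1+cu118") >= 2)  # True
```

With this helper, the condition could be written as `major_version(torch.__version__) >= 2`.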
|
|
|
#### Generate the response |
|
|
|
```python
import torch
from transformers import GenerationConfig

prompt = generate_prompt(
    "In this task, you have to analyze the full sentences and do reasoning and quick maths to find the correct answer.",
    "You are now a superbowl star. You are the quarterback of the team. Your team is down by 3 points. You are in the last 2 minutes of the game. The other team has a score of 28. What is the score of your team?",
)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048)
input_ids = inputs["input_ids"].to(model.device)

generation_config = GenerationConfig(
    temperature=0.2,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)

with torch.no_grad():
    gen_outputs = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=50,
    )

# Decode the first sequence and keep only the text after the response marker.
s = gen_outputs.sequences[0]
output = tokenizer.decode(s, skip_special_tokens=True)
response = get_response(output)
print(response)
# Output reported for this prompt: 25
```
|
|
|
You can also try other prompts that are not math-related! :hugs:
|
|
|
## Benchmark
|
|
|
We benchmarked our model on the following tasks: [BoolQ](https://huggingface.co/datasets/boolq), [PIQA](https://huggingface.co/datasets/piqa), [WinoGrande](https://huggingface.co/datasets/winogrande), [OpenBookQA](https://huggingface.co/datasets/openbookqa). |
|
|
|
| Model | BoolQ | PIQA | WinoGrande | OpenBookQA | Precision | Inference time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Original LLaMA 7B | 76.5 | 79.8 | 70.1 | 57.2 | fp32 | 3 |
| Original LLaMA 13B | 78.1 | 80.1 | 73 | 56.4 | fp32 | >5 |
| LoRA LLaMA 7B | 63.9 | 51.3 | 48.9 | 31.4 | 8bit | 0.65 |
| LoRA LLaMA 13B | 70 | 63.93 | 51.6 | 50.4 | 8bit | 1.2 |
|
|
|
__Link to the 13B model:__ [wordcab/llama-natural-instructions-13b](https://huggingface.co/wordcab/llama-natural-instructions-13b) |
|
|
|
Overall, our LoRA model scores lower than the original model from Meta when compared with the results reported in the [original paper](https://arxiv.org/pdf/2302.13971.pdf).
|
|
|
The performance degradation is due to loading the model in 8-bit and using the adapters from the LoRA training. Thanks to the 8-bit quantization, the model is roughly 4 times faster than the original model, and the results are still decent.
|
|
|
Some complex tasks like WinoGrande and OpenBookQA are more difficult to solve with the adapters. |
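
The gaps and the speed-up discussed above can be checked with a little arithmetic on the 7B rows of the benchmark table. The sketch below only copies those figures. The 13B pair is left out because the original 13B time is reported only as a lower bound (more than 5 seconds).

```python
# Scores and inference times copied from the 7B rows of the benchmark table.
original_7b = {"BoolQ": 76.5, "PIQA": 79.8, "WinoGrande": 70.1, "OpenBookQA": 57.2}
lora_7b = {"BoolQ": 63.9, "PIQA": 51.3, "WinoGrande": 48.9, "OpenBookQA": 31.4}
original_seconds = 3.0
lora_seconds = 0.65

# Score gap per task, in points (original minus LoRA).
gaps = {task: round(original_7b[task] - lora_7b[task], 1) for task in original_7b}

# Speed-up factor of the 8-bit LoRA model over the fp32 original.
speedup = round(original_seconds / lora_seconds, 1)

for task, gap in gaps.items():
    print(f"{task}: -{gap} points")
print(f"Speed-up: {speedup}x")
```

On these figures, the 7B LoRA model trails the original by between 12.6 and 28.5 points depending on the task, while running about 4.6 times faster.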
|
|
|
## Training Hardware |
|
|
|
This model was trained on a single NVIDIA RTX 3090 GPU. |