nyc-savvy-llama2-7b
Essentials:
- Based on LLaMa2-7b-hf (version 2, 7B params)
- Used QLoRA to fine-tune on 13k rows of /r/AskNYC formatted as Human/Assistant exchanges
- Released the adapter weights
- Merged quantized-then-dequantized LLaMa2 and the adapter weights to produce this full-sized model
Prompt options
This is the template used in training. It starts with "### Human: " (with a trailing space), followed by the post title and content, then "### Assistant: " (no space before it, one space after it).
### Human: Post title - post content### Assistant:
For example:
### Human: Where can I find a good bagel? - We are in Brooklyn### Assistant: Anywhere with fresh-baked bagels and lots of cream cheese options.
From QLoRA's Gradio example, it looks helpful to add a more assistant-like prompt, especially if you follow their lead for a chat format:
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
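A minimal sketch of building a prompt in this format. The names `build_prompt` and `SYSTEM_PROMPT` are hypothetical; the spacing follows the template described above, and the optional system line is joined with a newline, as in the test script further down.

```python
from typing import Optional

# Optional chat-style preamble, copied from the QLoRA Gradio example quoted above.
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(title: str, content: str, system: Optional[str] = None) -> str:
    """Format a post as '### Human: title - content### Assistant: '."""
    # No space before "### Assistant:", one space after it.
    prompt = f"### Human: {title} - {content}### Assistant: "
    if system:
        # The preamble goes on its own line ahead of the Human turn.
        prompt = f"{system}\n{prompt}"
    return prompt


print(build_prompt("Where can I find a good bagel?", "We are in Brooklyn"))
```

Passing `SYSTEM_PROMPT` as the third argument gives the chat-style variant; leaving it out reproduces the bare training template.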
Training data
- Collected one month of posts to /r/AskNYC from each year 2015-2019 (no content after July 2019)
- Downloaded from PushShift; kept only comments with an upvote score of 3 or higher
- Originally collected for my GPT-NYC model in spring 2021 - model / blog
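The preprocessing code is not part of this README, so the following is only a sketch of how the filtering and formatting might look. The field names (`title`, `selftext`, `body`, `score`) are assumed to follow Reddit-style dumps, the `text` key is an assumption about what the training script reads for its oasst1 format, and the sample post and comments are made up for illustration.

```python
import json

MIN_SCORE = 3  # threshold stated in the list above


def make_rows(post, comments, min_score=MIN_SCORE):
    """Return one training row per comment that meets the score threshold."""
    prompt = f"### Human: {post['title']} - {post['selftext']}### Assistant: "
    return [
        {"text": prompt + comment["body"]}
        for comment in comments
        if comment["score"] >= min_score
    ]


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Made-up sample data, not taken from the real dataset.
post = {"title": "Where can I find a good bagel?", "selftext": "We are in Brooklyn"}
comments = [
    {"body": "Anywhere with fresh-baked bagels and lots of cream cheese options.", "score": 5},
    {"body": "Low-scoring reply that gets dropped.", "score": 1},
]
rows = make_rows(post, comments)
print(len(rows))
```

Each accepted comment becomes its own row, so a popular post can contribute several Human/Assistant pairs.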
Training script
Training takes about 2 hours on Colab once the setup is right. The QLoRA script only lets you limit training with max_steps, but I wanted to stop after 1 epoch, hence the max_steps value below.
git clone https://github.com/artidoro/qlora
cd qlora
pip3 install -r requirements.txt --quiet
python3 qlora.py \
--model_name_or_path ../llama-2-7b-hf \
--use_auth \
--output_dir ../nyc-savvy-llama2-7b \
--logging_steps 10 \
--save_strategy steps \
--data_seed 42 \
--save_steps 500 \
--save_total_limit 40 \
--dataloader_num_workers 1 \
--group_by_length False \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--num_train_epochs 1 \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset /content/gpt_nyc.jsonl \
--dataset_format oasst1 \
--source_max_len 16 \
--target_max_len 512 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--max_steps 760 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0
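The link between max_steps and epochs is plain arithmetic: each optimizer step consumes per_device_train_batch_size × gradient_accumulation_steps examples per device. A small sketch, with hypothetical helper names and a single GPU assumed:

```python
import math


def examples_seen(max_steps, per_device_batch, grad_accum, n_devices=1):
    """Number of training examples consumed after max_steps optimizer steps."""
    return max_steps * per_device_batch * grad_accum * n_devices


def steps_for_one_epoch(n_rows, per_device_batch, grad_accum, n_devices=1):
    """Optimizer steps needed to pass over n_rows examples once."""
    return math.ceil(n_rows / (per_device_batch * grad_accum * n_devices))


# Values from the command above: batch size 1, 16 accumulation steps, 760 steps.
print(examples_seen(760, 1, 16))
```

With the settings above, 760 steps cover 12,160 examples, which is roughly one pass over the "13k rows" mentioned earlier. The exact row count is not given here, so treat the match as approximate.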
Merging it back
What you get in the output_dir is an adapter model. Here's ours. Cool, but not as easy to drop into their script.
Two options for merging:
- The included peftmerger.py script merges the adapter and saves the model.
- Chris Hayduk produced a script to quantize then de-quantize the base model before merging a QLoRA adapter. This requires bitsandbytes and a GPU.
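For orientation, a plain merge with the PEFT library can look roughly like this. It is a sketch, not the contents of peftmerger.py: the function name and the paths are placeholders, and it skips the quantize-then-dequantize step of the second option.

```python
def merge_adapter(base_path: str, adapter_path: str, out_dir: str) -> None:
    """Load a base model, apply a LoRA adapter, and save the merged weights."""
    # Heavy imports stay inside the function so this file can be imported
    # without loading torch, transformers, or peft.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in half precision (not quantized).
    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    # Attach the adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()
    # Save the merged model along with the tokenizer so it loads standalone.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_path).save_pretrained(out_dir)


# Example call with placeholder paths:
# merge_adapter("path/to/base-model", "path/to/adapter", "path/to/merged-model")
```

The merged directory can then be loaded with `AutoModelForCausalLM.from_pretrained` like any other full model.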
Testing that the model is NYC-savvy
You might wonder whether the model learned anything about NYC or is just the same old LLaMa2. To check, use a prompt that gives no location clues. This is from the pefttester.py script in this repo:
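from transformers import AutoModelForCausalLM, LlamaTokenizer, StoppingCriteriaList

# model_name is assumed to hold the local path or Hub ID of the merged model
m = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tok = LlamaTokenizer.from_pretrained(model_name)

# The prompt mentions neither NYC nor any borough
messages = "A chat between a curious human and an assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
messages += "### Human: What museums should I visit? - My kids are aged 12 and 5"
messages += "### Assistant: "
# Put the prompt on the same device as the model
input_ids = tok(messages, return_tensors="pt").input_ids.to(m.device)

# ... (elided; `stop`, the stopping criterion used below, is presumably defined here)

# Sampling settings
temperature = 0.7
top_p = 0.9
top_k = 0
repetition_penalty = 1.1

op = m.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    temperature=temperature,
    do_sample=temperature > 0.0,
    top_p=top_p,
    top_k=top_k,
    repetition_penalty=repetition_penalty,
    stopping_criteria=StoppingCriteriaList([stop]),
)
# Each decoded sequence contains the prompt followed by the generated reply
for line in op:
    print(tok.decode(line))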
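The decoded text contains the prompt as well as the continuation, and depending on the stopping criterion it may run on into a new "### Human:" turn. A small post-processing sketch for pulling out just the reply; the helper name `extract_reply` is hypothetical, and the sample string is a placeholder rather than real model output.

```python
HUMAN_MARK = "### Human:"
ASSISTANT_MARK = "### Assistant:"


def extract_reply(decoded: str) -> str:
    """Return only the first assistant reply from a decoded generation."""
    # Keep what follows the first assistant marker, if there is one.
    _, found, after = decoded.partition(ASSISTANT_MARK)
    reply = after if found else decoded
    # Cut the text off where a new human turn begins.
    reply = reply.split(HUMAN_MARK, 1)[0]
    # Drop LLaMA's beginning/end-of-sequence tokens if they were decoded as text.
    for special in ("<s>", "</s>"):
        reply = reply.replace(special, "")
    return reply.strip()


# Placeholder string standing in for tok.decode(line).
sample = (
    "<s> ### Human: What museums should I visit? - My kids are aged 12 and 5"
    "### Assistant: (model reply here)### Human: (next turn)</s>"
)
print(extract_reply(sample))
```

Calling `tok.decode(line, skip_special_tokens=True)` is another way to drop the special tokens before trimming.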