Merge cekal/mpt-7b-peft-compatible
Merges https://huggingface.co/cekal/mpt-7b-peft-compatible by @cekal.
This will add support for PEFT as well as QLoRA.
I tested that QLoRA starts training:
https://github.com/artidoro/qlora/issues/10
git clone https://huggingface.co/mosaicml/mpt-7b
pushd mpt-7b
git fetch origin refs/pr/42:pr/42
git checkout pr/42
popd
python qlora.py \
--model_name_or_path ./mpt-7b \
--trust_remote_code True \
--output_dir /output \
--dataset alpaca \
--do_train True \
--do_eval True \
--do_mmlu_eval True \
--source_max_len 384 \
--target_max_len 128 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--logging_steps 10 \
--max_steps 10000 \
--save_strategy steps \
--data_seed 42 \
--save_steps 1000 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--eval_steps 1000 \
--optim paged_adamw_32bit
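For anyone who wants to try PEFT directly rather than going through the qlora.py script, here is a minimal sketch of loading the patched checkout in 4-bit and attaching a LoRA adapter. This is not from the thread; it assumes transformers, peft, and bitsandbytes are installed, and the target_modules name is an assumption that should be checked against the model's actual module names.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the local checkout (with pr/42 applied) in 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./mpt-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./mpt-7b", trust_remote_code=True)

# Attach a LoRA adapter; "Wqkv" is an assumption about MPT's attention
# projection name -- check model.named_modules() to confirm.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["Wqkv"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()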
Any differences to #25?
Looks pretty similar TBH.
One difference is this line, which is needed for the model to work properly with device_map="auto" (around L290):
outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache, inputs_embeds=inputs_embeds)
last_hidden_state = outputs.last_hidden_state
if self.model_parallel:
    # device_map="auto" can place the final block and the tied embedding weight
    # on different devices, so move the hidden state before the output projection
    last_hidden_state = last_hidden_state.to(self.transformer.wte.weight.device)
logits = F.linear(last_hidden_state, self.transformer.wte.weight)
But that line could also be added there, I suppose.
There might be subtle differences in other places, too, but as I said the code looks pretty similar.
I'm not sure why the additional inputs_embeds parameter is needed. Maybe it's being used for cases where the caller already has the embeddings? Does anyone know?
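For what it's worth, one typical caller is soft-prompt / prompt-tuning code that builds the embeddings itself and skips input_ids entirely. A hedged sketch of that pattern (the zero tensor stands in for a learned prompt, and it only works once the model's forward accepts inputs_embeds, as in this PR):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./mpt-7b", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./mpt-7b", trust_remote_code=True)

ids = tokenizer("Hello world", return_tensors="pt").input_ids
token_embeds = model.get_input_embeddings()(ids)       # (1, seq_len, d_model)

# Placeholder "soft prompt" prepended in embedding space; in prompt tuning
# these vectors would be learned parameters.
soft_prompt = torch.zeros(1, 8, token_embeds.shape[-1])
inputs_embeds = torch.cat([soft_prompt, token_embeds], dim=1)

outputs = model(inputs_embeds=inputs_embeds)           # no input_ids passed
print(outputs.logits.shape)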
I made a similar version of this for 30B too, on top of the latest foundry changes, and it trains with QLoRA: https://huggingface.co/eluzhnica/mpt-30b-peft-compatible. It trains well from what I've tried.
Can you do the same thing for the 30B version?
I tried this and it gives the error:
TypeError: forward() takes 2 positional arguments but 3 were given
I think this is the same error you get when setting "--gradient_checkpointing False".
So I know MPT-7B doesn't support gradient checkpointing with the Hugging Face Trainer, but if you set it to False, you get the "TypeError: forward() takes 2 positional arguments but 3 were given" error? I have been dealing with that error for weeks now, and this might be the breakthrough I needed to convince me to just abandon MPT altogether.
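For context, here is a generic way this exact message can be produced; it's offered as an illustration of where such errors usually come from (a wrapper like torch.utils.checkpoint passing arguments positionally to a forward() that doesn't accept them), not a diagnosis of the MPT case.

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def forward(self, x):                  # accepts only one tensor besides self
        return x * 2

block = Block()
x = torch.randn(2, 4, requires_grad=True)
mask = torch.ones(2, 4)

checkpoint(block, x, use_reentrant=False)            # fine: one positional arg
try:
    checkpoint(block, x, mask, use_reentrant=False)  # mask passed positionally
except TypeError as e:
    print(e)  # forward() takes 2 positional arguments but 3 were given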