Fine-tuning Phi-3-vision on custom dataset fails

#21
by samyak24jain - opened

Hello, thank you for this incredibly powerful model.
I'm trying to fine-tune Phi-3-vision on a custom dataset using LoRA with this data collator:

class CustomDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
                
        for example in examples:

            question = "sample question text"
            answer = "sample answer text"
            INST_PREFIX='sample instruction prefix'

            messages = [     
                {
                    "role": "user",
                    "content": f"<|image_1|>\n{INST_PREFIX} {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            
            text = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
                        
            texts.append(text)
            image_file = Image.open(f"{DATASET_DIR}/{example['image']}")
            images.append(image_file)

        # NOTE: only the first example of the batch is passed to the processor here
        batch = self.processor(texts[0], images[0], return_tensors="pt")
    
        labels = batch["input_ids"].clone()
        batch["labels"] = labels
        
        return batch

I'm getting the following error during loss calculation, which makes me believe there is an issue with the labels (they are identical to input_ids).

Traceback (most recent call last):
  File "/root/.../IMMO-Research/src/train/phi3_trainer.py", line 201, in <module>
    trainer.train()
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-vision-128k-instruct/dbcdaaacf52c8e40cf8de6d6ffa6ff6860e5f256/modeling_phi3_v.py", line 1332, in forward
    loss = loss_fct(shift_logits, shift_labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/loss.py", line 1185, in forward
    return F.cross_entropy(input, target, weight=self.weight,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/functional.py", line 3086, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

On further debugging, I see that the processor output (input_ids, and therefore labels) contains many -1 tokens, which is likely what breaks the cross-entropy loss: cross-entropy expects label values in the range 0 to num_classes - 1 (config.vocab_size - 1 here), or the ignore_index (-100).
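
For reference, a rough sketch (not an official recipe) of masking those positions so that cross-entropy ignores them, assuming the negative ids only mark image placeholder positions, would look like this inside the collator:

labels = batch["input_ids"].clone()
# image placeholder positions come through as negative ids (e.g. -1);
# cross-entropy asserts on anything outside [0, vocab_size - 1],
# so replace them with -100, the default ignore_index
labels[labels < 0] = -100
# padding tokens (if the tokenizer defines a pad token) can be masked the same way
if self.processor.tokenizer.pad_token_id is not None:
    labels[labels == self.processor.tokenizer.pad_token_id] = -100
batch["labels"] = labels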

How do I fix this issue? Is there something I'm missing?
On a side note, it would be great if you could provide a fine-tuning script for Phi-3-vision-128k-instruct!

I have been trying the above approach for the past few days. The processor is not optimised to handle batches of images and texts, and even for two examples its output is very large in memory.
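
For illustration only, a minimal batch-size-1 collator sketch that works around this (field names like example['question'], example['answer'] and the dataset_dir argument are placeholders; the -100 masking assumption from above applies):

from PIL import Image

class SingleExampleCollator:
    def __init__(self, processor, dataset_dir):
        self.processor = processor
        self.dataset_dir = dataset_dir              # hypothetical image folder

    def __call__(self, examples):
        # the processor is called once per example, so keep
        # per_device_train_batch_size=1 and use gradient_accumulation_steps
        example = examples[0]
        messages = [
            {"role": "user", "content": f"<|image_1|>\n{example['question']}"},
            {"role": "assistant", "content": example['answer']},
        ]
        prompt = self.processor.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=False)
        image = Image.open(f"{self.dataset_dir}/{example['image']}")
        batch = self.processor(prompt, image, return_tensors="pt")
        labels = batch["input_ids"].clone()
        labels[labels < 0] = -100                   # ignore image placeholder ids
        batch["labels"] = labels
        return batch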

https://huggingface.co/datasets/Magneto/sample/viewer?row=0

@samyak24jain did you fix the error??

Could this be fixed by #16 by @sebbyjp ?

@WilliamSotoM I guess it should

Thanks @bdytx5 ! This is helpful.

@bdytx5 @samyak24jain @WilliamSotoM
I used the code from the wandb.ai blog on dataset preparation and combined it with PEFT LoRA, but I am getting the error below when training with the Trainer.
I have added a link to download the dataset file (mars_dataset.csv); the original dataset is available on Hugging Face:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

7 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1910                 hf_hub_utils.enable_progress_bars()
1911         else:
-> 1912             return inner_training_loop(
1913                 args=args,
1914                 resume_from_checkpoint=resume_from_checkpoint,

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2208 
2209             step = -1
-> 2210             for step, inputs in enumerate(epoch_iterator):
2211                 total_batched_samples += 1
2212 

/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py in __iter__(self)
    452         # We iterate one batch ahead to check when we are at the end
    453         try:
--> 454             current_batch = next(dataloader_iter)
    455         except StopIteration:
    456             yield

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    629                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630                 self._reset()  # type: ignore[call-arg]
--> 631             data = self._next_data()
    632             self._num_yielded += 1
    633             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    673     def _next_data(self):
    674         index = self._next_index()  # may raise StopIteration
--> 675         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676         if self._pin_memory:
    677             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
    52         else:
    53             data = self.dataset[possibly_batched_index]
---> 54         return self.collate_fn(data)

/usr/local/lib/python3.10/dist-packages/transformers/data/data_collator.py in default_data_collator(features, return_tensors)
    90 
    91     if return_tensors == "pt":
---> 92         return torch_default_data_collator(features)
    93     elif return_tensors == "tf":
    94         return tf_default_data_collator(features)

/usr/local/lib/python3.10/dist-packages/transformers/data/data_collator.py in torch_default_data_collator(features)
    152         if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
    153             if isinstance(v, torch.Tensor):
--> 154                 batch[k] = torch.stack([f[k] for f in features])
    155             elif isinstance(v, np.ndarray):
    156                 batch[k] = torch.tensor(np.stack([f[k] for f in features]))

RuntimeError: stack expects each tensor to be equal size, but got [1523, 656, 3] at entry 0 and [583, 571, 3] at entry 1

The following is the code used for the PEFT LoRA based fine-tuning:


from google.colab import drive
drive.mount('/content/drive')

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q accelerate datasets peft bitsandbytes flash_attn


# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
from transformers import TrainingArguments, Trainer
from peft import LoraConfig
import torch
import pandas as pd
import numpy as np

DEVICE = "cuda:0"
# Define model ID
checkpoint = "microsoft/Phi-3-vision-128k-instruct"

# Load processor
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)

# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=["q_proj", "k_proj", "v_proj"],
        use_dora=False,
        init_lora_weights="gaussian"
    )

# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    quantization_config=nf4_config,
)

model.add_adapter(lora_config)
model.enable_adapters()


model_name = checkpoint.split("/")[1]

from torch.utils.data import Dataset, DataLoader, random_split
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer = processor.tokenizer

# Custom Dataset for Mars Images
class MarsProductDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, image_size):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.tokenizer.padding_side = 'left'
        self.max_length = max_length

    def __len__(self):
        return len(self.dataframe)


    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        text = f"<|user|>\n<|image_1|>What is shown in this image?<|end|><|assistant|>\nCaption: {row['short_caption']}<|end|>"
        image_path = row['local_image_path']

        # Tokenize text
        encodings = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length)

        try:
            # Load and transform image
            image = Image.open(image_path).convert("RGB")
            image = self.image_transform_function(image)
        except (FileNotFoundError, IOError):
            # Skip the sample if the image is not found
            return None

        encodings['pixel_values'] = image
        #encodings['price'] = row['full_price']

        return {key: torch.tensor(val) for key, val in encodings.items()}


    def image_transform_function(self, image):
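        # note: this returns the raw image as a NumPy array without any resizing,
        # so arrays from different images can have different shapes (likely the
        # source of the "stack expects each tensor to be equal size" error above)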
        image = np.array(image)
        return image

# Code to prepare the dataset-
# # Function to download an image from a URL and save it locally
# def download_image(image_url, save_path):
#     try:
#         response = requests.get(image_url)
#         response.raise_for_status()  # Check if the request was successful
#         image = Image.open(BytesIO(response.content))
#         image.save(save_path)
#         return True
#     except Exception as e:
#         print(f"Failed to download {image_url}: {e}")
#         return False

# # Load the dataset from Hugging Face
# dataset = load_dataset('Magneto/image_for_mars')

# # Convert the Hugging Face dataset to a Pandas DataFrame
# df = dataset['train'].to_pandas()

# import os
# import pandas as pd
# from tqdm import tqdm

# # Create directories to save the dataset and images
# dataset_dir = '/content/drive/MyDrive/Nasa_Phi3_Vision_Finetuning/data/mars_dataset'
# images_dir = os.path.join(dataset_dir, 'images')
# os.makedirs(images_dir, exist_ok=True)

# # Filter out rows where image download fails
# filtered_rows = []
# for idx, row in tqdm(df.iterrows(), total=len(df), desc="Downloading images"):
#     image_url = row['image_url']
#     image_name = f"{idx}.jpg"
#     image_path = os.path.join(images_dir, image_name)
#     if download_image(image_url, image_path):
#         row['local_image_path'] = image_path
#         filtered_rows.append(row)

# # Create a new DataFrame with the filtered rows
# filtered_df = pd.DataFrame(filtered_rows)

# # Save the updated dataset to disk
# dataset_path = os.path.join(dataset_dir, 'mars_dataset.csv')
# filtered_df.to_csv(dataset_path, index=False)

# print(f"Dataset and images saved to {dataset_dir}")
# Load dataset from disk
# link for the file- "https://drive.google.com/file/d/17NelvfLTy13dHU0CXxV0iDGULqaCXctL/view?usp=share_link"
dataset_path = '/content/drive/MyDrive/Nasa_Phi3_Vision_Finetuning/data/mars_dataset/mars_dataset.csv'
df = pd.read_csv(dataset_path)


# Split dataset into training and validation sets
train_size = int(0.998 * len(df))
print(train_size)
val_size = len(df) - train_size
print(val_size)

train_indices, val_indices = random_split(range(len(df)), [train_size, val_size])
train_indices = train_indices.indices
val_indices = val_indices.indices
train_df = df.iloc[train_indices]
val_df = df.iloc[val_indices]


# Create dataset and dataloader
train_dataset = MarsProductDataset(train_df, tokenizer, max_length=512, image_size=128)
val_dataset = MarsProductDataset(val_df, tokenizer, max_length=512, image_size=128)


train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False)


training_args = TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    dataloader_pin_memory = False,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=5,
    save_total_limit = 3,
    output_dir = f"/content/drive/MyDrive/Nasa_Phi3_Vision_Finetuning/{model_name}-Mars-Rover",
    eval_steps = 10,
    save_steps = 25,
    max_steps = 25,
    evaluation_strategy="steps",
    fp16=True,
    remove_unused_columns=False,
    report_to="none",
    label_names = ["labels"],
    load_best_model_at_end = False,
    optim = "paged_adamw_8bit",
    lr_scheduler_type='linear',
    warmup_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_loader,
    eval_dataset=val_loader, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)



trainer.train()

Can someone help? I have the same error.

@bdytx5 @samyak24jain @WilliamSotoM @digitalesingulary @Magneto

You could use this code for fine-tuning the model!

https://github.com/2U1/Phi3-Vision-ft

It also has options to tune the img_projector and vision_model together, like LLaVA-1.6.

I got stuck with the same "the processor function is not optimised to handle batch images and texts" problem, so I had to prepare the dataset in DataLoader format.
@Magneto Thanks a lot for the code, it was really helpful.

But while trying to reproduce your results, I got this error:

    49                 data = self.dataset.__getitems__(possibly_batched_index)
     50             else:
---> 51                 data = [self.dataset[idx] for idx in possibly_batched_index]
     52         else:
     53             data = self.dataset[possibly_batched_index]

TypeError: 'DataLoader' object is not subscriptable

I've installed all the latest versions:

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q accelerate datasets peft bitsandbytes flash_attn

Can someone please help me understand why the Trainer doesn't accept a DataLoader as train_dataset?

Is this due to version issues?

@EphronM As you can see here, the Trainer class in Hugging Face takes a Dataset as input, not a DataLoader. I think you should pass the Dataset objects instead.

https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer
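
For illustration, a minimal sketch of that change with the code above (pass the Dataset objects directly; the Trainer builds its own DataLoader, and my_collate_fn is just a placeholder name for an optional custom collator):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # the Dataset, not the DataLoader
    eval_dataset=val_dataset,
    # a custom collator can go here if the default one cannot stack the tensors
    # data_collator=my_collate_fn,
)
trainer.train()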

Microsoft org

Thank you all for your interest in the Phi-3 Vision model.
You may want to try the official finetuning recipe https://github.com/microsoft/Phi-3CookBook/blob/main/md/04.Fine-tuning/FineTuning_Vision.md

nguyenbh changed discussion status to closed
