
Mixing text-only data into fine-tuning

#68
by bilibraker - opened

I would like to add some text-only data into my fine-tuning dataset (which has images).
How can I mix my text-only data with the regular image-text data?

I know that Idefics2 can take text-only data as an input, but I want to create a mix on the batch level.

I'm currently using the DataCollator below to process the usual image-text data:

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = self.processor.tokenizer.additional_special_tokens_ids[
            self.processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["images"][0]
            if image is None:
                # text-only samples are skipped here
                continue
            for example_text in example["texts"]:
                question = example_text["user"]
                answer = example_text["assistant"]
                messages = [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Answer briefly."},
                            {"type": "image"},
                            {"type": "text", "text": question}
                        ]
                    },
                    {
                        "role": "assistant",
                        "content": [
                            {"type": "text", "text": answer}
                        ]
                    }
                ]
                text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
                # added for the base model
                text = text.replace("<end_of_utterance>", "")
                texts.append(text.strip())
                images.append([image])

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
#        labels[labels == self.processor.tokenizer.pad_token_id] = -100
#        labels[labels == self.image_token_id] = -100
        batch["labels"] = labels

        return batch

I thought about adding None entries to the images list, but that raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[101], line 2
      1 data_collator = MyDataCollatorTheCauldron(processor)
----> 2 collated_text = data_collator.__call__(sumjpn_data_50k[:10])

Cell In[100], line 67
     62         images.append([image])
     63 #if image is None:
     64 #batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
     65 #    batch = self.processor(text=texts, return_tensors="pt", padding=True)
     66 #else:
---> 67 batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
     69 labels = batch["input_ids"].clone()
     70 labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in Idefics2Processor.__call__(self, text, images, image_seq_len, padding, truncation, max_length, is_split_into_words, add_special_tokens, return_tensors)
    225     raise ValueError(
    226         f"The number of images in the text {n_images_in_text} and images  {n_images_in_images} should be the same."
    227     )
    229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
    231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
    232 inputs.update(image_inputs)

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in <listcomp>(.0)
    225     raise ValueError(
    226         f"The number of images in the text {n_images_in_text} and images  {n_images_in_images} should be the same."
    227     )
    229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
    231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
    232 inputs.update(image_inputs)

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in <listcomp>(.0)
    225     raise ValueError(
    226         f"The number of images in the text {n_images_in_text} and images  {n_images_in_images} should be the same."
    227     )
    229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
    231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
    232 inputs.update(image_inputs)

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/image_utils.py:332, in load_image(image, timeout)
    330     image = image
    331 else:
--> 332     raise ValueError(
    333         "Incorrect format used for image. Should be an url linking to an image, a base64 string, a local path, or a PIL image."
    334     )
    335 image = PIL.ImageOps.exif_transpose(image)
    336 image = image.convert("RGB")

ValueError: Incorrect format used for image. Should be an url linking to an image, a base64 string, a local path, or a PIL image.

Do you have any ideas?

hi @bilibraker
can you say more about what you mean by "adding text-only data"?
what you are showing is indeed fine-tuning in the dialogue format, so how about adding the text in the user input?
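For reference, a text-only sample in that same chat format would simply omit the {"type": "image"} entry from the user content (a sketch; the question and answer strings are placeholders):

```python
question = "Question: How many actions are depicted in the diagram?"  # placeholder
answer = "Answer: D"  # placeholder

# same dialogue format as the image-text samples, just without {"type": "image"}
text_only_messages = [
    {"role": "user", "content": [{"type": "text", "text": question}]},
    {"role": "assistant", "content": [{"type": "text", "text": answer}]},
]
```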

Let's say we have a batch of 5 samples, 3 image-text and 2 text-only, with the following schemas:

image-text data schema (as in The Cauldron)

{
    "images": [PIL.Image],
    "texts": [
        {
            "user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
            "assistant": "Answer: D",
            "source": "TQA"
        }
    ]
}

text-only data schema (the only difference is the "images" key)

{
    "images": None,
    "texts": [
        {
            "user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
            "assistant": "Answer: D",
            "source": "TQA"
        }
    ]
}

I would like to feed this mixed batch of text-only and image-text data to a DataCollator that can process both types of data samples.
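One way I could imagine sketching the batch-level mix: run the processor separately on the image-text group and the text-only group, then merge the resulting tensors, giving the text-only rows zero-filled pixel tensors so every row has a "pixel_values" entry. The function name, tensor keys, and the zero-filling trick are my assumptions, not anything from the Idefics2 codebase, and whether all-zero pixel rows are safe depends on how your model version handles rows without image tokens:

```python
import torch

def merge_mixed_batch(img_batch, txt_batch, pad_token_id):
    # Pad both sub-batches to a common sequence length, then concatenate.
    max_len = max(img_batch["input_ids"].shape[1], txt_batch["input_ids"].shape[1])

    def pad_right(t, value):
        return torch.nn.functional.pad(t, (0, max_len - t.shape[1]), value=value)

    merged = {
        "input_ids": torch.cat(
            [pad_right(img_batch["input_ids"], pad_token_id),
             pad_right(txt_batch["input_ids"], pad_token_id)]
        ),
        "attention_mask": torch.cat(
            [pad_right(img_batch["attention_mask"], 0),
             pad_right(txt_batch["attention_mask"], 0)]
        ),
    }
    # Zero-filled pixel tensors for the text-only rows, matching the shape of
    # the real ones, so the merged batch is rectangular.
    pv = img_batch["pixel_values"]
    n_txt = txt_batch["input_ids"].shape[0]
    merged["pixel_values"] = torch.cat(
        [pv, torch.zeros((n_txt,) + pv.shape[1:], dtype=pv.dtype)]
    )
    return merged
```

The text-only rows end up at the bottom of the batch, so you may also want to shuffle row order afterwards if that matters for your training loop.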

@VictorSanh I temporarily solved the issue by adding empty images to the text-only instances, but I'm still curious about a more robust solution.
Also, how did you solve this when training Idefics2? Its training data also contains both text-only and image-text data.
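The blank-image workaround mentioned above could look roughly like this (the helper name and the 32x32 size are made up; the idea is just to give every text-only sample a dummy PIL image, and the prompt then also needs one "<image>" token so the processor's image counts match):

```python
from PIL import Image

def add_placeholder_image(example):
    # hypothetical helper: give text-only samples a small blank RGB image so
    # the processor sees exactly one image per sample
    if example["images"] is None:
        example = dict(example, images=[Image.new("RGB", (32, 32))])
    return example
```

The cost is that the model still spends vision-encoder compute on the blank images, which is why processing the two groups separately may be preferable at scale.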
