
Mixing text-only data into fine-tuning

#68
by bilibraker - opened

I would like to add some text-only data into my fine-tuning dataset (which has images).
How can I mix my text-only data with the regular image-text data?

I know that Idefics2 can take text-only data as an input, but I want to create a mix on the batch level.

I'm currently using the DataCollator below to process the usual image-text data:

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = self.processor.tokenizer.additional_special_tokens_ids[
            self.processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["images"][0]
            if image is None:
                # text-only samples are skipped here
                continue
            for example_text in example["texts"]:
                question = example_text["user"]
                answer = example_text["assistant"]
                messages = [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Answer briefly."},
                            {"type": "image"},
                            {"type": "text", "text": question}
                        ]
                    },
                    {
                        "role": "assistant",
                        "content": [
                            {"type": "text", "text": answer}
                        ]
                    }
                ]
                text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
                # added for the base model
                text = text.replace("<end_of_utterance>", "")
                texts.append(text.strip())
                images.append([image])

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
#        labels[labels == self.processor.tokenizer.pad_token_id] = -100
#        labels[labels == self.image_token_id] = -100
        batch["labels"] = labels

        return batch

I thought about adding None entries to the images list, but that raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[101], line 2
      1 data_collator = MyDataCollatorTheCauldron(processor)
----> 2 collated_text = data_collator.__call__(sumjpn_data_50k[:10])

Cell In[100], line 67
     62         images.append([image])
     63 #if image is None:
     64 #batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
     65 #    batch = self.processor(text=texts, return_tensors="pt", padding=True)
     66 #else:
---> 67 batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
     69 labels = batch["input_ids"].clone()
     70 labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in Idefics2Processor.__call__(self, text, images, image_seq_len, padding, truncation, max_length, is_split_into_words, add_special_tokens, return_tensors)
    225     raise ValueError(
    226         f"The number of images in the text {n_images_in_text} and images  {n_images_in_images} should be the same."
    227     )
    229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
    231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
    232 inputs.update(image_inputs)

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in <listcomp>(.0)
    225     raise ValueError(
    226         f"The number of images in the text {n_images_in_text} and images  {n_images_in_images} should be the same."
    227     )
    229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
    231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
    232 inputs.update(image_inputs)

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in <listcomp>(.0)
    225     raise ValueError(
    226         f"The number of images in the text {n_images_in_text} and images  {n_images_in_images} should be the same."
    227     )
    229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
    231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
    232 inputs.update(image_inputs)

File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/image_utils.py:332, in load_image(image, timeout)
    330     image = image
    331 else:
--> 332     raise ValueError(
    333         "Incorrect format used for image. Should be an url linking to an image, a base64 string, a local path, or a PIL image."
    334     )
    335 image = PIL.ImageOps.exif_transpose(image)
    336 image = image.convert("RGB")

ValueError: Incorrect format used for image. Should be an url linking to an image, a base64 string, a local path, or a PIL image.

Do you have any ideas?

hi @bilibraker
can you say more about what you mean by "adding text-only data"?
what you are showing is indeed fine-tuning in the dialogue format, so how about adding the text in the user input?
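For reference, a text-only sample in that same chat format would simply omit the {"type": "image"} entry from the user content (a sketch; the question and answer strings are placeholders):

```python
question = "Question: How many actions are depicted in the diagram?"  # placeholder
answer = "Answer: D"  # placeholder

# same dialogue format as the image-text samples, just without {"type": "image"}
text_only_messages = [
    {"role": "user", "content": [{"type": "text", "text": question}]},
    {"role": "assistant", "content": [{"type": "text", "text": answer}]},
]
```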

Let's say we have a batch of 5 samples, 3 image-text and 2 text-only, with the following schemas:

image-text data schema (as in The Cauldron)

{
    "images": [PIL.Image],
    "texts": [
        {
            "user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
            "assistant": "Answer: D",
            "source": "TQA"
        }
    ]
}

text-only data schema (the only difference is the "images" key)

{
    "images": None,
    "texts": [
        {
            "user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
            "assistant": "Answer: D",
            "source": "TQA"
        }
    ]
}

I would like to feed this mixed batch of text-only and image-text data to a DataCollator that can process both types of data samples.
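One way I could imagine sketching the batch-level mix: run the processor separately on the image-text group and the text-only group, then merge the resulting tensors, giving the text-only rows zero-filled pixel tensors so every row has a "pixel_values" entry. The function name, tensor keys, and the zero-filling trick are my assumptions, not anything from the Idefics2 codebase, and whether all-zero pixel rows are safe depends on how your model version handles rows without image tokens:

```python
import torch

def merge_mixed_batch(img_batch, txt_batch, pad_token_id):
    # Pad both sub-batches to a common sequence length, then concatenate.
    max_len = max(img_batch["input_ids"].shape[1], txt_batch["input_ids"].shape[1])

    def pad_right(t, value):
        return torch.nn.functional.pad(t, (0, max_len - t.shape[1]), value=value)

    merged = {
        "input_ids": torch.cat(
            [pad_right(img_batch["input_ids"], pad_token_id),
             pad_right(txt_batch["input_ids"], pad_token_id)]
        ),
        "attention_mask": torch.cat(
            [pad_right(img_batch["attention_mask"], 0),
             pad_right(txt_batch["attention_mask"], 0)]
        ),
    }
    # Zero-filled pixel tensors for the text-only rows, matching the shape of
    # the real ones, so the merged batch is rectangular.
    pv = img_batch["pixel_values"]
    n_txt = txt_batch["input_ids"].shape[0]
    merged["pixel_values"] = torch.cat(
        [pv, torch.zeros((n_txt,) + pv.shape[1:], dtype=pv.dtype)]
    )
    return merged
```

The text-only rows end up at the bottom of the batch, so you may also want to shuffle row order afterwards if that matters for your training loop.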

@VictorSanh I temporarily solved the issue by adding empty images to the text-only instances, but I'm still curious about a more robust solution.
Also, how did you solve this when training Idefics2? Its training data also contains both text-only and image-text data.
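The blank-image workaround mentioned above could look roughly like this (the helper name and the 32x32 size are made up; the idea is just to give every text-only sample a dummy PIL image, and the prompt then also needs one "<image>" token so the processor's image counts match):

```python
from PIL import Image

def add_placeholder_image(example):
    # hypothetical helper: give text-only samples a small blank RGB image so
    # the processor sees exactly one image per sample
    if example["images"] is None:
        example = dict(example, images=[Image.new("RGB", (32, 32))])
    return example
```

The cost is that the model still spends vision-encoder compute on the blank images, which is why processing the two groups separately may be preferable at scale.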
