Does MiniCPM support multi-image input?

#2
by huanghui1997 - opened

I want to process 4-6 images each time, what is the best practice?

OpenBMB org

Here is a common practice: input the messages in the order 'system prompt, image 1 ... image n, question'.

That said, any kind of sequence can be fed to the model in the following way, so you can experiment to find what works best.

    from PIL import Image

    msgs = []
    system_prompt = 'Answer in detail.'
    prompt = 'Caption these two images'
    tgt_path = ['path/to/image1', 'path/to/image2']
    if system_prompt:
        msgs.append(dict(type='text', value=system_prompt))
    if isinstance(tgt_path, list):
        msgs.extend([dict(type='image', value=p) for p in tgt_path])
    else:
        # append rather than overwrite, so a system prompt already in msgs is kept
        msgs.append(dict(type='image', value=tgt_path))
    msgs.append(dict(type='text', value=prompt))

    # Flatten into the chat format: text stays as strings,
    # images are loaded as PIL objects.
    content = []
    for x in msgs:
        if x['type'] == 'text':
            content.append(x['value'])
        elif x['type'] == 'image':
            image = Image.open(x['value']).convert('RGB')
            content.append(image)
    msgs = [{'role': 'user', 'content': content}]

    res = model.chat(
        msgs=msgs,
        context=None,
        image=None,  # images are passed inside msgs, so this stays None
        tokenizer=tokenizer,
        **default_kwargs  # your generation kwargs
    )

If you have more questions, feel free to continue the discussion.

@Cuiunbo Hi, I wrote my code following the official script, but when I passed in the image I got an error: "Segmentation fault (core dumped)".

OpenBMB org

Hi @vjunyang , could you help us reproduce the error by providing more information, including your environment and code?

@Cuiunbo
My environment: python==3.8, sentencepiece==0.1.99, torch==2.2.0, Pillow==10.1.0, torchvision==0.16.2, transformers==4.40.2, CUDA Version: 12.2


Solved. I added `torch.backends.cudnn.enabled = False` to my code.

OpenBMB org

Nice! Glad you got it working, and feel free to ask if you have more questions!
We'll get back to you as soon as we can.

Cuiunbo changed discussion status to closed

@Cuiunbo It looks like the model works with 2 images maximum. I've tried it with 2 images and it worked perfectly fine, but with more than 2 images it fails with:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 20 for tensor number 1 in the list.

OpenBMB org

@ma-korotkov Thanks for providing the implementation. If you don't modify the model file, the max context length for llama3 is 2048, but I remember llama3 supports 4096, so you can try increasing it.
Also, image resolution affects how many images you can fit into the model, so you can also try resizing them beforehand.
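The resizing advice above can be sketched with a small helper (an illustrative sketch using Pillow; the 448-pixel cap is an arbitrary choice for the example, not an official MiniCPM value):

```python
# Sketch: shrink large images before passing them to the model, so that more
# of them fit in the context window. max_side=448 is an illustrative default.
from PIL import Image

def load_resized(path, max_side=448):
    img = Image.open(path).convert('RGB')
    scale = max_side / max(img.size)
    if scale < 1:  # only shrink, never upscale
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    return img
```

Each path in `tgt_path` could then be loaded with `load_resized(p)` instead of a full-resolution `Image.open(p)`.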

@Cuiunbo Thanks a lot! You are right, I was using quite big images, so resizing them helped them fit into the context length.

OpenBMB org


@ma-korotkov Nice! I hope to hear your feedback on our video capabilities. Since we didn't train on multi-image data, it's amazing that it can handle some simple video tasks now!

Cuiunbo changed discussion status to open

May I ask whether you added training data with interleaved images and text? I found that the model did not learn well when given multiple image-text pairs as context.

@Cuiunbo Can you specify how to increase the context length in this model. It would be really helpful.

OpenBMB org

@maniache
Hello, we have not added interleaved image-text data.

OpenBMB org

@Rasi1610 Hi, you may need to edit the tokenizer config to raise the context length.
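For example, the tokenizer-side cap usually lives in `model_max_length` inside the checkpoint's `tokenizer_config.json` (an assumption about where this particular checkpoint sets it; the model's own positional limit in its model config may also need changing):

```json
{
  "model_max_length": 4096
}
```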

Hi

What is the preferred way to continue the chat about a previously loaded image without reloading it?

Is it possible to use the model as a pure language model, i.e. without any image?


image=None? So the image is passed as None?
