---
license: mit
language:
- en
library_name: transformers
---

# Model Card for MMICL

## Temporary Demo for MMICL

The [Playground for MMICL-FLANT5XXL](https://ddb87ac77b2611b779.gradio.live/) supports multi-image input as well as video input.

## Model Details

**MMICL (Multi-Modal In-Context Learning)** is a multimodal vision-language model that incorporates BLIP-2/InstructBLIP. It can analyze and understand multiple images and follow instructions.

### Model Description

MMICL outperforms vision-language models of the same size and performs exceptionally well on complex visual reasoning datasets. As of 21 August 2023, it achieves **state-of-the-art** performance on both multimodal task leaderboards and a wide range of vision-language tasks. It also shows new capabilities in video understanding and multimodal in-context learning (M-ICL).

+ **Referring to and reasoning over multiple images**
+ **Manually constructed in-context instruction tuning dataset**
+ As of 21 August 2023, **1st on [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), 1st on [MMBench](https://opencompass.org.cn/leaderboard-multimodal)**
+ Visual Encoder: ViT-L from CLIP / ViT-G/14 from EVA-CLIP
+ Pre-trained LLM: FlanT5-XL / FlanT5-XXL / Vicuna-7B / Vicuna-13B

- **Developed by:** [More Information Needed]
- **License:** MIT
- **Finetuned from model:** [instructblip-flan-t5-xxl](https://huggingface.co/Salesforce/instructblip-flan-t5-xxl)
- **Repository:** [MMICL](https://github.com/HaozheZhao/MIC)

## How to Get Started with the Model

```
# For T5 based model
# The `model` package below comes from the MMICL repository.
from model.instructblip import InstructBlipConfig, InstructBlipModel, InstructBlipPreTrainedModel, InstructBlipForConditionalGeneration, InstructBlipProcessor
import datasets
import json
import transformers
from PIL import Image
import torch
from model.blip2 import Blip2Processor, Blip2ForConditionalGeneration
from model.blip2 import Blip2Config

model_type = "instructblip"
model_ckpt = "BleachNick/MMICL-Instructblip-T5-xxl"

# `config` was undefined in the original snippet; loading it from the
# checkpoint is an assumption.
if 'blip2' in model_type:
    config = Blip2Config.from_pretrained(model_ckpt)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to('cuda:0', dtype=torch.bfloat16)
elif 'instructblip' in model_type:
    config = InstructBlipConfig.from_pretrained(model_ckpt)
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_ckpt,
        config=config).to('cuda:0', dtype=torch.bfloat16)

# Special tokens: the visual placeholder "图" plus one tag per image slot.
# The tag text was lost in extraction; "<image{i}>" is an assumption.
sp = ["图"] + [f"<image{i}>" for i in range(20)]

processor = InstructBlipProcessor.from_pretrained(
    model_ckpt
)
# processor = Blip2Processor.from_pretrained(
#     model_ckpt
# )

sp = sp + processor.tokenizer.additional_special_tokens[len(sp):]
processor.tokenizer.add_special_tokens({'additional_special_tokens': sp})

# `images` was undefined in the original snippet. These paths are
# hypothetical placeholders; replace them with your own three images.
image_paths = ["image0.png", "image1.png", "image2.png"]
images = [Image.open(path) for path in image_paths]

# Two solved in-context examples followed by an open query for image 2.
prompt = ['Use the image 0: <image0>图,image 1: <image1>图 and image 2: <image2>图 as a visual aid to help you calculate the equation accurately. image 0 is 2+1=3.\nimage 1 is 5+6=11.\nimage 2 is']
prompt = " ".join(prompt)

inputs = processor(images=images, text=prompt, return_tensors="pt")

inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
# One mask entry per supplied image, in a single batch row.
inputs['img_mask'] = torch.tensor([[1 for i in range(len(images))]])
# Add a leading batch dimension to the stacked images.
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)

inputs = inputs.to('cuda:0')
outputs = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    img_mask=inputs['img_mask']
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
```

#### Training Hyperparameters

- **Training regime:** [fp32, bf16 mixed precision, bf16 non-mixed precision]
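
Returning to the getting-started snippet: its prompt follows a regular in-context pattern, where every image is referenced by index, each demonstration image gets a solved statement, and the final image is left open for the model to complete. The sketch below shows one way to assemble such a prompt and the matching image mask in plain Python. The helper names (`image_tag`, `build_prompt`, `build_img_mask`) are hypothetical and not part of the MMICL repository, and the `<image{i}>` tag format is an assumption.

```python
from typing import List

IMAGE_PLACEHOLDER = "图"  # visual placeholder token used in the snippet


def image_tag(i: int) -> str:
    # ASSUMPTION: per-image tags look like "<image0>", "<image1>", ...
    return f"<image{i}>"


def build_prompt(task: str, solved: List[str]) -> str:
    """Build a prompt with len(solved) demonstrations plus one open query image."""
    n_images = len(solved) + 1
    refs = [f"image {i}: {image_tag(i)}{IMAGE_PLACEHOLDER}" for i in range(n_images)]
    # Join the references as "a, b and c" so the header reads naturally.
    if n_images > 1:
        header = ", ".join(refs[:-1]) + " and " + refs[-1]
    else:
        header = refs[0]
    demos = [f"image {i} is {answer}." for i, answer in enumerate(solved)]
    query = f"image {n_images - 1} is"  # left open for the model to complete
    return f"Use the {header} {task} " + "\n".join(demos + [query])


def build_img_mask(n_images: int) -> List[List[int]]:
    # Mirrors the snippet: a single batch row with a 1 for each supplied image.
    return [[1] * n_images]


prompt = build_prompt(
    "as a visual aid to help you calculate the equation accurately.",
    ["2+1=3", "5+6=11"],
)
print(prompt)
print(build_img_mask(3))
```

The nested list returned by `build_img_mask` can be wrapped in `torch.tensor(...)` before it is passed to `model.generate`, as the snippet does. Keeping the prompt construction in one function makes it easier to vary the number of demonstrations without editing the string by hand.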