---
library_name: transformers
tags:
- art
datasets:
- ColumbiaNLP/V-FLUTE
language:
- en
metrics:
- f1
---

# Model Card for Model ID

This is the checkpoint for the model from the paper [V-FLUTE: Visual Figurative Language Understanding with Textual Explanations](https://arxiv.org/abs/2405.01474). Specifically, it is the best-performing model fine-tuned on a combination of the V-FLUTE and e-ViL (e-SNLI-VE) datasets, with early stopping based on the V-FLUTE validation set.

## Model Details

### Model Description

See more on LLaVA 1.5 here: https://github.com/haotian-liu/LLaVA

V-FLUTE dataset: https://huggingface.co/datasets/ColumbiaNLP/V-FLUTE

V-FLUTE paper: https://arxiv.org/abs/2405.01474

Citation:

```
@misc{saakyan2024understandingfigurativemeaningexplainable,
      title={Understanding Figurative Meaning through Explainable Visual Entailment},
      author={Arkadiy Saakyan and Shreyas Kulkarni and Tuhin Chakrabarty and Smaranda Muresan},
      year={2024},
      eprint={2405.01474},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.01474},
}
```

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Arkadiy Saakyan (ColumbiaNLP)
- **Model type:** Vision-Language Model
- **Language(s) (NLP):** English
- **Finetuned from model [optional]:** LLaVA-v1.5

### Model Sources [optional]

- **Repository:** https://github.com/asaakyan/V-FLUTE
- **Paper [optional]:** https://arxiv.org/abs/2405.01474

## Uses

The model's intended use is limited to interpreting multimodal figurative inputs such as metaphors, similes, idioms, sarcasm, and humor.

### Out-of-Scope Use

The model may not work well for other general instruction-following use cases.

## Bias, Risks, and Limitations

The V-FLUTE dataset or its source datasets may contain biases, especially the subsets that reflect user-generated content distributions (MemeCap and MuSE).

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Install LLaVA as described here: https://github.com/asaakyan/LLaVA/tree/6f595efcf2699884f18957ee603986cebfaa9df7

```
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava_mod import eval_model

# Base model and path to this LoRA checkpoint (local directory or Hub repo id)
model_base = "llava-v1.5-7b"
model_path = "llava-v1.5-7b-evil-vflue-v2-lora"
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,
    model_name=model_name,
    load_4bit=False
)

prompt = """Does the illustration affirm or contest the claim "Feeling motivated and energetic after only cleaning a room minimally."?
Provide your argument and choose a label: entailment or contradiction."""
# image_path should point to the directory containing your input image
image_file = f"{image_path}/27.png"

infer_args = type('Args', (), {
    "model_name": model_name,
    "model": model,
    "tokenizer": tokenizer,
    "image_processor": image_processor,
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 3,
    "max_new_tokens": 512
})()

output = eval_model(infer_args)
print(output)
```
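One way to score the model with the F1 metric listed in the metadata is to loop over V-FLUTE examples and parse the predicted label out of the generated explanation. The sketch below is a rough outline rather than the paper's evaluation protocol: the split name and the `claim`, `label`, and `image` column names are assumptions about the dataset schema, and the snippet reuses the `model`, `tokenizer`, `image_processor`, and `model_name` objects loaded above.

```
# Rough scoring sketch (not the paper's exact evaluation protocol).
# Assumed: the "test" split, the "claim"/"label"/"image" columns, and that
# eval_model accepts the image object stored in the dataset.
from datasets import load_dataset
from sklearn.metrics import f1_score

ds = load_dataset("ColumbiaNLP/V-FLUTE", split="test")

gold, pred = [], []
for example in ds:
    prompt = (
        f'Does the illustration affirm or contest the claim "{example["claim"]}"?\n'
        "Provide your argument and choose a label: entailment or contradiction."
    )
    infer_args = type("Args", (), {
        "model_name": model_name,
        "model": model,
        "tokenizer": tokenizer,
        "image_processor": image_processor,
        "query": prompt,
        "conv_mode": None,
        "image_file": example["image"],
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 3,
        "max_new_tokens": 512,
    })()
    output = eval_model(infer_args)
    # Naive label extraction from the free-text explanation
    pred.append("contradiction" if "contradiction" in output.lower() else "entailment")
    gold.append(example["label"])

print("Label F1 (macro):", f1_score(gold, pred, average="macro"))
```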
## Training Details

See [here](https://github.com/asaakyan/LLaVA/tree/6f595efcf2699884f18957ee603986cebfaa9df7/scripts/vflute) or [here](https://github.com/asaakyan/V-FLUTE).

### Training Data

https://huggingface.co/datasets/ColumbiaNLP/V-FLUTE

## Model Card Contact

a.saakyan@cs.columbia.edu