ziyjiang committed
Commit
1710b8a
1 Parent(s): 71df8f9

Update README.md

Files changed (1): README.md (+100, -0)
README.md CHANGED
@@ -13,3 +13,103 @@ A new checkpoint trained using [llava-v1.6-mistral-7b-hf](https://huggingface.co

This repo contains the code and data for [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160). In this paper, we focus on building a unified multimodal embedding model suitable for a wide range of tasks. Our approach is based on transforming an existing, well-trained Vision-Language Model (VLM) into an embedding model. The core idea is to append an [EOS] token at the end of the input sequence, which serves as the representation for the combined multimodal inputs.
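For intuition, here is a minimal, hypothetical sketch of this last-token pooling idea in PyTorch: the hidden state at the final non-padded position of the sequence (e.g. an appended [EOS] token) is taken as the embedding and L2-normalized. The function name, shapes, and padding convention are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: use the hidden state of the last non-padded token
    (e.g. an appended [EOS]) as the embedding of the combined multimodal input.

    hidden_states: (batch, seq_len, dim) from the VLM's final layer
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for (right-side) padding
    """
    # Index of the last real token in each sequence.
    last_idx = attention_mask.sum(dim=1) - 1                            # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    embeddings = hidden_states[batch_idx, last_idx]                     # (batch, dim)
    # L2-normalize so that dot products behave as cosine similarities.
    return F.normalize(embeddings, p=2, dim=-1)

# Toy usage with random tensors standing in for VLM outputs.
hs = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(last_token_pool(hs, mask).shape)  # torch.Size([2, 8])
```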
## GitHub
- [GitHub](https://github.com/TIGER-AI-Lab/VLM2Vec)

## Data

Our model is trained on MMEB-train with contrastive learning, using only in-batch negatives, and evaluated on MMEB-eval. Our results on the 36 evaluation datasets are shown in the Experimental Results section below.
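As a rough illustration of this training objective (not the repository's actual implementation), the standard InfoNCE loss with in-batch negatives can be sketched as follows; the temperature value here is an assumption.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(qry_reps: torch.Tensor,
                              tgt_reps: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Sketch of InfoNCE with in-batch negatives.

    qry_reps, tgt_reps: (batch, dim); row i of tgt_reps is the positive target
    for row i of qry_reps, and every other row in the batch acts as a negative.
    """
    qry = F.normalize(qry_reps, dim=-1)
    tgt = F.normalize(tgt_reps, dim=-1)
    logits = qry @ tgt.T / temperature                  # (batch, batch) similarity matrix
    labels = torch.arange(qry.size(0), device=qry.device)
    return F.cross_entropy(logits, labels)              # match each query to its own target

# Toy check with random embeddings.
loss = in_batch_contrastive_loss(torch.randn(4, 16), torch.randn(4, 16))
print(loss.item())
```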
### Train/Eval Data
- Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
- Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval
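For example, the training data can be pulled from the Hugging Face Hub with the `datasets` library. This is only a hedged sketch: the available subset (config) and split names are not assumed here, so we query them first.

```python
from datasets import get_dataset_config_names, load_dataset

# List the task-specific subsets available in MMEB-train (names not assumed here).
configs = get_dataset_config_names("TIGER-Lab/MMEB-train")
print(configs)

# Load the first subset as an example and inspect its splits and columns.
ds = load_dataset("TIGER-Lab/MMEB-train", configs[0])
print(ds)
```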

## Experimental Results
VLM2Vec-LlaVa-Next outperforms the baselines and the other VLM2Vec variants by a large margin.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64778fb8168cb428e00f69b0/IaKuKe5ps_bvDTf98C0rt.png)

## How to use VLM2Vec-LlaVa-Next

First, clone our GitHub repository:
```bash
git clone https://github.com/TIGER-AI-Lab/VLM2Vec.git
```

Then enter the directory and run the following code:
```python
from src.model import MMEBModel
from src.arguments import ModelArguments
from src.utils import load_processor

import torch
from transformers import HfArgumentParser, AutoProcessor
from PIL import Image
import numpy as np

# Model configuration: checkpoint name, pooling strategy, normalization, and backbone.
model_args = ModelArguments(
    model_name='TIGER-Lab/VLM2Vec-Full',
    pooling='last',
    normalize=True,
    model_backbone='llava')

# Load the model and move it to the GPU in bfloat16.
model = MMEBModel.load(model_args)
model.eval()
model = model.to('cuda', dtype=torch.bfloat16)

processor = load_processor(model_args)

# Image + Text -> Text
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image', [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

string = 'A cat and a dog'
inputs = processor(string)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a dog = tensor([[0.2969]], device='cuda:0', dtype=torch.bfloat16)

string = 'A cat and a tiger'
inputs = processor(string)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a tiger = tensor([[0.2080]], device='cuda:0', dtype=torch.bfloat16)

# Text -> Image
inputs = processor('Find me an everyday image that matches the given caption: A cat and a dog.')
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

string = '<|image_1|> Represent the given image.'
inputs = processor(string, [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## <|image_1|> Represent the given image. = tensor([[0.3105]], device='cuda:0', dtype=torch.bfloat16)

inputs = processor('Find me an everyday image that matches the given caption: A cat and a tiger.')
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

string = '<|image_1|> Represent the given image.'
inputs = processor(string, [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## <|image_1|> Represent the given image. = tensor([[0.2158]], device='cuda:0', dtype=torch.bfloat16)
```
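Building on the snippet above (it reuses `model` and `processor`, and assumes the same `figures/example.jpg`), the following sketch of ours ranks several candidate captions for one image query by stacking their target representations. Because `normalize=True`, dot products act as cosine similarities; this is an illustration, not an official API example.

```python
import torch
from PIL import Image

# Encode one image query (reuses `model` and `processor` from the example above).
qry_inputs = processor('<|image_1|> Represent the given image.', [Image.open('figures/example.jpg')])
qry_inputs = {k: v.to('cuda') for k, v in qry_inputs.items()}
qry_rep = model(qry=qry_inputs)["qry_reps"]              # (1, dim)

# Encode a few candidate captions as targets.
candidates = ['A cat and a dog', 'A cat and a tiger', 'Two bicycles on a street']
tgt_reps = []
for text in candidates:
    tgt_inputs = processor(text)
    tgt_inputs = {k: v.to('cuda') for k, v in tgt_inputs.items()}
    tgt_reps.append(model(tgt=tgt_inputs)["tgt_reps"])
tgt_reps = torch.cat(tgt_reps, dim=0)                    # (num_candidates, dim)

# With normalize=True the representations are unit-length, so dot products
# are cosine similarities; a higher score means a better match.
scores = (qry_rep.float() @ tgt_reps.float().T).squeeze(0)
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f'{score:.4f}  {text}')
```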

## Citation
```
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
```