---
datasets:
- MMInstruction/VLFeedback
---
# Model Card for Silkie

<!-- Provide a quick summary of what the model is/does. -->

Silkie is a visual language model trained with preference distillation on GPT-4V-annotated AI feedback. It is a fine-tuned version of [Qwen/Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat), trained on our [MMInstruction/VLFeedback](https://huggingface.co/datasets/MMInstruction/VLFeedback) dataset with direct preference optimization (DPO). Compared with the original model, Silkie achieves 6.9% and 9.5% relative improvements on the perception and cognition subsets of the MME benchmark, respectively, and sets a new state-of-the-art score of 3.02 on MMHal-Bench for hallucination evaluation. Please refer to our [project page](https://vlf-silkie.github.io/) for more details.
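
For background, DPO optimizes the policy directly on preference pairs without training a separate reward model. The standard objective, restated here from the DPO literature for convenience (the notation below is generic, not specific to this card), is

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses in a preference pair, $\pi_{\mathrm{ref}}$ is the frozen reference model (here, the Qwen-VL-Chat base), and $\beta$ controls the strength of the implicit KL constraint.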

## Model Sources

<!-- Provide the basic links for the model. -->

- **Project page:** https://vlf-silkie.github.io/
- **Dataset:** https://huggingface.co/datasets/MMInstruction/VLFeedback
- **Paper:** https://arxiv.org/abs/2312.10665
- **Repository:** https://github.com/vlf-silkie/VLFeedback

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Silkie is intended for research purposes, particularly for alignment research in multimodal models.

## How to Get Started

Below is a simple Python code snippet to get started with the model. For installation instructions, please refer to our [GitHub repository](https://github.com/vlf-silkie/VLFeedback?tab=readme-ov-file#installation).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model; trust_remote_code is required because
# Qwen-VL ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(
    "MMInstruction/Silkie", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "MMInstruction/Silkie", device_map="cuda", trust_remote_code=True
).eval()

# Build a multimodal query from an image (URL or local path) and a text prompt.
query = tokenizer.from_list_format(
    [
        {"image": "https://farm8.staticflickr.com/137/383965780_db4815011c_o.jpg"},
        {"text": "Which wooden stool has a vase with red flower on it?"},
    ]
)

# history=None starts a fresh conversation; the returned history can be
# passed back in for follow-up turns.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
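
Because Silkie inherits Qwen-VL-Chat's multi-turn interface, the `history` returned by `chat` can be passed back in for follow-up questions about the same image. A minimal sketch (the follow-up prompt is illustrative):

```python
# Ask a follow-up question in the same conversation, reusing the history
# returned by the first call.
response, history = model.chat(
    tokenizer, query="Describe the flowers in more detail.", history=history
)
print(response)
```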

## Citation

```bibtex
@article{2023vlfeedback,
  author  = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title   = {Silkie: Preference Distillation for Large Visual Language Models},
  journal = {arXiv preprint arXiv:2312.10665},
  year    = {2023}
}
```