VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

• 📖 Introduction • 🎉 News • ✨ VisRAG Pipeline • ⚡️ Training

• 📦 Requirements • 🔧 Usage • 📄 License • 📑 Citation • 📧 Contact

📖 Introduction

VisRAG is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing a document to obtain text, the pipeline uses a VLM to embed the document directly as an image, then retrieves it to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced during parsing.

🎉 News

20241104: Released our VisRAG Pipeline on Hugging Face Space.
20241031: Released our VisRAG Pipeline on Colab.
20241015: Released our training data and test data on Hugging Face. Both can be found in the VisRAG Collection on Hugging Face, which is referenced at the beginning of this page.
20241014: Released our Paper on arXiv. Released our Model on Hugging Face. Released our Code on GitHub.

✨ VisRAG Pipeline

VisRAG-Ret

VisRAG-Ret is a document embedding model built on MiniCPM-V 2.0, a vision-language model that integrates SigLIP as the vision encoder and MiniCPM-2B as the language model.
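
To make the retrieval step concrete, here is a minimal sketch of how embeddings from a model like VisRAG-Ret could be used to pick the most relevant pages for a query. It is an illustration only: the vectors are hand-written toy values rather than real model outputs, and `normalize` and `top_k` are hypothetical helper names, not part of this repository.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, page_vecs, k=1):
    """Return the k best (page_index, score) pairs, highest cosine similarity first."""
    q = normalize(query_vec)
    scored = []
    for idx, page in enumerate(page_vecs):
        p = normalize(page)
        score = sum(a * b for a, b in zip(q, p))
        scored.append((score, idx))
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy stand-ins for page-image embeddings and one query embedding (assumed values).
page_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.2, 0.9],
]
query_vec = [0.2, 0.9, 0.2]

hits = top_k(query_vec, page_vecs, k=2)
for idx, score in hits:
    print(f"page {idx}: {score:.3f}")
```

In a full pipeline, the page images behind the top-scoring indices would then be passed to the generator together with the query.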

VisRAG-Gen

In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6 and GPT-4o as the generators. You can actually use any VLM you like!

⚡️ Training

VisRAG-Ret

Our training dataset for VisRAG-Ret contains 362,110 Query-Document (Q-D) pairs. It consists of the train sets of openly available academic datasets (34%) and a synthetic dataset (66%) made up of pages from web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o). It can be found in the VisRAG Collection on Hugging Face, which is referenced at the beginning of this page.

VisRAG-Gen

The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.

📦 Requirements

torch==2.1.2
torchvision==0.16.2
transformers==4.40.2
sentencepiece==0.1.99
decord==0.6.0
Pillow==10.1.0

🔧 Usage

VisRAG-Ret

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
from PIL import Image
import requests
from io import BytesIO

def weighted_mean_pooling(hidden, attention_mask):
    # Position-weighted mean: the i-th non-padding token gets weight i,
    # so later tokens count more; padding tokens get weight 0.
    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)
    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)
    d = attention_mask_.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(text_or_image_list):
    # The batch is dispatched on the type of its first element:
    # strings are treated as queries, anything else as page images.
    if isinstance(text_or_image_list[0], str):
        inputs = {
            "text": text_or_image_list,
            'image': [None] * len(text_or_image_list),
            'tokenizer': tokenizer
        }
    else:
        inputs = {
            "text": [''] * len(text_or_image_list),
            'image': text_or_image_list,
            'tokenizer': tokenizer
        }
    outputs = model(**inputs)
    attention_mask = outputs.attention_mask
    hidden = outputs.last_hidden_state

    # Pool token states into one vector per input, then L2-normalize so that
    # dot products between embeddings are cosine similarities.
    reps = weighted_mean_pooling(hidden, attention_mask)
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

model_name_or_path = "openbmb/VisRAG-Ret"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model.eval()

queries = ["What does a dog look like?"]
INSTRUCTION = "Represent this query for retrieving relevant documents: "
queries = [INSTRUCTION + query for query in queries]

print("Downloading images...")
passages = [
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/cat.jpeg'
    ).content)).convert('RGB'),
    Image.open(BytesIO(requests.get(
        'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/dog.jpg'
    ).content)).convert('RGB')
]
print("Images downloaded.")

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

# Rows are queries, columns are passages; a higher score means a closer match.
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
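
The pooling step in the script above can be hard to read in tensor form, so the following is a plain-Python restatement of the same arithmetic for a single sequence. It is a sketch for illustration: `position_weights` and `weighted_mean_pool` are hypothetical names, and the hidden states are toy values, not model outputs.

```python
from itertools import accumulate

def position_weights(attention_mask):
    """Per-token weights for one sequence: mask * cumulative sum of the mask."""
    return [m * c for m, c in zip(attention_mask, accumulate(attention_mask))]

def weighted_mean_pool(hidden, attention_mask):
    """Pool a list of token vectors into one vector using the position weights."""
    weights = position_weights(attention_mask)
    total = sum(weights)
    dim = len(hidden[0])
    return [
        sum(w * h[j] for w, h in zip(weights, hidden)) / total
        for j in range(dim)
    ]

# Three real tokens followed by one padding token (mask value 0).
mask = [1, 1, 1, 0]
hidden = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [9.0, 9.0],  # padding position: its values must not affect the result
]

weights = position_weights(mask)
pooled = weighted_mean_pool(hidden, mask)
print(weights)
print(pooled)
```

Here the weights come out as 1, 2, 3, 0, so the third token contributes three times as much as the first and the padding position is ignored.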

📄 License

The code in this repo is released under the Apache-2.0 License.
The usage of VisRAG-Ret model weights must strictly follow MiniCPM Model License.md.
The models and weights of VisRAG-Ret are completely free for academic research. After filling out a "questionnaire" for registration, VisRAG-Ret weights are also available for free commercial use.

📑 Citation

@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
      title={VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents}, 
      author={Shi Yu and Chaoyue Tang and Bokai Xu and Junbo Cui and Junhao Ran and Yukun Yan and Zhenghao Liu and Shuo Wang and Xu Han and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2410.10594},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.10594}, 
}

📧 Contact

Shi Yu: yus21@mails.tsinghua.edu.cn
Chaoyue Tang: tcy006@gmail.com
