---
language:
- en
tags:
- information retrieval
- embedding model
- visual information retrieval
metrics:
- recall
pipeline_tag: feature-extraction
license: apache-2.0
---
# OCR-free Visual Document Embedding Model as Your Personal Librarian
The model takes only images as document-side inputs and produces vectors representing document pages. `minicpm-visual-embedding-v0` is trained on over 200k query-visual document pairs, including textual documents, visual documents, arXiv figures, industry documents, textbooks, ebooks, etc. Its performance is on par with our ablation text embedding model on text-oriented documents, and it has an advantage on visually-intensive documents.
![Memex Architecture](images/memex.png)
# News
- 2024-06-27: We released our first visual embedding model checkpoint, minicpm-visual-embedding-v0, on [Hugging Face](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0).
- 2024-05-08: We [open-sourced](https://github.com/RhapsodyAILab/minicpm-visual-embedding-v0) our training code (full-parameter tuning with GradCache and DeepSpeed, supporting large batch sizes across multiple GPUs with ZeRO stage 1) and evaluation code.
# Get started
Install the following pinned dependencies with pip:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
numpy==1.26.0
```
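Because the versions above are pinned exactly, it can help to confirm that the active environment matches them before loading the model. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` is hypothetical and not part of this repository, and stripping local build tags such as `+cu121` before comparing is an assumption about how you want to treat CUDA-specific wheels.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    'Pillow': '10.1.0',
    'timm': '0.9.10',
    'torch': '2.1.2',
    'torchvision': '0.16.2',
    'transformers': '4.36.0',
    'sentencepiece': '0.1.99',
    'numpy': '1.26.0',
}


def find_mismatches(pinned, lookup=version):
    """Return {package: (expected, installed_or_None)} for every unmet pin."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Assumption: ignore local build tags such as '+cu121' when comparing.
        if installed is None or installed.split('+')[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == '__main__':
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f'{name}: expected {expected}, found {installed}')
```

An empty result means every pinned package is present at the listed version.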
First, clone this Hugging Face repo with git, or download it with `huggingface-cli`.
```bash
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
```
or
```bash
huggingface-cli download --resume-download RhapsodyAI/minicpm-visual-embedding-v0 --local-dir minicpm-visual-embedding-v0 --local-dir-use-symlinks False
```
```python
from transformers import AutoModel
from transformers import AutoTokenizer
from PIL import Image
import torch
device = 'cuda:0'
# Load model, be sure to substitute `model_path` by your model path
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)
# Load image to PIL.Image object
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')
# User query
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query
# Embed image documents
with torch.no_grad():
    p_reps = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer).reps
# Embed text queries
with torch.no_grad():
    q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps # [B, d]
# Calculate similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[-0.0112, 0.3316, 0.2376]], device='cuda:0')
```
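The printed score matrix has one row per query and one column per page, so retrieval amounts to sorting each row in descending order. A minimal sketch in plain Python, assuming the scores have already been moved off the GPU (for example with `scores.tolist()`); the helper `rank_pages` is hypothetical and not part of the model's API:

```python
def rank_pages(score_row, top_k=None):
    """Return page indices sorted from most to least similar to the query."""
    order = sorted(range(len(score_row)), key=lambda i: score_row[i], reverse=True)
    return order if top_k is None else order[:top_k]


# Scores copied from the example output above (one query, three pages).
scores_list = [[-0.0112, 0.3316, 0.2376]]
page_names = ['memex.png', 'us2020.png', 'hard_negative.png']

for row in scores_list:
    ranking = rank_pages(row)
    print([page_names[i] for i in ranking])
# ['us2020.png', 'hard_negative.png', 'memex.png']
```

For the example query, the page about the 2020 election ranks first, the hard negative second, and the unrelated architecture diagram last.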
# Limitations
- This checkpoint is an alpha version and may not perform well on your tasks. If you encounter bad cases, please create an issue to let us know. Many thanks!
- The modeling script `modeling_minicpmv` on `huggingface` is not yet standard, and the inference code could be further improved.
- Inference is slow because the vision encoder uses `timm`, which does not yet support `flash-attn`.
# Citation
If you find our work useful, please consider citing us:
```bibtex
@misc{RhapsodyEmbedding2024,
author = {RhapsodyAI},
title = {OCR-free Visual Document Embedding Model as Your Personal Librarian},
year = {2024},
howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
note = {Accessed: 2024-06-28}
}
```