---
license: mit
---

# [Unifying Vision, Text, and Layout for Universal Document Processing (CVPR 2023 Highlight)](https://arxiv.org/pdf/2212.02623)

[Zineng Tang](https://zinengtang.github.io/), [Ziyi Yang](https://ziyi-yang.github.io/), [Guoxin Wang](https://www.guoxwang.com/), [Yuwei Fang](https://www.microsoft.com/en-us/research/people/yuwfan/), [Yang Liu](https://nlp-yang.github.io/), [Chenguang Zhu](https://cs.stanford.edu/people/cgzhu/), [Michael Zeng](https://www.microsoft.com/en-us/research/people/nzeng/), [Cha Zhang](https://www.microsoft.com/en-us/research/people/chazhang/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

Open Source Checklist:

- [x] Release Model (Encoder + Text decoder)
- [x] Release Most Scripts
- [ ] Vision Decoder / Weights (due to ethical considerations around fake-document generation, we plan to release this functionality as an Azure API)
- [x] Demo

## Introduction

UDOP unifies vision, text, and layout through a Vision-Text-Layout Transformer and unified generative pretraining tasks, including vision, text, layout, and mixed tasks. We show the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).
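
To make the prompt/target interface concrete, here is a purely illustrative sketch: the task-name prefixes mirror the paper's figure, but the `<mask>`/`<loc_*>`-style tokens are hypothetical placeholders, not the tokenizer's actual vocabulary.

```
# Illustrative prompt/target pairs for UDOP's unified generative interface.
# The task-name prefixes follow the paper's figure; the <mask> and <loc_*>
# tokens below are HYPOTHETICAL placeholders, not the real tokenizer symbols.
examples = [
    # Self-supervised: joint text-layout reconstruction (masked span -> text + layout)
    {"prompt": "Joint Text-Layout Reconstruction. Ship to: <mask> Date: 03/15/2021",
     "target": "<mask> John Smith <loc_12> <loc_34> <loc_56> <loc_78>"},
    # Supervised: document question answering
    {"prompt": "Question Answering. What is the date?",
     "target": "03/15/2021"},
]

for example in examples:
    print(example["prompt"], "->", example["target"])
```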

## Install

### Setup `python` environment

```
conda create -n UDOP python=3.8  # You can also use another environment manager or Python version.
conda activate UDOP              # Activate the environment before installing dependencies.
```

### Install other dependencies

```
pip install -r requirements.txt
```

## Run Scripts

Switch the model type by passing one of the following flags to the run scripts:

- `--model_type "UdopDual"`
- `--model_type "UdopUnimodel"`

### Finetuning on RVL-CDIP

Download the RVL-CDIP dataset first and update the dataset path in the script.
For OCR, you may need to customize the code for your own OCR engine; see the sketch after the command below.

```
bash scripts/finetune_rvlcdip.sh  # Finetuning on RVL-CDIP
```

### Finetuning on DUE Benchmark

Download [Duebenchmark](https://github.com/due-benchmark/baselines) and follow its procedure to preprocess the data.

The training code adapted to our framework is hosted in `benchmarker`; run it with:

```
bash scripts/finetune_duebenchmark.sh  # Finetuning on DUE Benchmark; switch tasks by changing the path to the dataset
```

The generated outputs can be evaluated with the [Duebenchmark due_evaluator](https://github.com/due-benchmark/evaluator).

### Model Checkpoints

The model checkpoints are hosted on the [Hugging Face Hub](https://huggingface.co/ZinengTang/Udop).
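
As a convenience, the whole checkpoint repository can be fetched programmatically with `huggingface_hub` (a sketch; the repo id comes from the link above, and which files your script loads depends on your setup):

```
# Download the released UDOP checkpoints from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ZinengTang/Udop")
print(local_dir)  # local path containing the checkpoint files
```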

## Citation

```
@article{tang2022unifying,
  title={Unifying Vision, Text, and Layout for Universal Document Processing},
  author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
  journal={arXiv preprint arXiv:2212.02623},
  year={2022}
}
```

## Contact

Zineng Tang (zn.tang.terran@gmail.com)