Model for the paper: Harnessing Webpage UIs for Text-Rich Visual Understanding

🌐 Homepage | 🐍 GitHub | 📖 arXiv

Introduction

We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI excel at web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web. They also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation.

Training & Evaluation

Model training is based on LLaVA-NeXT.

For deployment, refer to the SGLang deployment section in the LLaVA-NeXT repo.
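As a rough illustration of the deployment step, the sketch below only assembles an SGLang server launch command as an argument list; it does not start anything. The helper name `sglang_launch_command` and its default host, port, and tensor-parallel size are assumptions made for this example. Whether this checkpoint also needs a separate tokenizer path or chat template is not stated here, so treat the LLaVA-NeXT repo instructions as authoritative.

```python
import shlex


def sglang_launch_command(model_path="neulab/UIX-Qwen2",
                          host="127.0.0.1",
                          port=30000,
                          tp_size=1):
    """Build (but do not run) an argv list for launching an SGLang server.

    The defaults are illustrative assumptions, not values taken from the
    LLaVA-NeXT deployment instructions.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
        "--tp-size", str(tp_size),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for inspection.
    print(shlex.join(sglang_launch_command()))
```

Building the command as a list keeps it easy to inspect or pass to `subprocess.run` once the required arguments have been confirmed against the LLaVA-NeXT repo.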

For benchmark evaluation, we use the lmms-eval package. See our MultiUI repo to evaluate on the benchmarks mentioned in the paper.
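The evaluation step might look roughly like the following sketch, which only composes an lmms-eval command line and does not execute it. The helper name `lmms_eval_command`, the `llava` model wrapper, and the example task name `chartqa` are assumptions made for illustration. The MultiUI repo defines the exact tasks and model arguments used in the paper.

```python
import shlex


def lmms_eval_command(tasks,
                      pretrained="neulab/UIX-Qwen2",
                      model="llava",
                      output_path="./logs/"):
    """Build (but do not run) an argv list for an lmms-eval run.

    `model` and the task names are illustrative assumptions; check the
    MultiUI repo for the settings actually used in the paper.
    """
    return [
        "python", "-m", "lmms_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),  # lmms-eval takes a comma-separated list
        "--batch_size", "1",
        "--log_samples",
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # "chartqa" is a placeholder task name for this example.
    print(shlex.join(lmms_eval_command(["chartqa"])))
```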

Model Performance

[Figures: three model performance result images, not reproduced here]

Contact

Citation

If you find this work helpful, please cite our paper:

@misc{liu2024harnessingwebpageuistextrich,
      title={Harnessing Webpage UIs for Text-Rich Visual Understanding}, 
      author={Junpeng Liu and Tianyue Ou and Yifan Song and Yuxiao Qu and Wai Lam and Chenyan Xiong and Wenhu Chen and Graham Neubig and Xiang Yue},
      year={2024},
      eprint={2410.13824},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13824}, 
}

Model tree for neulab/UIX-Qwen2

Base model: Qwen/Qwen2-7B (this model is a finetune)
Model size: 8.03B params
Tensor type: BF16 (Safetensors)