# OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent
<p align="center">
<img src="imgs/logo.png" alt="Logo">
</p>

[arXiv](https://arxiv.org/abs/2408.00203)
[License](https://opensource.org/licenses/MIT)

📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)] [[Models](https://huggingface.co/microsoft/OmniParser)]

**OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
## News
- [2024/10] Both the Interactive Region Detection model and the Icon Functional Description model are released! [Hugging Face models](https://huggingface.co/microsoft/OmniParser)
- [2024/09] OmniParser achieves the best performance on [Windows Agent Arena](https://microsoft.github.io/WindowsAgentArena/)!
## Install
Install environment:
```bash
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt
```
Then download the model checkpoint files from https://huggingface.co/microsoft/OmniParser and put them under `weights/`. The default folder structure is: `weights/icon_detect`, `weights/icon_caption_florence`, `weights/icon_caption_blip2`.
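The download and the folder check can also be scripted. The following is a minimal sketch, not part of this repository: the helper names `download_weights` and `missing_weight_dirs` are hypothetical, the download relies on `snapshot_download` from the separately installed `huggingface_hub` package, and it assumes the top-level folders of the model repo mirror the default layout listed above.

```python
from pathlib import Path

# Subfolders this README lists as the default layout under weights/.
EXPECTED_SUBDIRS = ("icon_detect", "icon_caption_florence", "icon_caption_blip2")


def download_weights(root="weights"):
    """Fetch the checkpoint repo into `root` (requires `pip install huggingface_hub`)."""
    # Imported lazily so the folder check below works without this package.
    from huggingface_hub import snapshot_download

    # Assumption: the repo's top-level folders match EXPECTED_SUBDIRS.
    snapshot_download(repo_id="microsoft/OmniParser", local_dir=root)


def missing_weight_dirs(root="weights"):
    """Return the expected subfolders that are absent from `root`."""
    base = Path(root)
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]
```

If the layout assumption holds, calling `download_weights()` followed by `missing_weight_dirs()` should return an empty list; any names it does return are folders that still need to be placed by hand.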
Finally, convert the safetensors weights to a `.pt` file:
```bash
python weights/convert_safetensor_to_pt.py
```
## Examples
We put together a few simple examples in `demo.ipynb`.
## Gradio Demo
To run the Gradio demo, simply run:
```bash
python gradio_demo.py
```
## 📚 Citation
Our technical report can be found [here](https://arxiv.org/abs/2408.00203).
If you find our work useful, please consider citing our work:
```
@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent},
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203},
}
```