Commit 4df6e76 (verified) by numbmelon · Parent: ba129d7 · Update README.md

Files changed (1): README.md (+86 -1)
---
base_model: Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

This repository contains the model of the paper [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://huggingface.co/papers/2410.23218).

<div align="center">

[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)

</div>

![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)

## Quick Start
OS-Atlas-Base-7B is a GUI grounding model fine-tuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

**Notes:** Our models accept images of any size as input. Model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions.
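As a concrete illustration, here is a minimal sketch of mapping a 0-1000 relative bounding box back to pixel coordinates; the 1920x1080 screenshot size below is only an example, not a property of the model:

```python
def to_pixels(box, width, height):
    """Convert an (x1, y1, x2, y2) box in 0-1000 relative coordinates
    to absolute pixel coordinates for an image of the given size."""
    x1, y1, x2, y2 = box
    return (round(x1 / 1000 * width), round(y1 / 1000 * height),
            round(x2 / 1000 * width), round(y2 / 1000 * height))

# Example: a relative box on a 1920x1080 screenshot
print(to_pixels((576, 12, 592, 42), 1920, 1080))  # (1106, 13, 1137, 45)
```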

### Inference Example
First, ensure that the necessary dependencies are installed:
```
pip install transformers
pip install qwen-vl-utils
```

Inference code example:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/OS-Copilot/OS-Atlas/blob/main/exmaples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Trim the prompt tokens from each generated sequence
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Keep special tokens: the grounding result is wrapped in <|box_start|>/<|box_end|>
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
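The printed string can then be turned into usable coordinates. This regex-based helper is a sketch that assumes the `<|box_start|>(x1,y1),(x2,y2)<|box_end|>` output format shown in the comment above:

```python
import re

def parse_box(output):
    """Extract the (x1, y1, x2, y2) bounding box from model output of the
    form <|box_start|>(x1,y1),(x2,y2)<|box_end|>."""
    m = re.search(r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", output)
    if m is None:
        raise ValueError("no bounding box found in model output")
    return tuple(int(v) for v in m.groups())

sample = "<|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>"
print(parse_box(sample))  # (576, 12, 592, 42)
```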

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
```