---
license: apache-2.0
library_name: transformers
base_model: OpenGVLab/InternVL2-4B
pipeline_tag: image-text-to-text
---

# OS-Atlas: A Foundation Action Model For Generalist GUI Agents

<div align="center">

[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)

</div>

## Overview
![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)

OS-Atlas provides a series of models specifically designed for GUI agents.

For GUI grounding tasks, you can use:
- [OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B)
- [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B)

For generating single-step actions in GUI agent tasks, you can use:
- [OS-Atlas-Action-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Action-7B)
- [OS-Atlas-Action-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Action-4B)

## OS-Atlas-Action-4B

`OS-Atlas-Action-4B` is a GUI action model fine-tuned from [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B). Given a system prompt that defines the available basic and custom actions, a task instruction, and a screenshot, the model generates its reasoning for the current step (`thoughts`) and the next action to execute (`actions`), as shown in the example output at the end of the inference code below.

### Installation
To use `OS-Atlas-Action-4B`, first install the necessary dependencies:
```
pip install transformers
```
For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

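The inference example below also relies on `torch`, `torchvision`, and `Pillow`, which are imported at the top of the script. If they are not already available in your environment, they can be installed the same way (no specific versions are pinned here, so treat this as a starting point rather than an exact requirement):
```
pip install torch torchvision pillow
```
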
### Example Inference Code
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer, set_seed

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
set_seed(1234)


# standard InternVL2-style image transform: RGB conversion, resize, ImageNet normalization
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


# pick the tiling aspect ratio closest to the input image's aspect ratio
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


# split the image into up to `max_num` tiles of size `image_size`, optionally adding a thumbnail
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


# load an image file and return the stacked tile tensor expected by the model
def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


# To load the model on multiple GPUs, please refer to the `Multiple GPUs` section
# of the InternVL2 documentation linked above.
path = 'OS-Copilot/OS-Atlas-Action-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
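# --- Optional, illustrative sketch (not part of the original example): multi-GPU loading ---
# Transformers can shard a checkpoint across several GPUs via `device_map`; note that
# InternVL2-based models may need the custom device map described in the InternVL2
# documentation, so verify the layer placement before relying on this.
#
# model = AutoModel.from_pretrained(
#     path,
#     torch_dtype=torch.bfloat16,
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
#     device_map='auto').eval()
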
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# set the max number of tiles in `max_num`; replace the path below with a local GUI
# screenshot (an example image is available in the OS-Atlas repository:
# https://github.com/OS-Copilot/OS-Atlas/blob/main/exmaples/images/action_example_1.jpg)
pixel_values = load_image('action_example_1.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

sys_prompt = """
You are now operating in Executable Language Grounding mode. Your goal is to help users accomplish tasks by suggesting executable actions that best fit their needs. Your skill set includes both basic and custom actions:

1. Basic Actions
Basic actions are standardized and available across all platforms. They provide essential functionality and are defined with a specific format, ensuring consistency and reliability.
Basic Action 1: CLICK
- purpose: Click at the specified position.
- format: CLICK <point>[[x-axis, y-axis]]</point>
- example usage: CLICK <point>[[101, 872]]</point>

Basic Action 2: TYPE
- purpose: Enter specified text at the designated location.
- format: TYPE [input text]
- example usage: TYPE [Shanghai shopping mall]

Basic Action 3: SCROLL
- purpose: SCROLL in the specified direction.
- format: SCROLL [direction (UP/DOWN/LEFT/RIGHT)]
- example usage: SCROLL [UP]

2.Custom Actions
Custom actions are unique to each user's platform and environment. They allow for flexibility and adaptability, enabling the model to support new and unseen actions defined by users. These actions extend the functionality of the basic set, making the model more versatile and capable of handling specific tasks.
Custom Action 1: LONG_PRESS
- purpose: Long press at the specified position.
- format: LONG_PRESS <point>[[x-axis, y-axis]]</point>
- example usage: LONG_PRESS <point>[[101, 872]]</point>

Custom Action 2: OPEN_APP
- purpose: Open the specified application.
- format: OPEN_APP [app_name]
- example usage: OPEN_APP [Google Chrome]

Custom Action 3: PRESS_BACK
- purpose: Press a back button to navigate to the previous screen.
- format: PRESS_BACK
- example usage: PRESS_BACK

Custom Action 4: PRESS_HOME
- purpose: Press a home button to navigate to the home page.
- format: PRESS_HOME
- example usage: PRESS_HOME

Custom Action 5: PRESS_RECENT
- purpose: Press the recent button to view or switch between recently used applications.
- format: PRESS_RECENT
- example usage: PRESS_RECENT

Custom Action 6: ENTER
- purpose: Press the enter button.
- format: ENTER
- example usage: ENTER

Custom Action 7: WAIT
- purpose: Wait for the screen to load.
- format: WAIT
- example usage: WAIT

Custom Action 8: COMPLETE
- purpose: Indicate the task is finished.
- format: COMPLETE
- example usage: COMPLETE


In most cases, task instructions are high-level and abstract. Carefully read the instruction and action history, then perform reasoning to determine the most appropriate next action. Ensure you strictly generate two sections: Thoughts and Actions.
Thoughts: Clearly outline your reasoning process for current step.
Actions: Specify the actual actions you will take based on your reasoning. You should follow action format above when generating.

Your current task instruction, action history, and associated screenshot are as follows:
Screenshot:
<image>
Task instruction: {}
History: null
"""

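# --- Illustrative only (not part of the original example): adding a custom action ---
# Custom actions are user-defined, so the list in `sys_prompt` can be extended with
# further entries that follow the same "purpose / format / example usage" pattern
# before the prompt is formatted; `DOUBLE_TAP` below is a hypothetical example:
#
# extra_action = """Custom Action 9: DOUBLE_TAP
# - purpose: Double tap at the specified position.
# - format: DOUBLE_TAP <point>[[x-axis, y-axis]]</point>
# - example usage: DOUBLE_TAP <point>[[101, 872]]</point>
# """
# sys_prompt = sys_prompt.replace("In most cases", extra_action + "\nIn most cases")
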
question = sys_prompt.format("to allow the user to enter their first name")
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'Assistant:\n{response}')

# Assistant:
# thoughts:
# click on the first name field
# actions:
# CLICK [[362,527]]
```
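
The model replies with plain text containing a `thoughts:` section followed by an `actions:` section, as in the sample output above. The snippet below is a minimal, illustrative sketch of how such a reply could be split into the two sections and how point coordinates could be extracted from `CLICK`-style actions; the `parse_response` and `extract_click_point` helpers are hypothetical (they are not part of the OS-Atlas codebase), and the exact coordinate convention (raw pixels vs. normalized values) should be checked against the OS-Atlas repository before acting on the screen.

```python
import re

def parse_response(response: str):
    """Split an OS-Atlas-style reply into 'thoughts' and a list of action strings."""
    match = re.search(r'thoughts:\s*(.*?)\s*actions:\s*(.*)', response,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    thoughts = match.group(1).strip()
    actions = [line.strip() for line in match.group(2).splitlines() if line.strip()]
    return {'thoughts': thoughts, 'actions': actions}

def extract_click_point(action: str):
    """Return (x, y) from actions such as 'CLICK <point>[[x, y]]</point>' or 'CLICK [[x, y]]'."""
    point = re.search(r'\[\[\s*(\d+)\s*,\s*(\d+)\s*\]\]', action)
    if point is None:
        return None  # actions like TYPE, SCROLL, or PRESS_BACK carry no point
    return int(point.group(1)), int(point.group(2))

# Example with the sample reply shown above:
parsed = parse_response("thoughts:\nclick on the first name field\nactions:\nCLICK [[362,527]]")
print(parsed['thoughts'])                          # click on the first name field
print(extract_click_point(parsed['actions'][0]))   # (362, 527)
```

Keeping the parsing this loose (case-insensitive section headers, optional `<point>` tags) accommodates the two slightly different action spellings that appear in the system prompt and in the sample output.
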

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
```