---
license: apache-2.0
library_name: transformers
base_model: OpenGVLab/InternVL2-4B
pipeline_tag: image-text-to-text
---
# OS-Atlas: A Foundation Action Model For Generalist GUI Agents
<div align="center">
[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗Data\]](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)
</div>
## Overview
![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)
OS-Atlas provides a series of models specifically designed for GUI agents.
For GUI grounding tasks, you can use:
- [OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B)
- [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B)
For generating single-step actions in GUI agent tasks, you can use:
- [OS-Atlas-Pro-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-7B)
- [OS-Atlas-Pro-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-4B)
## OS-Atlas-Pro-4B
`OS-Atlas-Pro-4B` is a GUI action model finetuned from OS-Atlas-Base-4B. Given a system prompt, a set of basic and custom actions, and a task instruction, the model produces its reasoning (`thought`) and proposes the appropriate next step (`action`).
Note that the released `OS-Atlas-Pro-4B` model is the one described in Section 5.4 of the paper. Compared to the OS-Atlas models reported in Tables 4 and 5, the Pro model generalizes better and performs better: it is not constrained to specific tasks or training splits that exist only to satisfy particular experimental conditions (e.g., the OOD and SFT settings). Releasing the Pro models also spares us from flooding Hugging Face with more than 20 distinct model checkpoints.
### Installation
To use `OS-Atlas-Pro-4B`, first install the necessary dependencies:
```
pip install transformers
```
For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).
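In particular, the inference example below also imports PyTorch, torchvision, and Pillow; if they are not already present in your environment, something like the following should cover them (exact versions depend on your CUDA setup):
```
pip install torch torchvision pillow
```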
### Example Inference Code
First download the [example image](https://github.com/OS-Copilot/OS-Atlas/blob/main/examples/images/action_example_1.jpg) and save it to the current directory.
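If you prefer to fetch it from a script, here is a minimal sketch using `urllib`; the raw-file URL below is inferred from the GitHub link above, so adjust it if the repository layout changes:

```python
import urllib.request

# Raw-file URL inferred from the GitHub link above (assumption; adjust if the file moves).
IMAGE_URL = "https://raw.githubusercontent.com/OS-Copilot/OS-Atlas/main/examples/images/action_example_1.jpg"
urllib.request.urlretrieve(IMAGE_URL, "./action_example_1.jpg")
```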
Below is an example of how to perform inference using the model:
```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import set_seed
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
set_seed(1234)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
# If you want to load the model across multiple GPUs, please refer to the InternVL2 documentation.
path = 'OS-Copilot/OS-Atlas-Pro-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./action_example_1.jpg', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
sys_prompt = """
You are now operating in Executable Language Grounding mode. Your goal is to help users accomplish tasks by suggesting executable actions that best fit their needs. Your skill set includes both basic and custom actions:
1. Basic Actions
Basic actions are standardized and available across all platforms. They provide essential functionality and are defined with a specific format, ensuring consistency and reliability.
Basic Action 1: CLICK
- purpose: Click at the specified position.
- format: CLICK <point>[[x-axis, y-axis]]</point>
- example usage: CLICK <point>[[101, 872]]</point>
Basic Action 2: TYPE
- purpose: Enter specified text at the designated location.
- format: TYPE [input text]
- example usage: TYPE [Shanghai shopping mall]
Basic Action 3: SCROLL
- purpose: SCROLL in the specified direction.
- format: SCROLL [direction (UP/DOWN/LEFT/RIGHT)]
- example usage: SCROLL [UP]
2.Custom Actions
Custom actions are unique to each user's platform and environment. They allow for flexibility and adaptability, enabling the model to support new and unseen actions defined by users. These actions extend the functionality of the basic set, making the model more versatile and capable of handling specific tasks.
Custom Action 1: LONG_PRESS
- purpose: Long press at the specified position.
- format: LONG_PRESS <point>[[x-axis, y-axis]]</point>
- example usage: LONG_PRESS <point>[[101, 872]]</point>
Custom Action 2: OPEN_APP
- purpose: Open the specified application.
- format: OPEN_APP [app_name]
- example usage: OPEN_APP [Google Chrome]
Custom Action 3: PRESS_BACK
- purpose: Press a back button to navigate to the previous screen.
- format: PRESS_BACK
- example usage: PRESS_BACK
Custom Action 4: PRESS_HOME
- purpose: Press a home button to navigate to the home page.
- format: PRESS_HOME
- example usage: PRESS_HOME
Custom Action 5: PRESS_RECENT
- purpose: Press the recent button to view or switch between recently used applications.
- format: PRESS_RECENT
- example usage: PRESS_RECENT
Custom Action 6: ENTER
- purpose: Press the enter button.
- format: ENTER
- example usage: ENTER
Custom Action 7: WAIT
- purpose: Wait for the screen to load.
- format: WAIT
- example usage: WAIT
Custom Action 8: COMPLETE
- purpose: Indicate the task is finished.
- format: COMPLETE
- example usage: COMPLETE
In most cases, task instructions are high-level and abstract. Carefully read the instruction and action history, then perform reasoning to determine the most appropriate next action. Ensure you strictly generate two sections: Thoughts and Actions.
Thoughts: Clearly outline your reasoning process for current step.
Actions: Specify the actual actions you will take based on your reasoning. You should follow action format above when generating.
Your current task instruction, action history, and associated screenshot are as follows:
Screenshot:
<image>
Task instruction: {}
History: null
"""
question = sys_prompt.format("to allow the user to enter their first name")
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'Assistant:\n{response}')
# Assistant:
# thoughts:
# click on the first name field
# actions:
# CLICK [[362,527]]
```
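The response is plain text in the `thoughts:` / `actions:` layout shown in the comment above. Below is a minimal, hypothetical post-processing sketch (not part of the released codebase) for splitting the output and pulling out click coordinates; the regular expressions only assume the formats illustrated above (`CLICK [[x,y]]`, with or without the `<point>...</point>` wrapper from the prompt), and how the coordinates map back onto the original screenshot (absolute pixels vs. a normalized range) should be checked against the paper before acting on them.

```python
import re

def parse_response(response: str):
    """Split the model output into thought, action, and optional click coordinates.

    Illustrative helper only; it assumes the `thoughts:` / `actions:` layout
    shown in the example output above.
    """
    thought_match = re.search(r"thoughts?:\s*(.*?)\s*actions?:", response, re.IGNORECASE | re.DOTALL)
    action_match = re.search(r"actions?:\s*(.*)", response, re.IGNORECASE | re.DOTALL)
    thought = thought_match.group(1).strip() if thought_match else ""
    action = action_match.group(1).strip() if action_match else ""

    # Handle both `CLICK [[x,y]]` and `CLICK <point>[[x,y]]</point>` style outputs.
    coords = None
    point = re.search(r"CLICK.*?\[\[\s*(\d+)\s*,\s*(\d+)\s*\]\]", action)
    if point:
        coords = (int(point.group(1)), int(point.group(2)))
    return thought, action, coords

thought, action, coords = parse_response(response)
print(thought, action, coords)  # e.g. ('click on the first name field', 'CLICK [[362,527]]', (362, 527))
```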
## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
journal={arXiv preprint arXiv:2410.23218},
year={2024}
}
```