numbmelon committed on
Commit
a8b5d82
1 Parent(s): 3ec955f

Update README.md

Files changed (1)
  1. README.md +139 -0
README.md CHANGED
@@ -28,6 +28,145 @@ For generating single-step actions in GUI agent tasks, you can use:
 
  ## OS-Atlas-Action-7B
 
+ `OS-Atlas-Action-7B` is a GUI action model fine-tuned from OS-Atlas-Base-7B. Given a system prompt that defines the basic and custom actions, a screenshot, and a task instruction, the model generates its reasoning (`thought`) and the next action to take (`action`).
+
+ ### Installation
+ To use `OS-Atlas-Action-7B`, first install the necessary dependencies:
+ ```bash
+ pip install transformers
+ pip install qwen-vl-utils
+ pip install accelerate  # required by device_map="auto" in the inference example
+ ```
+
+ ### Example Inference Code
+ Below is an example of how to perform inference using the model:
+
+ ```python
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor.
+ # NOTE: both paths below are local checkpoint paths from the authors' environment;
+ # replace them with the location of your own copy of the checkpoint.
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     "/nas/shared/NLP_A100/wuzhenyu/ckpt/241029-qwen-stage2", torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(
+     "/nas/shared/NLP_A100/wuzhenyu/ckpt/20240928_finetune_qwen_7b_3m_imgsiz_1024_bs_1024_lr_1e-7_wd_1e-3_mixture"
+ )
+
+ # Define the system prompt
+ sys_prompt = """
+ You are now operating in Executable Language Grounding mode. Your goal is to help users accomplish tasks by suggesting executable actions that best fit their needs. Your skill set includes both basic and custom actions:
+
+ 1. Basic Actions
+ Basic actions are standardized and available across all platforms. They provide essential functionality and are defined with a specific format, ensuring consistency and reliability.
+ Basic Action 1: CLICK
+ - purpose: Click at the specified position.
+ - format: CLICK <point>[[x-axis, y-axis]]</point>
+ - example usage: CLICK <point>[[101, 872]]</point>
+
+ Basic Action 2: TYPE
+ - purpose: Enter specified text at the designated location.
+ - format: TYPE [input text]
+ - example usage: TYPE [Shanghai shopping mall]
+
+ Basic Action 3: SCROLL
+ - purpose: SCROLL in the specified direction.
+ - format: SCROLL [direction (UP/DOWN/LEFT/RIGHT)]
+ - example usage: SCROLL [UP]
+
+ 2. Custom Actions
+ Custom actions are unique to each user's platform and environment. They allow for flexibility and adaptability, enabling the model to support new and unseen actions defined by users. These actions extend the functionality of the basic set, making the model more versatile and capable of handling specific tasks.
+ Custom Action 1: LONG_PRESS
+ - purpose: Long press at the specified position.
+ - format: LONG_PRESS <point>[[x-axis, y-axis]]</point>
+ - example usage: LONG_PRESS <point>[[101, 872]]</point>
+
+ Custom Action 2: OPEN_APP
+ - purpose: Open the specified application.
+ - format: OPEN_APP [app_name]
+ - example usage: OPEN_APP [Google Chrome]
+
+ Custom Action 3: PRESS_BACK
+ - purpose: Press a back button to navigate to the previous screen.
+ - format: PRESS_BACK
+ - example usage: PRESS_BACK
+
+ Custom Action 4: PRESS_HOME
+ - purpose: Press a home button to navigate to the home page.
+ - format: PRESS_HOME
+ - example usage: PRESS_HOME
+
+ Custom Action 5: PRESS_RECENT
+ - purpose: Press the recent button to view or switch between recently used applications.
+ - format: PRESS_RECENT
+ - example usage: PRESS_RECENT
+
+ Custom Action 6: ENTER
+ - purpose: Press the enter button.
+ - format: ENTER
+ - example usage: ENTER
+
+ Custom Action 7: WAIT
+ - purpose: Wait for the screen to load.
+ - format: WAIT
+ - example usage: WAIT
+
+ Custom Action 8: COMPLETE
+ - purpose: Indicate the task is finished.
+ - format: COMPLETE
+ - example usage: COMPLETE
+
+ In most cases, task instructions are high-level and abstract. Carefully read the instruction and action history, then perform reasoning to determine the most appropriate next action. Ensure you strictly generate two sections: Thoughts and Actions.
+ Thoughts: Clearly outline your reasoning process for current step.
+ Actions: Specify the actual actions you will take based on your reasoning. You should follow action format above when generating.
+
+ Your current task instruction, action history, and associated screenshot are as follows:
+ Screenshot:
+ """
+
+ # Define the input message
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "text", "text": sys_prompt,
+             },
+             {
+                 "type": "image",
+                 # "?raw=true" makes GitHub serve the image file itself rather than the HTML viewer page
+                 "image": "https://github.com/OS-Copilot/OS-Atlas/blob/main/exmaples/images/action_example_1.jpg?raw=true",
+             },
+             {"type": "text", "text": "Task instruction: to allow the user to enter their first name\nHistory: null"},
+         ],
+     }
+ ]
+
+ # Prepare the input for the model
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ # Move the inputs to the device the model was placed on by device_map="auto"
+ inputs = inputs.to(model.device)
+
+ # Generate output
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+
+ # Post-process the output: drop the prompt tokens so only newly generated tokens are decoded
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ # skip_special_tokens=False keeps special tokens (such as the end-of-turn marker) in the decoded text
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+
+ ```
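
The decoded string still has to be turned into something an executor can act on. The following is a minimal sketch of one way to do that, assuming the response contains a thoughts section followed by an actions section with a single action line in one of the formats defined in the system prompt. The exact casing and layout of the section headers are assumptions, the names `ParsedAction` and `parse_response` are hypothetical, and the sample string is hand-written for illustration rather than real model output.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed layout: "thoughts:" ... "actions:" ..., matched case-insensitively,
# with the trailing "s" optional. This is not confirmed by the README.
_SECTION_RE = re.compile(
    r"thoughts?\s*:\s*(?P<thought>.*?)\s*actions?\s*:\s*(?P<action>.*)",
    re.IGNORECASE | re.DOTALL,
)
# Matches the point syntax used by CLICK and LONG_PRESS in the system prompt.
_POINT_RE = re.compile(r"<point>\[\[\s*(\d+)\s*,\s*(\d+)\s*\]\]</point>")
# Strips special tokens such as "<|im_end|>" that remain when
# skip_special_tokens=False is used during decoding.
_SPECIAL_TOKEN_RE = re.compile(r"<\|[^|]*\|>")


@dataclass
class ParsedAction:
    thought: str
    name: str                                # e.g. "CLICK", "TYPE", "COMPLETE"
    point: Optional[Tuple[int, int]] = None  # set for point-based actions
    argument: Optional[str] = None           # set for "[...]" style arguments


def parse_response(text: str) -> ParsedAction:
    """Split one decoded response into its thought and a structured action."""
    cleaned = _SPECIAL_TOKEN_RE.sub("", text).strip()
    match = _SECTION_RE.search(cleaned)
    if match is None:
        raise ValueError(f"unrecognized response layout: {text!r}")

    action_line = match.group("action").strip()
    if not action_line:
        raise ValueError(f"empty action section: {text!r}")

    name = action_line.split(maxsplit=1)[0]
    rest = action_line[len(name):].strip()

    point = None
    argument = None
    point_match = _POINT_RE.fullmatch(rest)
    if point_match:
        point = (int(point_match.group(1)), int(point_match.group(2)))
    elif rest.startswith("[") and rest.endswith("]"):
        argument = rest[1:-1]

    return ParsedAction(match.group("thought").strip(), name, point, argument)


# Hand-written sample in the assumed layout (not real model output).
sample = "Thoughts:\nTap the first-name field.\nActions:\nCLICK <point>[[362, 527]]</point><|im_end|>"
parsed = parse_response(sample)
print(parsed.name, parsed.point)
```

With the inference code above, `parse_response(output_text[0])` would be the natural call site. Raising on an unrecognized layout, rather than returning a default action, keeps a malformed response from being executed silently.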
+
+
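
Because the system prompt describes custom actions as user-defined, it can be convenient to generate that part of the prompt from data instead of editing the string by hand. The sketch below renders action definitions in the same `purpose` / `format` / `example usage` layout the prompt uses. `ActionSpec` and `render_custom_actions` are hypothetical names introduced for illustration, and how well the model follows actions outside the eight listed above is not something this README quantifies.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ActionSpec:
    name: str      # action keyword, e.g. "LONG_PRESS"
    purpose: str   # one-sentence description
    format: str    # syntax the model should emit
    example: str   # one concrete usage


def render_custom_actions(specs: Iterable[ActionSpec]) -> str:
    """Render specs as numbered 'Custom Action N' entries separated by blank lines."""
    blocks = []
    for index, spec in enumerate(specs, start=1):
        blocks.append(
            f"Custom Action {index}: {spec.name}\n"
            f"- purpose: {spec.purpose}\n"
            f"- format: {spec.format}\n"
            f"- example usage: {spec.example}"
        )
    return "\n\n".join(blocks)


# Two of the custom actions from the system prompt, expressed as data.
specs = [
    ActionSpec(
        name="LONG_PRESS",
        purpose="Long press at the specified position.",
        format="LONG_PRESS <point>[[x-axis, y-axis]]</point>",
        example="LONG_PRESS <point>[[101, 872]]</point>",
    ),
    ActionSpec(
        name="PRESS_BACK",
        purpose="Press a back button to navigate to the previous screen.",
        format="PRESS_BACK",
        example="PRESS_BACK",
    ),
]

custom_section = render_custom_actions(specs)
print(custom_section)
```

The rendered text can replace the list of custom action entries in `sys_prompt`, leaving the surrounding instructions unchanged.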
 
 
  ## Citation