Commit 4df6e76 (verified) by numbmelon · Parent: ba129d7 · Update README.md

Files changed (1): README.md (+86 -1)
---
base_model: Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

This repository contains the model of the paper [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://huggingface.co/papers/2410.23218).

<div align="center">

[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)

</div>

![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)

## Quick Start
OS-Atlas-Base-7B is a GUI grounding model fine-tuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

**Notes:** Our models accept images of any size as input. Model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions.
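As a concrete illustration, here is a minimal sketch of mapping a 0-1000 relative bounding box back to pixel coordinates; the 1920x1080 screenshot size below is only an example, not a property of the model:

```python
def to_pixels(box, width, height):
    """Convert an (x1, y1, x2, y2) box in 0-1000 relative coordinates
    to absolute pixel coordinates for an image of the given size."""
    x1, y1, x2, y2 = box
    return (round(x1 / 1000 * width), round(y1 / 1000 * height),
            round(x2 / 1000 * width), round(y2 / 1000 * height))

# Example: a relative box on a 1920x1080 screenshot
print(to_pixels((576, 12, 592, 42), 1920, 1080))  # (1106, 13, 1137, 45)
```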

### Inference Example
First, ensure that the necessary dependencies are installed:
```
pip install transformers
pip install qwen-vl-utils
```

Inference code example:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/OS-Copilot/OS-Atlas/blob/main/exmaples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Trim the prompt tokens from each generated sequence
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Keep special tokens: the grounding result is wrapped in <|box_start|>/<|box_end|>
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
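The printed string can then be turned into usable coordinates. This regex-based helper is a sketch that assumes the `<|box_start|>(x1,y1),(x2,y2)<|box_end|>` output format shown in the comment above:

```python
import re

def parse_box(output):
    """Extract the (x1, y1, x2, y2) bounding box from model output of the
    form <|box_start|>(x1,y1),(x2,y2)<|box_end|>."""
    m = re.search(r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", output)
    if m is None:
        raise ValueError("no bounding box found in model output")
    return tuple(int(v) for v in m.groups())

sample = "<|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>"
print(parse_box(sample))  # (576, 12, 592, 42)
```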

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
```