PyTorch
llama
File size: 8,947 Bytes
f99e404
 
 
 
 
 
77fe6c5
 
f99e404
a76121d
 
 
 
77fe6c5
a76121d
ad47531
a76121d
 
 
 
 
 
 
cd9d5b6
a76121d
 
 
 
 
cd9d5b6
 
 
a76121d
 
 
 
 
cd9d5b6
a76121d
 
76d38ec
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a76121d
76d38ec
a76121d
76d38ec
a76121d
76d38ec
 
 
 
a76121d
 
 
5425faf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a76121d
 
 
76d38ec
a76121d
 
 
 
 
cd9d5b6
a76121d
 
 
 
 
cd9d5b6
a76121d
 
 
 
 
 
cd9d5b6
a76121d
 
 
 
 
 
 
 
 
 
dee9bcb
a76121d
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
license: apache-2.0
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
datasets:
- stanfordnlp/nnetnav-wa
---

# Model Card for Llama8b-NNetNav-WA

<!-- Provide a quick summary of what the model is/does. [Optional] -->
LLama8b-NNetNav-WA is a [LLama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model that is instruct-tuned with [NNetNav-WA](https://huggingface.co/datasets/stanfordnlp/nnetnav-wa) data collected via unsupervised exploration on [WebArena](http://webarena.dev) websites, with a larger [LLama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model.

More details about this model can be found in our paper: [NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild](https://arxiv.org/abs/2410.02907).


##  Table of Contents

- [Model Card for Llama8b-NNetNav-WA](#model-card-for--model_id-)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Results on Web-Agent Benchmarks](#results-on-benchmarks)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
  - [Hardware](#hardware)
  - [Software](#software)
- [Model Card Authors [optional]](#model-card-authors-optional)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

## Model Details
This model is intended to be used as a **web-agent** i.e. given an instruction such as _Upvote the post by user smurty123 on subreddit r/LocalLLaMA_, and a web-url _reddit.com_, the model can perform the task by executing a sequence of actions.

<!-- Provide a longer summary of what this model is/does. -->
The action space of the model is as follows:
```plaintext
Page Operation Actions:
`click [id]`: This action clicks on an element with a specific id on the webpage.
`type [id] [content] [press_enter_after=0|1]`: Use this to type the content into the field with id. By default, the "Enter" key is pressed after typing unless press_enter_after is set to 0.
`hover [id]`: Hover over an element with id.
`press [key_comb]`:  Simulates the pressing of a key combination on the keyboard (e.g., Ctrl+v).
`scroll [down|up]`: Scroll the page up or down.

Tab Management Actions:
`new_tab`: Open a new, empty browser tab.
`tab_focus [tab_index]`: Switch the browser's focus to a specific tab using its index.
`close_tab`: Close the currently active tab.

URL Navigation Actions:
`goto [url]`: Navigate to a specific URL.
`go_back`: Navigate to the previously viewed page.
`go_forward`: Navigate to the next page (if a previous 'go_back' action was performed).

Completion Action:
`stop [answer]`: Issue this action when you believe the task is complete. If the objective is to find a text-based answer, provide the answer in the bracket. If you believe the task is impossible to complete, provide the answer as "N/A" in the bracket.
```

## Results on Benchmarks

This model gets the following results on WebArena and WebVoyager:

| Model                  | WebArena (SR) | WebVoyager (SR) |
|------------------------|--------------:|---------------:|
| **GPT-4**             | **14.1**      | **33.5**      |
| **llama8b-nnetnav-wa** | **16.3**      | **28.1**      |


## Bias, Risks, and Limitations

### **Bias**
As with all ML models, **Llama8b-NNetNav-WA** inherits biases from its training data. Since the dataset is collected via unsupervised exploration on self-hosted WebArena websites, it will reflect biases present in website structures, navigation flows, and content representations.

- **Selection Bias:** The model is trained on Self-hosted websites that mimic reddit, github, google maps, simple e-commerce websites and CMS websites. This model is likely to struggle with websites with modern layouts seen on live websites.   
- **Demographic Bias:** WebArena self-hosted websites over-represent Western English-speaking users, and the model may perform worse on non-English or culturally distinct websites.  
  - Example: A model trained mostly on U.S. e-commerce sites may navigate amazon.com effectively but may struggle with Flipkart (India) or Rakuten (Japan).

If you are interested in training a NNetNav based agent for your own domain, please check out our [codebase](https://github.com/MurtyShikhar/NNetnav). Or if you're interested in a model that has been shown to work well on a variety of live websites, please check out [LLama8b-NNetNav-Live](https://huggingface.co/stanfordnlp/llama8b-nnetnav-live)

### **Risks**
#### **1. Unintended Actions**
The model operates by executing web actions based on textual observation spaces, which may lead to unintended consequences when dealing with ambiguous or poorly structured websites.

- If instructed to "delete all spam messages in my inbox," but the website has unusual button placement in the AXTree, the model might mistakenly delete important emails instead.
- If asked to "buy the cheapest laptop on Amazon," the model might select an accessory instead of an actual laptop if the AXTree of the listing page has misleading layout

#### **2. Security & Privacy Risks**
Since the model interacts with external web content, there are significant risks related to unintentional data exposure, credential leaks, and interaction with harmful content.

- If asked to "log into my Gmail and check unread emails," the model may type and submit credentials without realizing it, potentially exposing passwords.
- A user asking the model to "search for free software downloads" might inadvertently lead to interactions with phishing or malware-hosting sites.

#### **3. Adversarial Manipulation**
Malicious websites can deceive the model by using **dark patterns**—UI/UX tricks that mislead users (or bots).

- A fraudulent website may create **fake "Close" buttons** in the AXTree that actually trigger **downloads or pop-ups**. The model, thinking it's closing a window, may instead **click a malicious link**.
- If asked to "unsubscribe from a newsletter," but the page uses **misleading button labels** in the AXTree (e.g., "Unsubscribe" actually means "Resubscribe"), the model could perform the opposite action.

#### **4. Legal & Ethical Considerations**
Web navigation often involves handling user-generated content, news, and e-commerce transactions, all of which pose ethical and legal challenges.

- If instructed to "find the latest election results," the model might click on a misleading news source, potentially spreading misinformation.
- If asked to "find the cheapest flight ticket," it could unintentionally violate terms of service by scraping restricted airline data.

### **Limitations**
#### **1. Generalization to Unseen Websites**
This model is trained via interaction on 5 self-hosted WebArena websites, and is known to struggle on real, live websites. Please check out [LLama8b-NNetNav-Live](https://huggingface.co/stanfordnlp/llama8b-nnetnav-live) for a model that performs better on live websites.

#### **2. Instruction Sensitivity**
Vague instructions can lead to unintended actions.

- "Find me the best laptop for gaming" is **subjective**, and the model might select a **random option** instead of following some criteria (e.g., GPU, refresh rate).

#### **3. Performance on Long-Horizon Tasks**
The model may struggle when tasks require **deep memory retention, complex multi-step planning, or backtracking**.

- *Example:* When booking a hotel on a travel website, the model might navigate **through multiple filters and options** but forget previous selections when reaching the checkout page.

#### **4. Token Limitations**
The model's **maximum sequence length of 20k tokens** limits its ability to handle long, continuous web interactions.

- *Example:* When filling a very long multi-step form, the model might forget earlier responses, leading to errors.



## How to Get Started with the Model

TODO

## Training Details

### Training Data

This model was trained with SFT on the [NNetnav-WA](https://huggingface.co/datasets/stanfordnlp/nnetnav-wa) dataset, which is comprised of synthetic demonstrations entirely from self-hosted websites. 

### Training Procedure

This model was trained for 2 epochs (roughly 4k gradient steps) with a batch size of 128, and a maximum sequence length of 20000.

## Environmental Impact

- **Hardware Type:** 4 H100 GPUs (80G)
- **Hours used:** Roughly 2 days.
- **Cloud Provider:** Stanford compute.
- **Compute Region:** Stanford energy grid.

## Technical Specifications

### Hardware

This model was trained on 4 H100s.

### Software

This model was fine-tuned with [Open-Instruct](https://github.com/allenai/open-instruct/tree/main)


## Model Card Authors

Shikhar Murty

## Model Card Contact

smurty@cs.stanford.edu