Update README.md
README.md
CHANGED
@@ -72,8 +72,61 @@ This model gets the following results on WebArena and WebVoyager:
## Bias, Risks, and Limitations

### **Bias**
As with all ML models, **Llama8b-NNetNav-WA** inherits biases from its training data. Because the dataset is collected via unsupervised exploration on self-hosted WebArena websites, it reflects the biases present in those sites' structures, navigation flows, and content representations.

- **Selection Bias:** The model is trained on self-hosted websites that mimic Reddit, GitHub, Google Maps, simple e-commerce websites, and CMS websites. It is likely to struggle with the modern layouts found on live websites.
- **Demographic Bias:** WebArena self-hosted websites over-represent Western, English-speaking users, so the model may perform worse on non-English or culturally distinct websites.
  - Example: A model trained mostly on U.S. e-commerce sites may navigate amazon.com effectively but struggle with Flipkart (India) or Rakuten (Japan).

If you are interested in training an NNetNav-based agent for your own domain, please check out our [codebase](https://github.com/MurtyShikhar/NNetnav). If you want a model that has been shown to work well on a variety of live websites, please check out [LLama8b-NNetNav-Live](https://huggingface.co/stanfordnlp/llama8b-nnetnav-live).

### **Risks**
#### **1. Unintended Actions**
The model chooses web actions from a textual observation of the page (its accessibility tree, or AXTree), which may lead to unintended consequences on ambiguous or poorly structured websites.

- If instructed to "delete all spam messages in my inbox" on a website whose AXTree has unusual button placement, the model might mistakenly delete important emails instead.
- If asked to "buy the cheapest laptop on Amazon," the model might select an accessory instead of an actual laptop if the AXTree of the listing page has a misleading layout.
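
One way to limit the damage from mistakes like these is to pause for human confirmation before risky clicks. The following is a minimal sketch, not part of the NNetNav codebase: the `click [id]` action-string format, the `RISKY_KEYWORDS` list, and the `needs_confirmation` helper are all assumptions made for illustration.

```python
import re

# Hypothetical keyword list; adjust it for your own deployment.
RISKY_KEYWORDS = ("delete", "remove", "buy", "purchase", "pay", "checkout")


def needs_confirmation(action: str, element_label: str) -> bool:
    """Return True when a click targets an element whose label looks risky.

    `action` is assumed to be a string such as "click [42]";
    `element_label` is the text of the target element taken from the AXTree.
    """
    # Only clicks are gated in this sketch; other action types pass through.
    if not action.strip().lower().startswith("click"):
        return False
    label = element_label.lower()
    # Whole-word match so that "pay" does not fire on "display".
    return any(
        re.search(rf"\b{re.escape(word)}\b", label) for word in RISKY_KEYWORDS
    )
```

A wrapper around the agent loop could call `needs_confirmation` before executing each action and wait for user approval when it returns `True`. Keyword matching is coarse and cannot detect mislabeled elements, so it complements human oversight rather than replacing it.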

#### **2. Security & Privacy Risks**
Because the model interacts with external web content, there are significant risks of unintentional data exposure, credential leaks, and interaction with harmful content.

- If asked to "log into my Gmail and check unread emails," the model may type and submit credentials without recognizing that they are sensitive, potentially exposing passwords.
- A request to "search for free software downloads" might inadvertently lead the model to interact with phishing or malware-hosting sites.
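
A common mitigation is to restrict the agent to a known set of hosts. The sketch below assumes a deployment-specific allowlist; `ALLOWED_HOSTS`, its example entries, and the `is_allowed_url` helper are hypothetical names, not part of the model or its codebase.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts you trust.
ALLOWED_HOSTS = frozenset({"localhost", "shop.example.com"})


def is_allowed_url(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True if the URL's host is an allowed host or one of its subdomains."""
    # hostname is None for strings that are not absolute URLs.
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)
```

Checking the parsed hostname, rather than searching the raw URL string, avoids being fooled by an allowed name that appears only in the path or as a prefix of an attacker-controlled domain.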

#### **3. Adversarial Manipulation**
Malicious websites can deceive the model with **dark patterns**: UI/UX tricks that mislead users (or bots).

- A fraudulent website may expose **fake "Close" buttons** in the AXTree that actually trigger **downloads or pop-ups**. The model, believing it is closing a window, may instead **click a malicious link**.
- If asked to "unsubscribe from a newsletter" on a page that uses **misleading button labels** in the AXTree (e.g., a button labeled "Unsubscribe" that actually resubscribes), the model could perform the opposite action.

#### **4. Legal & Ethical Considerations**
Web navigation often involves handling user-generated content, news, and e-commerce transactions, all of which pose ethical and legal challenges.

- If instructed to "find the latest election results," the model might click on a misleading news source, potentially spreading misinformation.
- If asked to "find the cheapest flight ticket," it could unintentionally violate terms of service by scraping restricted airline data.

### **Limitations**
#### **1. Generalization to Unseen Websites**
This model is trained via interaction on 5 self-hosted WebArena websites and is known to struggle on real, live websites. Please check out [LLama8b-NNetNav-Live](https://huggingface.co/stanfordnlp/llama8b-nnetnav-live) for a model that performs better on live websites.

#### **2. Instruction Sensitivity**
Vague instructions can lead to unintended actions.

- "Find me the best laptop for gaming" is **subjective**, and the model might select a **random option** instead of applying concrete criteria (e.g., GPU, refresh rate).

#### **3. Performance on Long-Horizon Tasks**
The model may struggle when tasks require **deep memory retention, complex multi-step planning, or backtracking**.

- *Example:* When booking a hotel on a travel website, the model might navigate **through multiple filters and options** but forget previous selections by the time it reaches the checkout page.

#### **4. Token Limitations**
The model's **maximum sequence length of 20k tokens** limits its ability to handle long, continuous web interactions.

- *Example:* When filling out a very long multi-step form, the model might forget earlier responses, leading to errors.
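
To stay within the limit, callers typically drop the oldest interaction steps before building each prompt. The sketch below illustrates that idea under stated assumptions: `trim_history` and `approx_token_count` are hypothetical helpers, and the whitespace word count is only a crude stand-in for the model's real tokenizer, which should be used in practice.

```python
from typing import Callable, List

MAX_TOKENS = 20_000  # maximum sequence length stated above


def approx_token_count(text: str) -> int:
    """Crude stand-in: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def trim_history(
    steps: List[str],
    budget: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep the most recent steps whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest steps are kept first.
    for step in reversed(steps):
        cost = count_tokens(step)
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping old steps keeps the prompt valid but is also exactly how earlier form responses get forgotten, so in practice the budget should leave room for the task instruction and the current observation, and important earlier state may need to be summarized rather than discarded.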
## How to Get Started with the Model