Namyoung Kim committed on
Commit
12fa2c2
·
1 Parent(s): 76a0baa
index.html CHANGED
@@ -19,6 +19,34 @@
19
  <script src="./static/js/bulma-carousel.min.js"></script>
20
  <script src="./static/js/bulma-slider.min.js"></script>
21
  <script src="./static/js/index.js"></script>
22
  </head>
23
  <body>
24
 
@@ -81,26 +109,404 @@
81
  </div>
82
  </section>
83
 
 
84
  <section class="hero is-light">
85
  <div class="hero-body">
86
  <div class="container is-max-desktop">
87
  <div class="columns is-centered">
88
  <div class="column">
89
  <div class="content has-text-centered">
90
- <img src="./static/images/figure1_wma_overview.png" alt="WMA Overview">
91
  </div>
92
  </div>
93
  </div>
94
  </div>
95
  </div>
96
  </section>
97
 
98
  <section class="section">
99
  <div class="container is-max-desktop">
100
  <div class="columns is-centered">
101
  <div class="column">
102
- <h2 class="title is-2">Abstract</h2>
103
- <p>Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model." Motivated by this, our study starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4 and Claude-3.5-Sonnet). Then, we present a World-Model-Augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models to predict next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.</p>
104
  </div>
105
  </div>
106
  </div>
 
19
  <script src="./static/js/bulma-carousel.min.js"></script>
20
  <script src="./static/js/bulma-slider.min.js"></script>
21
  <script src="./static/js/index.js"></script>
22
+ <script type="text/javascript" async
23
+ src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
24
+ <script type="text/javascript" async
25
+ src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/3.2.0/es5/tex-mml-chtml.js"></script>
26
+ <script>
27
+ function showStep(stepNumber) {
28
+ // Hide all step contents
29
+ var stepContents = document.querySelectorAll('.step-content');
30
+ for (var i = 0; i < stepContents.length; i++) {
31
+ stepContents[i].style.display = 'none';
32
+ }
33
+
34
+ // Remove active class from all tabs
35
+ var tabs = document.querySelectorAll('.tabs li');
36
+ for (var i = 0; i < tabs.length; i++) {
37
+ tabs[i].classList.remove('is-active');
38
+ }
39
+
40
+ // Show the selected step content and activate its tab
41
+ document.getElementById('step' + stepNumber + '-content').style.display = 'block';
42
+ document.getElementById('step' + stepNumber + '-tab').classList.add('is-active');
43
+ }
44
+
45
+ // Initialize when DOM is fully loaded
46
+ document.addEventListener('DOMContentLoaded', function() {
47
+ showStep(1);
48
+ });
49
+ </script>
50
  </head>
51
  <body>
52
 
 
109
  </div>
110
  </section>
111
 
112
+ <!--
113
  <section class="hero is-light">
114
  <div class="hero-body">
115
  <div class="container is-max-desktop">
116
  <div class="columns is-centered">
117
  <div class="column">
118
  <div class="content has-text-centered">
119
+ <img src="static/images/figure1_wma_overview.png" alt="WMA Overview">
120
  </div>
121
  </div>
122
  </div>
123
  </div>
124
  </div>
125
+ </section> -->
126
+ <section class="section" style="padding-top: 0px;">
127
+ <div class="container is-max-desktop">
128
+ <div class="columns is-centered">
129
+ <div class="column">
130
+ <div class="box has-background-white" style="box-shadow: 0 0.5em 1em -0.125em rgba(10, 10, 10, 0.3), 0 0 0 1px rgba(10, 10, 10, 0.05);">
131
+ <h2 class="title is-2">Overview</h2>
132
+ <br>
133
+ <div class="content has-text-centered">
134
+ <img src="static/images/figure1_wma_overview.png" alt="WMA Web Agent">
135
+ </div>
136
+ <p>Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the <strong>"world model"</strong>. Motivated by this, our study starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4 and Claude-3.5-Sonnet). Then, we present a <strong>World-Model-Augmented (WMA) web agent</strong>, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models to predict next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on <a href="https://github.com/web-arena-x/webarena">WebArena</a> and <a href="https://github.com/OSU-NLP-Group/Mind2Web">Mind2Web</a> show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.
137
+ </p>
138
+ <br>
139
+ <div class="content">
140
+ <h3 class="title is-4">🌍 News</h3>
141
+ <ul>
142
+ <li><strong>[2025/01/22] WMA Web Agent is accepted to ICLR 2025!</strong></li>
143
+ <li><strong>[2024/06/12] WMA Web Agent is out!</strong></li>
144
+ </ul>
145
+ </div>
146
+ </div>
147
+ </div>
148
+ </div>
149
+ </div>
150
+ </section>
151
+
152
+ <section class="section">
153
+ <div class="container is-max-desktop">
154
+ <div class="columns is-centered">
155
+ <div class="column">
156
+ <h2 class="title is-2">Methodology</h2>
157
+ <div class="tabs">
158
+ <ul>
159
+ <li id="step1-tab" class="is-active"><a href="#" onclick="showStep(1); return false;">Phase I: World Model Training</a></li>
160
+ <li id="step2-tab"><a href="#" onclick="showStep(2); return false;">Phase II: Inference-Time Policy Optimization with the World Model</a></li>
161
+ </ul>
162
+ </div>
163
+ <div id="step1-content" class="content step-content">
164
+ <div class="content has-text-centered">
165
+ <img src="static/images/method_1.png" alt="World Model Training Methodology">
166
+ </div>
167
+ <h3 class="title is-4">Step I: Harvesting Agent-Environment Interaction Data</h3>
168
+ <p>
170
+ We start by collecting the dataset
171
+ \( \mathcal{D} = \sum^{n}_{t=1} \{ I, o_t, a_t, o_{t+1} \} \)
172
+ from the environment \( \mathcal{E} \) for training world models.
173
+ For that, we prompt an LLM as a web agent to achieve the goal provided in the user instruction \( I \),
174
+ by iteratively predicting an action \( a_t \) based on the current observation \( o_t \)
175
+ throughout all \( n \) time steps.
176
+ Consequently, we obtain \( \mathcal{D} \) from trajectory
177
+ \( \tau = \{o_1, a_1, o_2, ..., a_{n}, o_{n+1}\} \) based on \( I \),
178
+ and the corresponding environment states
179
+ \( \{s_1, ..., s_{n+1}\} \subset \mathcal{S} \)
180
+ obtained via the transition function \( \mathcal{T} \).
181
+ </p>
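To make the harvesting loop concrete, below is a minimal Python sketch. The `env` wrapper (with `reset()`/`step()`) and the LLM-backed `agent_policy` are hypothetical stand-ins for the actual agent stack, not the released code.

```python
# Minimal sketch of Step I: roll the agent out once and record one
# (I, o_t, a_t, o_{t+1}) tuple per time step.
def harvest_trajectory(env, agent_policy, instruction, max_steps=30):
    dataset = []
    o_t = env.reset()
    for _ in range(max_steps):
        a_t = agent_policy(instruction, o_t)  # LLM predicts an action from o_t
        o_next, done = env.step(a_t)          # environment applies transition T
        dataset.append({"I": instruction, "o_t": o_t, "a_t": a_t, "o_next": o_next})
        o_t = o_next
        if done:
            break
    return dataset
```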
184
+ <h3 class="title is-4">Step II: Transition-Focused Observation Abstraction</h3>
185
+ <p>
186
+ With the collected data
187
+ \( \mathcal{D} = \sum^{n}_{t=1} \{ I, o_t, a_t, o_{t+1} \} \),
188
+ it is intuitive to train LLM-based world models to predict \( o_{t+1} \),
189
+ which is expressed as text (e.g., HTML or an accessibility tree).
190
+ However, raw next observations are long and repeat many elements across time steps, which motivates the abstraction below.
+ </p>
191
+ <div class="has-text-centered">
192
+ <figure class="image" style="width: 80%; margin: 0 auto;">
193
+ <img src="static/images/figure_5.png" alt="Figure 5: Transition-Focused Observation Abstraction">
194
+ </figure>
195
+ </div>
196
+ <p>As shown in Figure 5, we first (i) apply the Hungarian algorithm
197
+ over a cost matrix to match elements between
198
+ \( o_t \) and \( o_{t+1} \), and (ii) mechanically transform the results into a list of state transitions
199
+ \( \Delta(o_t, o_{t+1}) \), marking <code>UPDATED</code>, <code>DELETED</code>, and <code>ADDED</code> elements on the page.
200
+ After that, we prompt an LLM to convert the extracted \( \Delta(o_t, o_{t+1}) \) into a free-form natural language
201
+ description \( \tilde{o}_{t+1} \), which highlights the difference between the new observation \( o_{t+1} \) and \( o_t \).
202
+ Replacing \( o_{t+1} \) in
203
+ \( \mathcal{D} = \{ I, o_t, a_t, o_{t+1} \} \) collected in Step I with the \( \tilde{o}_{t+1} \) acquired here,
204
+ we get a final dataset
205
+ \( \tilde{\mathcal{D}} = \sum^{n}_{t=1} \{ I, o_t, a_t, \tilde{o}_{t+1} \} \)
206
+ for training world models.</p>
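As a hedged illustration of the matching step, the sketch below pairs accessibility-tree elements across two observations with `scipy.optimize.linear_sum_assignment` (an implementation of the Hungarian algorithm) and buckets the results into UPDATED / DELETED / ADDED. The `(element_id, text)` representation and the toy cost function are assumptions, not the paper's exact cost matrix.

```python
# Sketch of transition-focused delta extraction between two observations,
# each given as a list of (element_id, text) pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def elem_cost(a, b):
    # Toy cost: same id and text = perfect match (0.0),
    # same id with changed text = cheap match (0.5), different ids = 1.0.
    if a[0] != b[0]:
        return 1.0
    return 0.0 if a[1] == b[1] else 0.5

def transition_delta(o_t, o_t1):
    cost = np.array([[elem_cost(a, b) for b in o_t1] for a in o_t])
    rows, cols = linear_sum_assignment(cost)          # Hungarian matching
    delta = {"UPDATED": [], "DELETED": [], "ADDED": []}
    matched_old, matched_new = set(), set()
    for r, c in zip(rows, cols):
        if cost[r, c] == 1.0:                         # forced pairing of unrelated ids
            continue
        matched_old.add(r)
        matched_new.add(c)
        if cost[r, c] == 0.5:
            delta["UPDATED"].append((o_t[r], o_t1[c]))
    delta["DELETED"] = [o_t[i] for i in range(len(o_t)) if i not in matched_old]
    delta["ADDED"] = [o_t1[j] for j in range(len(o_t1)) if j not in matched_new]
    return delta  # Delta(o_t, o_{t+1}); an LLM then verbalizes it into tilde{o}_{t+1}
```

For instance, `transition_delta([(1, "Cart (0)")], [(1, "Cart (1)"), (2, "Checkout")])` reports the cart counter as UPDATED and the checkout button as ADDED.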
207
+ <h3 class="title is-4">Step III: Learning Environment Dynamics</h3>
208
+ <p>
209
+ Lastly, using \( \tilde{\mathcal{D}} \), we proceed to train the internal world model \( \phi \) of the web agent
210
+ to learn the environment dynamics. Formally, an LLM working as the world model is trained to predict
211
+ the abstracted observation \( \tilde{o}_{t+1} \) of the next state \( s_{t+1} \), given three inputs:
212
+ the user instruction \( I \), the current observation \( o_t \), and the current action \( a_t \).
213
+ This LLM is trained to minimize the following loss term via the next-token prediction objective:
214
+ </p>
215
+ <p>
216
+ \[
217
+ \mathcal{L}_{\phi} = -\sum_{(\tilde{o}_{t+1}, o_t, a_t, I) \in \tilde{\mathcal{D}}} \log p(\tilde{o}_{t+1} \mid o_t, a_t, I)
218
+ \]
219
+ </p>
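A minimal training sketch with Hugging Face Transformers follows; the backbone checkpoint name is an assumed placeholder, and masking the conditioning prefix with -100 restricts the next-token loss to the tokens of \( \tilde{o}_{t+1} \).

```python
# Sketch of the Step III objective: causal-LM next-token loss on tilde{o}_{t+1} only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")    # assumed backbone
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

def world_model_loss(I, o_t, a_t, o_tilde_next):
    prompt_ids = tok(f"{I}\n{o_t}\n{a_t}\n", return_tensors="pt").input_ids
    target_ids = tok(o_tilde_next, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore the conditioning prefix in the loss
    return model(input_ids=input_ids, labels=labels).loss
```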
220
+ </div>
221
+ <div id="step2-content" class="content step-content" style="display: none;">
222
+ <div class="content has-text-centered">
223
+ <img src="static/images/method_2.png" alt="Inference-Time Policy Optimization with the World Model">
224
+ </div>
225
+ <p>
226
+ During inference at time \( t \) with the current observation \( o_t \), the WMA web agent utilizes the world model \( \phi \) to foresee how an action can affect the state (i.e., predict \( \tilde{o}_{t+1}^i \)), and accordingly finds an optimal action/policy \( a_t \) from the policy model \( \theta \) that leads to the target goal defined in \( I \).
227
+ </p>
228
+ <p>
229
+ We begin by sampling \( k \) action candidates
230
+ \( \{a_t^1, a_t^2, ..., a_t^k\} \) from \( \theta \) via top-\( p \) decoding, to conduct diverse exploration over future observations
231
+ \( \{o_{t+1}^1, o_{t+1}^2, ..., o_{t+1}^k\} \).
232
+ Next, we use the world model \( \phi \) to "<em>simulate</em>" the potential next observation \( \tilde{o}_{t+1}^i \) caused by each action candidate \( a_t^i \):
233
+ </p>
234
+
235
+ <p>
236
+ \[
237
+ \{\tilde{o}_{t+1}^i\}_{i=1}^k = \{\phi(o_t, a_t^i, I)\}_{i=1}^k
238
+ \]
239
+ </p>
240
+
241
+ <p>
242
+ Lastly, we decide the agent's final action by selecting the candidate that leads to the best future state \( s_{t+1} \).
243
+ We use an off-the-shelf LLM as a value function \( V(\cdot) \) to estimate the reward yielded by each action candidate
244
+ and select the action \( \hat{a}_t \) with the highest reward:
245
+ </p>
246
+ <p>
247
+ \[
248
+ \hat{a}_t = \underset{a_t^i \in \{a_t^1, ..., a_t^k\}}{\text{argmax}} \, V(I, o_t, a_t^i, \tilde{o}_{t+1}^i)
249
+ \]
250
+ </p>
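The whole inference loop can be sketched as below; `policy.sample`, `world_model.simulate`, and `value_fn` are hypothetical wrappers for the three LLM calls just described (top-\( p \) action sampling, next-observation simulation, and reward estimation), not actual released APIs.

```python
# Sketch of inference-time policy optimization with the world model.
def select_action(I, o_t, policy, world_model, value_fn, k=3, top_p=0.95):
    candidates = policy.sample(I, o_t, n=k, top_p=top_p)       # {a_t^1, ..., a_t^k}
    simulations = [world_model.simulate(o_t, a, I)             # {tilde{o}_{t+1}^i}
                   for a in candidates]
    rewards = [value_fn(I, o_t, a, o_sim)                      # V(I, o_t, a_t^i, tilde{o}_{t+1}^i)
               for a, o_sim in zip(candidates, simulations)]
    best = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best]                                    # hat{a}_t
```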
251
+ </div>
252
+ </div>
253
+ </div>
254
+ </div>
255
+ </section>
256
+
257
+ <section class="section" style="padding-top: 0; padding-bottom: 0;">
258
+ <div class="container is-max-desktop">
259
+ <div class="columns is-centered">
260
+ <div class="column">
261
+ <hr style="height: 2px; background-color: #dbdbdb; margin: 2rem 0;">
262
+ </div>
263
+ </div>
264
+ </div>
265
+ </section>
266
+
267
+ <section class="section">
268
+ <div class="container is-max-desktop">
269
+ <div class="columns is-centered">
270
+ <div class="column">
271
+ <h2 class="title is-2">Experiments Setup</h2>
272
+ <div class="content">
273
+ <h3 class="title is-4">Benchmarks and evaluation metrics</h3>
274
+ <p>For evaluation, we use the official <a href="https://github.com/web-arena-x/webarena">WebArena</a> and
275
+ <a href="https://github.com/OSU-NLP-Group/Mind2Web">Mind2Web</a> benchmarks. WebArena includes 812 real-life tasks
276
+ in simulated environments across five different websites: e-commerce
277
+ (Shopping), social forums (Reddit), collaborative software development (Gitlab), content
278
+ management (CMS), and Map. The main
279
+ metric, Success Rate (SR), is the percentage of user instructions that are successfully
280
+ accomplished by the generated agent trajectory. Mind2Web covers over 2,000 open-ended tasks collected from 137 websites across 31 domains, along with
281
+ crowd-sourced action sequences for these tasks. Along with SR, Mind2Web also uses Step SR, which
282
+ measures whether the predicted action selects both the correct action type (action F1) and the correct element
283
+ ID (element accuracy). A trajectory is evaluated as a success only when the agent succeeds at every step.</p>
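As a hedged sketch of how these metrics compose, assuming per-step records that carry predicted and gold action types and element IDs:

```python
# Sketch of Mind2Web-style metrics for one trajectory (`steps` is an assumed
# list of dicts with predicted/gold action types and element IDs).
def step_success(s):
    return s["pred_type"] == s["gold_type"] and s["pred_elem"] == s["gold_elem"]

def step_sr(steps):
    return sum(map(step_success, steps)) / len(steps)

def success(steps):
    # The trajectory counts toward SR only if every step succeeds.
    return all(step_success(s) for s in steps)
```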
284
+ </div>
285
+ </div>
286
+ </div>
287
+ </div>
288
+ </section>
289
+
290
+ <section class="section" style="padding-top: 0; padding-bottom: 0;">
291
+ <div class="container is-max-desktop">
292
+ <div class="columns is-centered">
293
+ <div class="column">
294
+ <hr style="height: 2px; background-color: #dbdbdb; margin: 2rem 0;">
295
+ </div>
296
+ </div>
297
+ </div>
298
  </section>
299
 
300
  <section class="section">
301
  <div class="container is-max-desktop">
302
  <div class="columns is-centered">
303
  <div class="column">
304
+ <h2 class="title is-2">Results</h2>
305
+ <div class="content">
306
+ <h3 class="title is-4">Agent Performance in WebArena</h3>
307
+ <div class="has-text-centered">
308
+ <figure class="image">
309
+ <img src="static/images/table_1.png" alt="Table 1">
310
+ </figure>
311
+ <figure class="image">
312
+ <img src="static/images/table_2.png" alt="Table 2" style="width: 90%; margin: 0 auto;">
313
+ </figure>
314
+ </div>
315
+ <br>
316
+ <p>From our experiments in Table 1 and Table 2, we observe the following:</p>
317
+ <ul>
318
+ <li><strong>WMA vs. Vanilla CoT</strong>
319
+ <ul>
320
+ <li>WMA web agent achieves a 16.6% success rate compared to 13.1% for vanilla CoT.</li>
321
+ <li>Significant improvements are observed across almost all domains in WebArena (see Table 2).</li>
322
+ </ul>
323
+ </li>
324
+ <br>
325
+ <li><strong>Performance Gains with GPT-4o-mini</strong>
326
+ <ul>
327
+ <li>181% performance gain over CoT in the Gitlab domain.</li>
328
+ <li>92% performance gain over CoT in the Map domain.</li>
329
+ </ul>
330
+ </li>
331
+ <br>
332
+ <li><strong>Comparison with Tree Search Agent (Koh et al., 2024)</strong>
333
+ <ul>
334
+ <li>The Tree search agent has a slightly higher absolute success rate (19.2%) compared to the WMA agent (16.6%).</li>
335
+ <li>The WMA agent shows a larger performance improvement over vanilla CoT (+29.7%) than the Tree search agent (+28.0%).</li>
336
+ </ul>
337
+ </li>
338
+ </ul>
339
+ </div>
340
+ <br>
341
+ <div class="content">
342
+ <h3 class="title is-4">Agent Performance in Mind2Web</h3>
343
+ <figure class="image">
344
+ <img src="static/images/table_3.png" alt="Table 3">
345
+ </figure>
346
+ <p>From our experiments in Table 3, we observe the following:</p>
347
+ <ul>
348
+ <li><strong>Comparison with Previous SOTA Methods</strong>
349
+ <ul>
350
+ <li>WMA web agent is compared with MindAct (Deng et al., 2024) and AWM (Wang et al., 2024b).</li>
351
+ <li>WMA web agent significantly outperforms AWM, achieving new SOTA performance.</li>
352
+ </ul>
353
+ </li>
354
+ <br>
355
+ <li><strong>Generalization Capability of WMA</strong>
356
+ <ul>
357
+ <li>WMA web agent, trained on Mind2Web data, shows strong generalization capabilities.</li>
358
+ <li>This makes our approach much more valuable in scenarios where
359
+ collecting data for new web environments is non-trivial.</li>
360
+ </ul>
361
+ </li>
362
+ </ul>
363
+ </div>
364
+ </div>
365
+ </div>
366
+ </div>
367
+ </section>
368
+
369
+ <section class="section" style="padding-top: 0; padding-bottom: 0;">
370
+ <div class="container is-max-desktop">
371
+ <div class="columns is-centered">
372
+ <div class="column">
373
+ <hr style="height: 2px; background-color: #dbdbdb; margin: 2rem 0;">
374
+ </div>
375
+ </div>
376
+ </div>
377
+ </section>
378
+
379
+ <section class="section">
380
+ <div class="container is-max-desktop">
381
+ <div class="columns is-centered">
382
+ <div class="column">
383
+ <h2 class="title is-2">Analysis</h2>
384
+ <div class="content">
385
+ <h3 class="title is-4">Time and Cost Efficiency</h3>
386
+ <div class="has-text-centered">
387
+ <figure class="image">
388
+ <img src="static/images/table_4.png" alt="Table 4">
389
+ </figure>
390
+ </div>
391
+ <ul>
392
+ <li><strong>Time Efficiency</strong>
393
+ <ul>
394
+ <li>The tree search agent takes an average of 748.3 seconds per user instruction due to state exploration and backtracking.</li>
395
+ <li>The WMA web agent completes the same task in only 140.3 seconds by simulating actions instead of executing them.</li>
396
+ <li>WMA is 5.3 times faster than the tree search agent.</li>
397
+ </ul>
398
+ </li>
399
+ <br>
400
+ <li><strong>API Cost Efficiency</strong>
401
+ <ul>
402
+ <li>The tree search agent incurs 6.8 times higher API costs than the WMA agent due to its multi-modal inputs.</li>
403
+ </ul>
404
+ </li>
405
+ </ul>
406
+ </div>
407
+ <br>
408
+ <div class="content">
409
+ <h3 class="title is-4">Ablation Study</h3>
410
+ <p>
411
+ We conduct several ablation studies on our WMA web agent with 200 randomly sampled instances from WebArena (Shopping: 50; Gitlab: 50; Map: 100). We use GPT-4o-mini as the policy model.
412
+ </p>
413
+ <div class="has-text-centered">
414
+ <figure class="image" style="width: 90%; margin: 0 auto;">
415
+ <img src="static/images/table_5.png" alt="Table 5">
416
+ </figure>
417
+ </div>
418
+ <br>
419
+ <p>
420
+ We observe the following findings in Table 5:
421
+ </p>
+ <ul>
422
+ <li>Accessing simulated next states during reward estimation improves agent performance.</li>
423
+ <li>Fine-tuning produces better world models than prompting-based approaches.</li>
424
+ <li>Abstracting observations elicits better next-state prediction.</li>
425
+ </ul>
427
+ <br>
428
+ <div class="has-text-centered">
429
+ <div class="columns is-centered">
430
+ <div class="column is-half has-text-centered">
431
+ <figure class="image" style="width: 80%; margin: 0 auto;">
432
+ <img src="static/images/table_6.png" alt="Table 6">
433
+ </figure>
434
+ </div>
435
+ <div class="column is-half has-text-centered">
436
+ <figure class="image" style="width: 80%; margin: 0 auto;">
437
+ <img src="static/images/figure_6.png" alt="Figure 6: Qualitative Analysis">
438
+ </figure>
439
+ </div>
440
+ </div>
441
+ </div>
442
+ <p>
443
+ Additionally, Table 6 and Figure 6 reveal the following findings:
444
+ </p>
+ <ul>
445
+ <li>Fine-tuning the value function is a reasonable alternative when API budgets are limited.</li>
446
+ <li>Our WMA web agent may benefit from exploring more future states when the budget allows.</li>
447
+ </ul>
449
+ </div>
450
+ </div>
451
+ </div>
452
+ </div>
453
+ </section>
454
+
455
+ <section class="section" style="padding-top: 0; padding-bottom: 0;">
456
+ <div class="container is-max-desktop">
457
+ <div class="columns is-centered">
458
+ <div class="column">
459
+ <hr style="height: 2px; background-color: #dbdbdb; margin: 2rem 0;">
460
+ </div>
461
+ </div>
462
+ </div>
463
+ </section>
464
+
465
+ <section class="section">
466
+ <div class="container is-max-desktop">
467
+ <div class="columns is-centered">
468
+ <div class="column">
469
+ <h2 class="title is-2">Case Study</h2>
470
+ <div class="content">
471
+ <p>
472
+ The WMA web agent successfully performs inference on the Gitlab
473
+ domain of the WebArena benchmark (instance #175). Using GPT-4o as the policy model, the WMA
474
+ web agent selects the most appropriate action, <code>click [88]</code>, by leveraging its learned environment dynamics.
475
+ </p>
476
+ <div class="has-text-centered">
477
+ <figure class="image" style="width: 100%; margin: 0 auto;">
478
+ <img src="static/images/case.png" alt="Case Study Example">
479
+ </figure>
480
+ </div>
481
+ </div>
482
+ </div>
483
+ </div>
484
+ </div>
485
+ </section>
486
+
487
+
488
+ <section class="section" style="padding-top: 0; padding-bottom: 0;">
489
+ <div class="container is-max-desktop">
490
+ <div class="columns is-centered">
491
+ <div class="column">
492
+ <hr style="height: 2px; background-color: #dbdbdb; margin: 2rem 0;">
493
+ </div>
494
+ </div>
495
+ </div>
496
+ </section>
497
+
498
+ <section class="section">
499
+ <div class="container is-max-desktop">
500
+ <div class="columns is-centered">
501
+ <div class="column">
502
+ <h2 class="title is-2">Citation</h2>
503
+ <div class="content">
504
+ <pre style="white-space: pre-wrap; word-wrap: break-word;"><code>@inproceedings{chae2024web,
505
+ title={Web agents with world models: Learning and leveraging environment dynamics in web navigation},
506
+ author={Chae, Hyungjoo and Kim, Namyoung and Ong, Kai Tzu-iunn and Gwak, Minju and Song, Gwanwoo and Kim, Jihoon and Kim, Sunghwan and Lee, Dongha and Yeo, Jinyoung},
507
+ booktitle={The Thirteenth International Conference on Learning Representations},
508
+ year={2025}
+ }</code></pre>
509
+ </div>
510
  </div>
511
  </div>
512
  </div>
static/images/case.png ADDED
static/images/figure_5.png ADDED
static/images/figure_6.png ADDED
static/images/method_1.png ADDED

Git LFS Details

  • SHA256: 6feb799db9fe0edfb9d7b89d1310d0cc97830d15498e71bfc4315ceb24f2c948
  • Pointer size: 131 Bytes
  • Size of remote file: 979 kB
static/images/method_2.png ADDED

Git LFS Details

  • SHA256: fa4e53f32b05ffcce55ebe3af5192f43b29fcad6e3b03cbf6d25da3a1495f5e4
  • Pointer size: 131 Bytes
  • Size of remote file: 984 kB
static/images/table_1.png ADDED
static/images/table_2.png ADDED
static/images/table_3.png ADDED
static/images/table_4.png ADDED
static/images/table_5.png ADDED
static/images/table_6.png ADDED