---
title: BrowserGym Leaderboard
emoji: 🏆
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: mit
---

BrowserGym Leaderboard

This leaderboard tracks the performance of various agents on web navigation tasks.

How to Submit Results for New Agents

1. Create Results Directory

Create a new folder in the results directory with your agent's name:

results/
└── your-agent-name/
    ├── README.md
    ├── webarena.json
    ├── workarena-l1.json
    ├── workarena-l2.json
    ├── workarena-l3.json
    └── miniwob.json
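
If you prefer to script this step, the minimal Python sketch below (not part of the leaderboard tooling) creates the folder and empty placeholder files; the agent name is a placeholder, and it assumes you run it from the repository root:

from pathlib import Path

# Placeholder agent name; replace with your own.
agent_name = "your-agent-name"

agent_dir = Path("results") / agent_name
agent_dir.mkdir(parents=True, exist_ok=True)

# Create the README and one empty results file per benchmark.
for filename in [
    "README.md",
    "webarena.json",
    "workarena-l1.json",
    "workarena-l2.json",
    "workarena-l3.json",
    "miniwob.json",
]:
    (agent_dir / filename).touch()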

2. Add Agent Details

Create a README.md in your agent's folder with the following details:

Required Information

  • Model Name: Base model used (e.g., GPT-4, Claude-2)
  • Model Architecture: Architecture details and any modifications
  • Input/Output Format: How inputs are processed and outputs generated
  • Training Details: Training configuration if applicable
    • Dataset used
    • Number of training steps
    • Hardware used
    • Training time

Optional Information

  • Paper Link: Link to published paper/preprint if available
  • Code Repository: Link to public code implementation
  • Additional Notes: Any special configurations or requirements
  • License: License information for your agent

Make sure to organize the information in clear sections using Markdown.
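
For reference, a minimal README.md skeleton covering these sections might look like the following; all values are placeholders for a hypothetical agent:

# Your Agent Name

## Model Name
GPT-4 (placeholder)

## Model Architecture
Short description of the architecture and any modifications.

## Input/Output Format
How observations are processed and how actions are generated.

## Training Details
- Dataset used: ...
- Number of training steps: ...
- Hardware used: ...
- Training time: ...

## Optional Information
- Paper Link: ...
- Code Repository: ...
- Additional Notes: ...
- License: ...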

3. Add Benchmark Results

Create separate JSON files for each benchmark following this format:

[
    {
        "agent_name": "your-agent-name",
        "study_id": "unique-study-identifier-from-agentlab", 
        "date_time": "YYYY-MM-DD HH:MM:SS",
        "benchmark": "WebArena",
        "score": 0.0,
        "std_err": 0.0,
        "benchmark_specific": "Yes/No",
        "benchmark_tuned": "Yes/No",
        "followed_evaluation_protocol": "Yes/No", 
        "reproducible": "Yes/No",
        "comments": "Additional details",
        "original_or_reproduced": "Original"
    }
]

Please add the results for each benchmark as a separate JSON file, named as follows:

  • webarena.json
  • workarena-l1.json
  • workarena-l2.json
  • workarena-l3.json
  • miniwob.json

Each file must contain a JSON array with a single object following the format above. The benchmark field in each file must exactly match one of the benchmark names (WebArena, WorkArena-L1, WorkArena-L2, WorkArena-L3, MiniWoB), and the filename must be the lowercase benchmark name followed by .json.
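
Before opening a PR, you may want to sanity-check your files. The minimal Python sketch below is illustrative, not part of the leaderboard tooling (check_results_file and the filename-to-benchmark mapping are assumptions for this example); it verifies that each file is a JSON array whose entries contain the required fields and a benchmark value matching the filename:

import json
from pathlib import Path

# Filename -> expected value of the "benchmark" field, following the naming rule above.
EXPECTED_BENCHMARKS = {
    "webarena.json": "WebArena",
    "workarena-l1.json": "WorkArena-L1",
    "workarena-l2.json": "WorkArena-L2",
    "workarena-l3.json": "WorkArena-L3",
    "miniwob.json": "MiniWoB",
}

REQUIRED_KEYS = {
    "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced",
}

def check_results_file(path: Path) -> None:
    # Raises AssertionError if the file does not follow the expected format.
    entries = json.loads(path.read_text())
    assert isinstance(entries, list), f"{path.name}: top level must be a JSON array"
    for entry in entries:
        missing = REQUIRED_KEYS - set(entry)
        assert not missing, f"{path.name}: missing keys {sorted(missing)}"
        expected = EXPECTED_BENCHMARKS[path.name]
        assert entry["benchmark"] == expected, (
            f"{path.name}: benchmark must be {expected!r}, got {entry['benchmark']!r}"
        )

# Example usage for a hypothetical agent folder:
for name in EXPECTED_BENCHMARKS:
    check_results_file(Path("results") / "your-agent-name" / name)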

4. Submit PR

  1. Open the community tab and press "New Pull Request"
  2. Give the PR a descriptive title and follow the steps shown
  3. Publish the branch

How to Submit Reproducibility Results for Existing Agents

Open the results file for the agent and benchmark whose results you reproduced.

1. Add reproduced results

Append the following entry to the corresponding JSON file. Ensure you set original_or_reproduced to "Reproduced".

[
    {
        "agent_name": "your-agent-name",
        "study_id": "unique-study-identifier-from-agentlab", 
        "date_time": "YYYY-MM-DD HH:MM:SS",
        "benchmark": "WebArena",
        "score": 0.0,
        "std_err": 0.0,
        "benchmark_specific": "Yes/No",
        "benchmark_tuned": "Yes/No",
        "followed_evaluation_protocol": "Yes/No", 
        "reproducible": "Yes/No",
        "comments": "Additional details",
        "original_or_reproduced": "Reproduced"
    }
]
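
If you prefer to append the entry programmatically, a minimal Python sketch might look like this; the agent name, file path, and field values are placeholders to adjust for the agent and benchmark you reproduced:

import json
from pathlib import Path

# Placeholder path; point this at the agent/benchmark you reproduced.
results_file = Path("results") / "existing-agent-name" / "webarena.json"

# Placeholder values; fill in your actual study details.
new_entry = {
    "agent_name": "existing-agent-name",
    "study_id": "unique-study-identifier-from-agentlab",
    "date_time": "YYYY-MM-DD HH:MM:SS",
    "benchmark": "WebArena",
    "score": 0.0,
    "std_err": 0.0,
    "benchmark_specific": "Yes/No",
    "benchmark_tuned": "Yes/No",
    "followed_evaluation_protocol": "Yes/No",
    "reproducible": "Yes/No",
    "comments": "Additional details",
    "original_or_reproduced": "Reproduced",  # must be "Reproduced" for this workflow
}

entries = json.loads(results_file.read_text())
entries.append(new_entry)
results_file.write_text(json.dumps(entries, indent=4) + "\n")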

2. Submit PR

  1. Open the community tab and press "New Pull Request"
  2. Give the PR a descriptive title and follow the steps shown
  3. Publish the branch

License

MIT