metadata
library_name: stable-baselines3
tags:
  - LunarLander-v2
  - deep-reinforcement-learning
  - reinforcement-learning
  - stable-baselines3
model-index:
  - name: ppo-LunarLander-v2_010_000_000_hf_defaults
    results:
      - task:
          type: reinforcement-learning
          name: reinforcement-learning
        dataset:
          name: LunarLander-v2
          type: LunarLander-v2
        metrics:
          - type: mean_reward
            value: 311.61 +/- 6.23
            name: mean_reward
            verified: false
license: mit

PPO Agent playing LunarLander-v2

This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.

Training

When I first started training, I experimented with different hyperparameter values to see whether some gave better results than others. I ended up using the defaults provided by Hugging Face (HF), although in my runs the difference in results between those defaults and the Stable Baselines3 (SB3) defaults was not large.

| Defaults name | n_steps | batch_size | n_epochs | gamma | gae_lambda | ent_coef |
| --- | --- | --- | --- | --- | --- | --- |
| Hugging Face Defaults (hf_defaults) | 1,024 | 64 | 8 | 0.999 | 0.98 | 0.01 |
| SB3 Defaults (sb3_defaults) | 2,048 | 64 | 10 | 0.99 | 0.95 | 0.0 |
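
As a minimal sketch, the two rows above can be written as keyword-argument dictionaries for SB3's `PPO` constructor. The dictionary names and the `make_model` and `differing_keys` helpers are illustrative assumptions, not code from this repository, and `"MlpPolicy"` is also an assumption, since the policy is not stated here.

```python
# Hyperparameter sets from the table above, keyed by the "defaults name".
HF_DEFAULTS = {
    "n_steps": 1024,
    "batch_size": 64,
    "n_epochs": 8,
    "gamma": 0.999,
    "gae_lambda": 0.98,
    "ent_coef": 0.01,
}
SB3_DEFAULTS = {
    "n_steps": 2048,
    "batch_size": 64,
    "n_epochs": 10,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "ent_coef": 0.0,
}
DEFAULTS = {"hf_defaults": HF_DEFAULTS, "sb3_defaults": SB3_DEFAULTS}


def make_model(env, defaults_name):
    """Hypothetical helper: build a PPO model from one of the two sets."""
    # Imported lazily so the dictionaries can be inspected without SB3 installed.
    from stable_baselines3 import PPO

    # "MlpPolicy" is an assumption; the policy is not named in this document.
    return PPO("MlpPolicy", env, **DEFAULTS[defaults_name])


def differing_keys(a, b):
    """Return the hyperparameter names whose values differ between two sets."""
    return sorted(key for key in a if a[key] != b[key])
```

Calling `differing_keys(HF_DEFAULTS, SB3_DEFAULTS)` lists every hyperparameter except `batch_size`, which is the only value the two sets share.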

Models

I decided to train and upload four models to compare training lengths. I expected 1,000,000 (1M) timesteps to be insufficient, and 123,456,789 (123M) timesteps to be excessively time-consuming without a significant improvement in results. I believed 10,000,000 (10M) timesteps would offer a reasonable balance between training duration and outcomes. For the 10M runs, I trained with both the Hugging Face and the Stable Baselines3 defaults.

| Number | Model name | timesteps | Defaults |
| --- | --- | --- | --- |
| 1 | ppo-LunarLander-v2_001_000_000_hf_defaults | 1,000,000 | hf_defaults |
| 2 | ppo-LunarLander-v2_010_000_000_hf_defaults | 10,000,000 | hf_defaults |
| 3 | ppo-LunarLander-v2_010_000_000_sb3_defaults | 10,000,000 | sb3_defaults |
| 4 | ppo-LunarLander-v2_123_456_789_hf_defaults | 123,456,789 | hf_defaults |
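
The model names follow a regular pattern: the environment id, the timestep count zero-padded to nine digits in groups of three, and the defaults name. A small sketch of that pattern, where `model_name` and `RUNS` are hypothetical names introduced only for illustration:

```python
def model_name(timesteps, defaults_name, env_id="LunarLander-v2"):
    """Build a name such as 'ppo-LunarLander-v2_010_000_000_hf_defaults'."""
    # Zero-pad to 9 digits with thousands separators (1000000 -> '001,000,000'),
    # then swap the commas for underscores.
    padded = format(timesteps, "011,").replace(",", "_")
    return f"ppo-{env_id}_{padded}_{defaults_name}"


# The four (timesteps, defaults) combinations from the table above.
RUNS = [
    (1_000_000, "hf_defaults"),
    (10_000_000, "hf_defaults"),
    (10_000_000, "sb3_defaults"),
    (123_456_789, "hf_defaults"),
]
NAMES = [model_name(timesteps, defaults) for timesteps, defaults in RUNS]
```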

Evaluation

I evaluated the four models using two approaches:

  • Search: Search through a lot of different random environments for a good seed
  • Average: Average over a lot of different random environments

The code in evaluate.py shows how the models are evaluated and how the results are stored. All results are included in the evaluation_results.csv file. The result is mean_reward - std_reward, but I also store mean_reward, std_reward, seed, and n_envs.
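
To make the stored values concrete, here is a minimal sketch of how one row could be computed and written. It is not a copy of evaluate.py: the `summarize` and `write_rows` helpers and the column order are assumptions, and the population standard deviation is used because that is what `evaluate_policy` returns (it applies NumPy's `std` to the episode rewards).

```python
import csv
import statistics

FIELDNAMES = ["result", "mean_reward", "std_reward", "seed", "n_envs"]


def summarize(episode_rewards, seed, n_envs):
    """Turn the per-episode rewards of one evaluation run into one row."""
    mean_reward = statistics.fmean(episode_rewards)
    # Population standard deviation, matching np.std's default (ddof=0).
    std_reward = statistics.pstdev(episode_rewards)
    return {
        "result": mean_reward - std_reward,
        "mean_reward": mean_reward,
        "std_reward": std_reward,
        "seed": seed,
        "n_envs": n_envs,
    }


def write_rows(rows, fileobj):
    """Write the rows as CSV with a header line to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

For example, two episodes with rewards 200 and 300 have a mean of 250 and a standard deviation of 50, so the stored result would be 200.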

Results

| Model name | Number of results | Min | Max | Average |
| --- | --- | --- | --- | --- |
| ppo-LunarLander-v2_001_000_000_hf_defaults | 4136 | 144.712 | 269.721 | 240.895 |
| ppo-LunarLander-v2_010_000_000_hf_defaults | 4136 | 130.43 | 305.384 | 270.451 |
| ppo-LunarLander-v2_010_000_000_sb3_defaults | 4136 | 87.9966 | 298.898 | 269.568 |
| ppo-LunarLander-v2_123_456_789_hf_defaults | 4136 | 141.814 | 302.567 | 268.735 |
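
A summary like the one above could be derived from per-run rows roughly as follows. This is a sketch under assumptions: the column names `model_name` and `result` are guesses at the layout of evaluation_results.csv, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict


def aggregate(rows):
    """Group rows by model and return {model: (count, min, max, average)}."""
    by_model = defaultdict(list)
    for row in rows:
        # Values read from a CSV file arrive as strings, so convert them.
        by_model[row["model_name"]].append(float(row["result"]))
    return {
        model: (len(values), min(values), max(values), sum(values) / len(values))
        for model, values in by_model.items()
    }
```

With the real file, the rows could come from `csv.DictReader`, provided the column names match the assumed ones.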

Conclusion

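As suspected, the 1M model performed the worst, with an average of 240.90. I don't think there are significant differences between the two 10M models and the 123M model: their averages (270.45, 269.57, and 268.74) are within about 2 points of each other.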

Disclaimer regarding the evaluation result

I'm not fond of the randomness in the current method of evaluating a model. As you can see, I tested the same model with different seeds and numbers of parallel environments, and the results varied considerably. I have neither manually adjusted the score upward nor used a lower value for n_eval_episodes. The latter would make it easier to get a better result, as there would be less to average over. But, as can be seen in evaluation_results.csv, I did "mine" for a good seed to share.

A better way to evaluate the models?

Perhaps we should average over more environments? Wouldn't this give a result less prone to the randomness of the environments? When averaging over the environments, I get a much more stable result, so I think this could be a better way of evaluating models for a leaderboard. In short: n_eval_episodes=10, averaged over at least 10 different random environments.
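
A rough sketch of the proposal, contrasting the averaged score with the best single seed. Here `evaluate_fn` is a hypothetical callable that would seed an environment, run `evaluate_policy` for `n_eval_episodes` episodes, and return `(mean_reward, std_reward)`; it is an assumption for illustration, not part of SB3 or of this repository.

```python
import statistics


def average_over_seeds(evaluate_fn, seeds, n_eval_episodes=10):
    """Return (average result, best result) across several seeded runs."""
    results = []
    for seed in seeds:
        mean_reward, std_reward = evaluate_fn(seed, n_eval_episodes)
        # Same score as elsewhere in this document: mean minus standard deviation.
        results.append(mean_reward - std_reward)
    # The average is the proposed score; the maximum is what seed mining reports.
    return statistics.fmean(results), max(results)
```

The best single result can never be lower than the average, which is why a mined seed tends to overstate how well a model performs.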

Usage (with Stable-baselines3)

import gymnasium as gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env_id = "LunarLander-v2"

# Download the model checkpoint from the Hugging Face Hub
model_fp = load_from_hub(
    "jostyposty/drl-course-unit-01-lunar-lander-v2",
    "ppo-LunarLander-v2_010_000_000_hf_defaults.zip",
)

# Load the agent and wrap the environment in Monitor so episode rewards are recorded
model = PPO.load(model_fp, print_system_info=True)
eval_env = Monitor(gym.make(env_id))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
# The reported result is mean_reward - std_reward
print(f"results: {mean_reward - std_reward:.2f}")
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")