library_name: stable-baselines3
tags:
- LunarLander-v2
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: ppo-LunarLander-v2_010_000_000_hf_defaults
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: 311.61 +/- 6.23
      name: mean_reward
      verified: false
license: mit
PPO Agent playing LunarLander-v2
This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.
Training
When I first started training, I experimented with different parameter values to see whether any of them gave better results. I ended up using the defaults provided by Hugging Face (HF), but in my runs the differences in results between those defaults and the defaults from Stable Baselines3 (SB3) were not large.
Defaults name | n_steps | batch_size | n_epochs | gamma | gae_lambda | ent_coef |
---|---|---|---|---|---|---|
Hugging Face Defaults (hf_defaults) | 1,024 | 64 | 8 | 0.999 | 0.98 | 0.01 |
SB3 Defaults (sb3_defaults) | 2,048 | 64 | 10 | 0.99 | 0.95 | 0.0 |
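The two rows of the table can be written as keyword arguments to SB3's PPO constructor. This is a minimal sketch, not the training script used for these models: the `train` helper, the `MlpPolicy` choice, and `n_envs=16` are assumptions made for illustration.

```python
# Hyperparameter sets copied from the table above.
HF_DEFAULTS = {
    "n_steps": 1024,
    "batch_size": 64,
    "n_epochs": 8,
    "gamma": 0.999,
    "gae_lambda": 0.98,
    "ent_coef": 0.01,
}
SB3_DEFAULTS = {
    "n_steps": 2048,
    "batch_size": 64,
    "n_epochs": 10,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "ent_coef": 0.0,
}


def train(defaults, total_timesteps, env_id="LunarLander-v2", n_envs=16):
    """Hypothetical helper: train PPO with one of the hyperparameter sets.

    The policy type and n_envs are assumptions, not taken from the model card.
    """
    # Imported here so the dictionaries above can be inspected without SB3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    env = make_vec_env(env_id, n_envs=n_envs)
    model = PPO("MlpPolicy", env, verbose=1, **defaults)
    model.learn(total_timesteps=total_timesteps)
    return model
```

For example, `train(HF_DEFAULTS, 10_000_000)` would correspond to the hf_defaults setup with 10M timesteps, under the assumptions stated above.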
Models
I decided to train and upload four models to test the following assumptions: that 1,000,000 (1M) timesteps is insufficient, that 123,456,789 (123M) timesteps is excessively time-consuming without a significant improvement in results, and that 10,000,000 (10M) timesteps offers a reasonable balance between training duration and outcomes. I used the defaults from both Hugging Face and Stable Baselines3 when training with 10M timesteps.
Number | Model name | timesteps | Defaults |
---|---|---|---|
1 | ppo-LunarLander-v2_001_000_000_hf_defaults | 1,000,000 | hf_defaults |
2 | ppo-LunarLander-v2_010_000_000_hf_defaults | 10,000,000 | hf_defaults |
3 | ppo-LunarLander-v2_010_000_000_sb3_defaults | 10,000,000 | sb3_defaults |
4 | ppo-LunarLander-v2_123_456_789_hf_defaults | 123,456,789 | hf_defaults |
Evaluation
I evaluated the four models using two approaches:
- Search: Searching through many different random environments for a good seed
- Average: Averaging over many different random environments
The code in evaluate.py shows how the models are evaluated and how the results are stored. All the results are included in the evaluation_results.csv file. The result is mean_reward - std_reward, but I also store mean_reward, std_reward, seed, and n_envs.
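The difference between the two approaches can be sketched as follows. This is not the code from evaluate.py: the function names are hypothetical, and `evaluate` stands for any callable that takes a seed and returns a `(mean_reward, std_reward)` pair. The actual evaluation also varies n_envs, which is left out here for brevity.

```python
from statistics import mean


def result(mean_reward, std_reward):
    """Score as described above: mean_reward - std_reward."""
    return mean_reward - std_reward


def search(evaluate, seeds):
    """Search approach: return (best_seed, best_result) over all seeds."""
    scores = {seed: result(*evaluate(seed)) for seed in seeds}
    best_seed = max(scores, key=scores.get)
    return best_seed, scores[best_seed]


def average(evaluate, seeds):
    """Average approach: return the mean result over all seeds."""
    return mean(result(*evaluate(seed)) for seed in seeds)
```

The search approach reports the single most favorable seed, while the average approach reports what one can expect from a typical seed.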
Results
Model name | Number of results | Min | Max | Average |
---|---|---|---|---|
ppo-LunarLander-v2_001_000_000_hf_defaults | 4136 | 144.712 | 269.721 | 240.895 |
ppo-LunarLander-v2_010_000_000_hf_defaults | 4136 | 130.43 | 305.384 | 270.451 |
ppo-LunarLander-v2_010_000_000_sb3_defaults | 4136 | 87.9966 | 298.898 | 269.568 |
ppo-LunarLander-v2_123_456_789_hf_defaults | 4136 | 141.814 | 302.567 | 268.735 |
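A summary like the table above could be computed from evaluation_results.csv roughly as follows. This is a sketch under assumptions: the column names `model_name` and `result` are hypothetical and may differ from the headers in the actual file.

```python
import csv
from collections import defaultdict
from statistics import mean


def summarize(csv_file, model_col="model_name", result_col="result"):
    """Group results by model and compute count, min, max, and average.

    csv_file is an open text file (or any iterable of CSV lines) with a
    header row. The default column names are assumptions.
    """
    per_model = defaultdict(list)
    for row in csv.DictReader(csv_file):
        per_model[row[model_col]].append(float(row[result_col]))
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "average": mean(values),
        }
        for name, values in per_model.items()
    }
```

To use it, open the file with `open("evaluation_results.csv", newline="")` and pass the file object to `summarize`.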
Conclusion
As suspected, the 1M model performed the worst, with an average result of 240.895. I don't think there are significant differences between the two 10M models and the 123M model, whose average results are 270.451, 269.568, and 268.735.
Disclaimer regarding the evaluation result
I am not fond of the randomness in the current method for evaluating the model. As you can see, I tested the same model with different seeds and different numbers of parallel environments, and I got quite varying results. I have not manually edited the score to improve it, nor have I used a lower number for n_eval_episodes. The latter would give a better result, as there would be fewer episodes to average over. However, as can be seen in evaluation_results.csv, I did "mine" for a good seed for the score I share.
A better way to evaluate the models?
Perhaps we should average over more environments? This should give a result that is less prone to the randomness of the environments: when averaging over the environments, I get a much more stable result. So I think this could be a better way of evaluating the results for use in a leaderboard. In short: n_eval_episodes=10, averaged over at least 10 different random environments.
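One way this proposal might look with SB3 is sketched below. The `averaged_result` helper is hypothetical and is not part of evaluate.py. It seeds a fresh vectorized environment per seed with `make_vec_env`, scores each with `evaluate_policy`, and averages mean_reward - std_reward across seeds.

```python
from statistics import mean


def averaged_result(model, env_id="LunarLander-v2", seeds=range(10),
                    n_envs=1, n_eval_episodes=10):
    """Hypothetical helper: average mean_reward - std_reward over several seeds.

    Returns (average_result, per_seed_results).
    """
    # Imported here so the module can be loaded without SB3 installed.
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.evaluation import evaluate_policy

    results = []
    for seed in seeds:
        # A differently seeded environment for every evaluation run.
        env = make_vec_env(env_id, n_envs=n_envs, seed=seed)
        mean_reward, std_reward = evaluate_policy(
            model, env, n_eval_episodes=n_eval_episodes, deterministic=True
        )
        env.close()
        results.append(mean_reward - std_reward)
    return mean(results), results
```

With the defaults, this evaluates 10 episodes on each of 10 seeds, so a single lucky or unlucky seed has limited influence on the reported number.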
Usage (with Stable-baselines3)
import gymnasium as gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env_id = "LunarLander-v2"

# Download the model checkpoint from the Hugging Face Hub
model_fp = load_from_hub(
    "jostyposty/drl-course-unit-01-lunar-lander-v2",
    "ppo-LunarLander-v2_010_000_000_hf_defaults.zip",
)
model = PPO.load(model_fp, print_system_info=True)

# Evaluate the policy over 10 episodes in a Monitor-wrapped environment
eval_env = Monitor(gym.make(env_id))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"results: {mean_reward - std_reward:.2f}")
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")