SB3 PPO with 16 vectorized envs, trained for ~9_000_000 timesteps. mean_reward=163 +/- 103. Training for an additional 50_000_000 timesteps resulted in a lower reward at evaluation.
28a0b97