PPO learns Mario, almost beats Bowser
A clean PyTorch implementation of PPO that clears 31 of 32 Super Mario Bros levels, with the author admitting level 8-4 still wins.

What it does
Trains a reinforcement-learning agent to play Super Mario Bros using Proximal Policy Optimization (PPO) in PyTorch. You pick a world and stage, set a learning rate, and run train.py. The author provides a Dockerfile for GPU training, plus test.py to render the results to an MP4.
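For readers new to PPO, its core is a clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval so that one update cannot move the policy too far. The sketch below is a plain-Python, per-sample illustration of that standard objective, not the repository's code. The function name ppo_clip_loss and the clip_eps value of 0.2 are assumptions for illustration; a PyTorch implementation would express the same arithmetic with tensor operations.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss (a value to minimise) over a batch.

    clip_eps=0.2 is an assumed, commonly used value, not taken from the repo.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimisers minimise, while the surrogate is maximised.
    return -total / len(advantages)
```

With a positive advantage, the clip removes any incentive to push the ratio above 1 + clip_eps; with a negative advantage, it removes the incentive to push it below 1 - clip_eps.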
The interesting bit
The author previously cleared only 19/32 levels with A3C. Switching to PPO raised that to 31/32, and the README candidly admits that the one missing level (8-4) is a maze puzzle the agent still can’t solve. The fix for stuck levels? Brute-force learning-rate search, including one success at 7e-5 after 70 failures.
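A brute-force learning-rate search like the one described above could be scripted roughly as follows. This is a minimal sketch that only builds the command lines, reusing the --world, --stage, and --lr flags from the CLI example. The helper name sweep_commands and the candidate list are hypothetical, and the call that would actually launch training is left commented out.

```python
def sweep_commands(world, stage, learning_rates, script="train.py"):
    """Build one training command (as an argument list) per learning rate."""
    return [
        ["python", script,
         "--world", str(world),
         "--stage", str(stage),
         "--lr", str(lr)]
        for lr in learning_rates
    ]


if __name__ == "__main__":
    # Candidate values are illustrative; only 1e-4 and 7e-5 appear in the text.
    for cmd in sweep_commands(world=1, stage=3, learning_rates=[1e-4, 7e-5]):
        print(" ".join(cmd))
        # To launch each run in turn (assumes train.py is in the working dir):
        # import subprocess; subprocess.run(cmd, check=True)
```

Each run still has to be judged by hand or by a separate success check, since this sketch does not detect whether the agent cleared the level.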
Key highlights
- 31 of 32 levels cleared; only the maze level 8-4 remains undefeated
- Direct comparison to the author’s earlier A3C implementation (19/32 levels)
- Docker support with documented rendering bug and workaround
- Simple CLI: python train.py --world 5 --stage 2 --lr 1e-4
- Test mode outputs MP4 videos for review
Caveats
- Docker training requires manually commenting out env.render() to avoid a rendering bug
- The author notes some levels need extensive learning-rate tuning (70 failed attempts on level 1-3)
- No code details on network architecture or reward shaping in the README
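Instead of commenting out env.render() by hand for Docker runs, the call could be guarded by a switch. The sketch below is one hypothetical way to do that and is not part of the repository: the helper maybe_render and the MARIO_RENDER environment variable are invented names, and env is assumed to be any object with a render() method.

```python
import os


def maybe_render(env, enabled=None):
    """Call env.render() only when rendering is enabled.

    If `enabled` is None, fall back to the (hypothetical) MARIO_RENDER
    environment variable; any value other than "0" enables rendering.
    """
    if enabled is None:
        enabled = os.environ.get("MARIO_RENDER", "1") != "0"
    if enabled:
        env.render()
    return enabled
```

A headless container would then set MARIO_RENDER=0 and the training code would call maybe_render(env) wherever it currently calls env.render().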
Verdict
Worth a look if you want a working, reproducible PPO baseline for NES emulation. Skip it if you need a fully general RL framework: this is tightly coupled to Mario and the gym-super-mario-bros environment.