A3C goes brrr: GPU-hybrid training that cut Atari days to 10 minutes
A PyTorch A3C implementation that keeps per-agent networks on GPU but shoves updates to a CPU-shared model via Hogwild, trading elegance for raw speed.

What it does
Implements Asynchronous Advantage Actor-Critic (A3C) with LSTM for Atari 2600 games in OpenAI Gym. The repo ships evaluation scripts for classics like Breakout, Pong, and Space Invaders (the trained models it once bundled have since been removed; see Caveats), plus TensorBoard logging and a distributed step-size training mode that lets different workers use different rollout lengths simultaneously.
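To make the "advantage" part concrete, here is a minimal sketch of how a worker could turn one rollout into n-step returns and advantages. The function name, the toy numbers, and the plain n-step estimator are assumptions for illustration; the repo's own loss code may differ.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Return (returns, advantages) for one rollout.

    rewards[t] and values[t] belong to step t; bootstrap_value is the
    critic's estimate for the state after the last step (0.0 if terminal).
    """
    returns = []
    running = bootstrap_value
    # Walk the rollout backwards, accumulating the discounted return.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    # Advantage: how much better the return was than the critic predicted.
    advantages = [ret - value for ret, value in zip(returns, values)]
    return returns, advantages


# Toy rollout of length 3 that ends the episode with a reward of 1.
returns, advantages = n_step_advantages(
    rewards=[0.0, 0.0, 1.0],
    values=[0.5, 0.6, 0.7],
    bootstrap_value=0.0,
    gamma=0.5,
)
print(returns)  # [0.25, 0.5, 1.0]
```

A longer rollout gives the return more real rewards and less bootstrapped estimate, which is the knob the distributed step-size mode varies per worker.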
The interesting bit
The author coins “A3G”: a deliberately lopsided architecture where each worker’s network lives on GPU, but the shared model stays on CPU. Workers briefly convert their models to CPU, update the shared model asynchronously without locks (Hogwild-style), then scurry back to GPU. It’s not pretty, but the README claims Pong converges in 10 minutes on 4× V100s and a 20-core CPU, down from “days.”
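The GPU-worker, CPU-shared pattern described above can be sketched in a few lines of PyTorch. This is a single-process illustration under stated assumptions, not the repo's code: `make_net`, `copy_grads_to_shared`, and `worker_step` are hypothetical names, the network is a tiny stand-in, and the loss is plain MSE rather than an actor-critic loss. It falls back to CPU when no GPU is present.

```python
import torch
import torch.nn as nn

# Worker device: GPU when present, otherwise CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)


def make_net():
    # Tiny stand-in for the actor-critic network (hypothetical shape).
    return nn.Linear(4, 2)


def copy_grads_to_shared(local_model, shared_model):
    # Hand the worker's gradients to the CPU-resident shared parameters.
    for local_p, shared_p in zip(local_model.parameters(),
                                 shared_model.parameters()):
        shared_p.grad = local_p.grad.detach().cpu().clone()


def worker_step(local_model, shared_model, optimizer, inputs, targets):
    # 1. Pull the latest shared weights onto the worker's device.
    local_model.load_state_dict(shared_model.state_dict())
    # 2. Forward and backward entirely on the worker's device.
    preds = local_model(inputs.to(device))
    loss = nn.functional.mse_loss(preds, targets.to(device))
    local_model.zero_grad()
    loss.backward()
    # 3. Move gradients to CPU and update the shared model; no lock is taken.
    copy_grads_to_shared(local_model, shared_model)
    optimizer.step()
    return loss.item()


shared_model = make_net()
shared_model.share_memory()  # parameters live in shared memory on CPU
local_model = make_net().to(device)
optimizer = torch.optim.SGD(shared_model.parameters(), lr=0.1)

inputs, targets = torch.randn(8, 4), torch.randn(8, 2)
losses = [worker_step(local_model, shared_model, optimizer, inputs, targets)
          for _ in range(20)]
```

In a real Hogwild setup, several processes started with `torch.multiprocessing` would each run `worker_step` against the same `shared_model`, accepting occasional overwritten updates in exchange for never waiting on a lock.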
Key highlights
- Holds (or held) top OpenAI Gym leaderboard scores for several Atari v0 environments, including a claimed world-record 167,330 on SpaceInvadersDeterministic-v3
- Supports both RMSProp and Adam with shared statistics, plus optional non-shared optimizers
- Distributed step-size training lets you pass a list like --distributed-step-size 16 32 64 to vary rollout lengths across workers
- Includes a separate continuous-action variant (linked as a3c_continuous) that reportedly solved BipedWalkerHardcore-v3
- README warns PyTorch 2.0 has a GPU memory bug during backward() on training processes; downgrade if needed
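As a rough illustration of the distributed step-size option, the sketch below parses a list-valued flag and hands each worker a rollout length. Only the flag name and the example values come from the README as quoted above; the `--workers` flag, the defaults, and the round-robin assignment by worker rank are assumptions, and the repo may map lengths to workers differently.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical worker-count flag, used only for this sketch.
parser.add_argument("--workers", type=int, default=6)
# nargs="+" collects one or more integers into a list.
parser.add_argument("--distributed-step-size", type=int, nargs="+",
                    default=[20])


def rollout_length_for(rank, step_sizes):
    # Assumed policy: cycle through the list by worker rank.
    return step_sizes[rank % len(step_sizes)]


args = parser.parse_args(["--distributed-step-size", "16", "32", "64"])
assignment = {rank: rollout_length_for(rank, args.distributed_step_size)
              for rank in range(args.workers)}
print(assignment)  # {0: 16, 1: 32, 2: 64, 3: 16, 4: 32, 5: 64}
```

Mixing short and long rollouts means some workers push frequent, noisier updates while others push rarer ones based on longer returns.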
Caveats
- Trained models were removed from the repo to save space, so you’ll need to train your own or hunt through git history
- README is sparse on reproducibility details: the exact PyTorch version, the CUDA version, and whether the 10-minute Pong claim holds on modern hardware are all unclear
- The “world record” scores and leaderboard links point to the now-defunct OpenAI Gym evaluations site, so they’re unverifiable
Verdict
Worth a look if you’re researching A3C variants or need a fast, hackable Atari baseline in PyTorch. Skip if you want clean, production-ready distributed RL — this is research code with battle scars.