MineDojo/NitroGen

A gaming bot that learned to mash buttons by watching YouTube

NitroGen is a 500M-parameter model trained on internet gameplay videos to predict gamepad actions from raw pixels across multiple games.

2k stars · Python · Agents · ML Frameworks · Learning
Velocity (7d): +12 ★/day · Trend: steady

What it does

NitroGen watches your game screen and presses buttons. It’s a single model trained via behavior cloning on what the authors call the largest video-action gameplay dataset scraped from internet videos. You point it at a Windows executable, it hooks into the process, and it plays. The inference server can run on Linux, but the game itself must live on Windows 11 with Python ≥ 3.12.
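
To make "behavior cloning" concrete, here is a minimal sketch: supervised learning on (observation, action) pairs recorded from a demonstrator. Everything in it is an illustrative assumption. The two-number observation, the rule-based `expert_action`, and the logistic-regression policy are toy stand-ins, not NitroGen's architecture, data format, or training code.

```python
import math
import random


def expert_action(obs):
    """Hypothetical demonstrator: press the button (1) when the first feature is positive."""
    return 1 if obs[0] > 0 else 0


def make_demonstrations(n, seed=0):
    """Record (observation, action) pairs, standing in for labeled gameplay video."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        obs = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        demos.append((obs, expert_action(obs)))
    return demos


def predict_prob(weights, bias, obs):
    """Probability that the policy presses the button for this observation."""
    z = bias + sum(w * x for w, x in zip(weights, obs))
    return 1.0 / (1.0 + math.exp(-z))


def train_bc(demos, epochs=200, lr=0.5):
    """Full-batch gradient descent on the cross-entropy between policy and demonstrator."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(demos)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for obs, action in demos:
            err = predict_prob(weights, bias, obs) - action
            for i, x in enumerate(obs):
                grad_w[i] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def policy(weights, bias, obs):
    """Greedy action from the cloned policy."""
    return 1 if predict_prob(weights, bias, obs) >= 0.5 else 0


demos = make_demonstrations(200)
weights, bias = train_bc(demos)
agreement = sum(policy(weights, bias, o) == a for o, a in demos) / len(demos)
print(f"agreement with demonstrator: {agreement:.2f}")
```

The policy never receives a reward signal; it only learns to reproduce what the demonstrator did, which is why the quality and coverage of the demonstration data matter so much for this approach.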

The interesting bit

The honesty is refreshing: the authors explicitly call this a “fast-reacting system-1 sensory model” with no memory, no planning, and no ability to improve itself. It’s essentially a very expensive reflex arc — a 500M-parameter DiT that sees only the last frame and spits out a gamepad action. The ambition is clearly “foundation model for generalist agents,” but the current reality is more “sophisticated twitch response.”
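
The "reflex arc" description amounts to a control loop in which the policy is a pure function of the current frame. The sketch below illustrates that shape only. The `GamepadAction` fields, the brightness-based `reflex_policy`, and the tiny list-of-lists frames are hypothetical placeholders, not NitroGen's model or its interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GamepadAction:
    """Hypothetical two-control gamepad: one stick axis and one button."""
    stick_x: float
    jump: bool


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def reflex_policy(frame):
    """Stand-in for the model: maps a single frame to an action, with no memory.

    Steers toward the brighter half of the frame and jumps when the average
    of the two half-frame means exceeds 128.
    """
    width = len(frame[0])
    left = mean(px for row in frame for px in row[: width // 2])
    right = mean(px for row in frame for px in row[width // 2:])
    if right > left:
        stick_x = 1.0
    elif left > right:
        stick_x = -1.0
    else:
        stick_x = 0.0
    return GamepadAction(stick_x=stick_x, jump=(left + right) / 2 > 128)


def run_loop(frames, policy):
    """Reactive loop: each action depends only on the frame seen at that step."""
    actions = []
    for frame in frames:
        actions.append(policy(frame))  # no history or plan is passed in
    return actions


dark_left = [[0, 0, 200, 200], [0, 0, 200, 200]]
bright = [[250, 250, 250, 250], [250, 250, 250, 250]]
actions = run_loop([dark_left, bright, dark_left], reflex_policy)
for step, action in enumerate(actions):
    print(step, action)
```

Because nothing carries over between steps, the first and third frames produce identical actions even though a different frame came between them. That is the practical meaning of "no memory, no planning": anything not visible in the current frame cannot influence the next button press.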

Key highlights

  • Single model architecture (DiT) targeting multiple games from pixel input alone
  • Trained exclusively on internet gameplay videos, no synthetic rollouts or RL
  • Supports post-training adaptation to unseen games (though not zero-shot play)
  • Open weights and dataset available on HuggingFace
  • Simple two-script workflow: serve.py for inference, play.py for execution

Caveats

  • Windows-only for the game runtime; Linux can host the inference server but not the game itself
  • No game environments included — bring your own licensed copies
  • The model cannot plan, remember, or finish games end-to-end; it’s frame-to-frame reactive only
  • “Not an official NVIDIA product,” even though the HuggingFace namespace suggests otherwise

Verdict

Worth a look if you’re researching generalist agent architectures or behavior cloning at scale. Skip it if you want an agent that actually finishes a level: this one is still learning to walk, quite literally one frame at a time.
