A Minecraft-like benchmark where RL agents still score under 20%
A single 2D survival environment designed to test a wide spectrum of agent capabilities (exploration, reasoning, and long-term credit assignment) in one training run instead of dozens of separate ones.

What it does
Crafter is a lightweight, 2D open-world survival game built for reinforcement learning research. Agents forage, craft tools, fight monsters, and unlock 22 semantically meaningful achievements, all through a standard Gym interface with 64×64 pixel image observations. The twist: it evaluates many skills in one environment, so you don’t need a server farm full of separate Atari runs to benchmark general capabilities.
The interesting bit
The scoring is deliberately punishing. Success rates across the 22 achievements are combined into a geometric mean, so a near-zero rate on the hard achievements drags the whole score down and agents can’t game the leaderboard by mastering only the easy stuff. Even the best published RL algorithms score under 20%; human experts score 50.5%. It’s a rare benchmark where the gap between machine and human performance is both visible and measured in detail.
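To see why the geometric mean is punishing, here is a minimal sketch of the score. It assumes the offset form from the Crafter paper as I understand it, S = exp(mean(ln(1 + s_i))) - 1 with each success rate s_i in percent (0 to 100). The function names and the two example agents are made up for illustration and are not real leaderboard entries.

```python
import math


def crafter_score(success_rates_percent):
    """Offset geometric mean of per-achievement success rates (in percent, 0-100).

    Assumed form: exp(mean(ln(1 + s_i))) - 1. The +1 offset keeps the score
    defined when an achievement is never unlocked (s_i = 0).
    """
    if not success_rates_percent:
        raise ValueError("need at least one success rate")
    log_sum = sum(math.log(1.0 + s) for s in success_rates_percent)
    return math.exp(log_sum / len(success_rates_percent)) - 1.0


def arithmetic_mean(values):
    """Plain average, shown only for comparison with the geometric mean."""
    return sum(values) / len(values)


# Hypothetical agents (illustrative numbers only):
# a specialist that always unlocks 5 achievements and never the other 17,
# and a generalist that unlocks every achievement 20% of the time.
specialist = [100.0] * 5 + [0.0] * 17
generalist = [20.0] * 22

for name, rates in [("specialist", specialist), ("generalist", generalist)]:
    print(f"{name}: arithmetic mean = {arithmetic_mean(rates):.2f}, "
          f"score = {crafter_score(rates):.2f}")
```

Under these assumptions the specialist wins on the plain average but ends up with a score far below the generalist's, which is the behavior the benchmark is after.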
Key highlights
- Single environment tests exploration, representation learning, and long-term reasoning simultaneously
- 1M step evaluation budget; observations are just 64×64 RGB images
- 17 discrete actions; reward is sparse (+1 the first time an achievement is unlocked in an episode, -0.1 or +0.1 for each health point lost or regained)
- Human-playable through a pygame-based GUI with full keyboard controls
- Baselines and pre-computed scores available in JSON; separate repo for baseline implementations
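The reward rule in the highlights can be sketched in a few lines. This is an illustration of the rule as described here, not the environment's own code: the function name `step_reward`, the per-achievement +1, and the linear scaling with health points are assumptions.

```python
def step_reward(achieved_this_step, unlocked_so_far, health_delta):
    """Reward for one step under the rule described above (a sketch).

    achieved_this_step: names of achievements triggered on this step
    unlocked_so_far:    set of achievements already unlocked this episode
                        (updated in place)
    health_delta:       health points gained (+) or lost (-) on this step
    """
    # Only achievements not yet unlocked in this episode pay out.
    newly_unlocked = set(achieved_this_step) - unlocked_so_far
    unlocked_so_far |= newly_unlocked
    # Assumed: +1 per new achievement, 0.1 per health point gained or lost.
    return 1.0 * len(newly_unlocked) + 0.1 * health_delta


# Illustrative episode fragment.
unlocked = set()
print(step_reward({"collect_wood"}, unlocked, 0))   # first unlock pays out
print(step_reward({"collect_wood"}, unlocked, 0))   # repeat pays nothing
print(step_reward(set(), unlocked, -1))             # lost one health point
```

Because repeats pay nothing, total episode reward tracks how many distinct achievements the agent reaches rather than how often it repeats an easy one.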
Caveats
- The “External Knowledge” scoreboard includes LLM-assisted agents with unclear comparability (some use zero environment steps, others 5M)
- Several top entries are marked closed-source, limiting reproducibility
Verdict
Worth a look if you’re building agents that need to do more than master a single task. Skip it if you want established, solved environments—Crafter is explicitly designed to be hard.