danijar/crafter

A Minecraft-like benchmark where RL agents still score under 20%

A single 2D survival environment designed to test the full spectrum of agent capabilities—exploration, reasoning, and long-term credit assignment—without training dozens of separate models.

553 stars · Python · Agents · Domain Apps
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

Crafter is a lightweight, 2D open-world survival game built for reinforcement learning research. Agents forage, craft tools, fight monsters, and unlock 22 semantically meaningful achievements, all within a standard Gym interface and 64×64 pixel observations. The twist: it evaluates many skills in one environment, so you don’t need a server farm full of separate Atari runs to benchmark generalization.

The interesting bit

The scoring is deliberately punishing. Success rates across the 22 achievements are combined into a geometric mean rather than a plain average, so low success rates on the hard achievements pull the score down sharply and agents can’t game the leaderboard by mastering only the easy stuff. Even the best published RL algorithms score under 20%; humans hit 50.5%. It’s a rare benchmark where the gap between machine and human performance is both visible and measured in detail.
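To make the effect of the geometric mean concrete, here is a minimal sketch. It assumes the score is computed in log space with success rates given as percentages, S = exp(mean(ln(1 + s_i))) - 1, which is how the Crafter paper is usually summarized; the function name `crafter_score` and the two example agents are hypothetical and not taken from the repository.

```python
import math


def crafter_score(success_rates):
    """Geometric-mean style score over per-achievement success rates.

    success_rates: percentages in [0, 100], one per achievement.
    Assumed formula: exp(mean(ln(1 + s_i))) - 1.
    """
    if not success_rates:
        raise ValueError("need at least one success rate")
    mean_log = sum(math.log(1 + s) for s in success_rates) / len(success_rates)
    return math.exp(mean_log) - 1


# Two made-up agents with the same arithmetic mean (50%).
lopsided = [100.0] * 11 + [0.0] * 11   # masters half, never unlocks the rest
balanced = [50.0] * 22                 # unlocks everything half the time

lopsided_score = crafter_score(lopsided)
balanced_score = crafter_score(balanced)

print(f"lopsided: {lopsided_score:.2f}%")
print(f"balanced: {balanced_score:.2f}%")
```

Under this assumed formula the lopsided agent lands at roughly 9% while the balanced one stays at 50%, even though a plain average would rate them identically.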

Key highlights

  • Single environment tests exploration, representation learning, and long-term reasoning simultaneously
  • 1M step evaluation budget; observations are just 64×64 RGB images
  • 17 discrete actions; reward is sparse (+1 the first time each achievement is unlocked in an episode, −0.1 per health point lost, +0.1 per health point regained)
  • Human-playable via a pygame GUI with full keyboard controls
  • Baselines and pre-computed scores available in JSON; separate repo for baseline implementations
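The reward scheme in the highlights above can be sketched in a few lines. This is an illustration of the stated rule only (+1 per achievement, ±0.1 per health point), not the environment's actual code; `step_reward` and its parameters are hypothetical names.

```python
def step_reward(newly_unlocked, health_delta):
    """Reward for one step under the rule described in the highlights.

    newly_unlocked: number of achievements unlocked on this step.
    health_delta:   change in health points (negative when damaged).
    """
    return 1.0 * newly_unlocked + 0.1 * health_delta


# Hypothetical steps: nothing happens, an achievement is unlocked,
# and an achievement is unlocked while losing one health point.
idle = step_reward(0, 0)
unlock = step_reward(1, 0)
unlock_while_hurt = step_reward(1, -1)

print(idle, unlock, unlock_while_hurt)
```

Most steps fall into the first case, which is why the signal counts as sparse: an agent can act for a long time before any term is non-zero.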

Caveats

  • The “External Knowledge” scoreboard includes LLM-assisted agents with unclear comparability (some use zero environment steps, others 5M)
  • Several top entries are marked closed-source, limiting reproducibility

Verdict

Worth a look if you’re building agents that need to do more than master a single task. Skip it if you want established, solved environments—Crafter is explicitly designed to be hard.
