karpathy/autoresearch

Sleep while your GPU does the research

An AI agent that edits, trains, and evaluates LLM code overnight so you don't have to.

85.5k stars · Python · Agents · LLMOps · Eval
Velocity (7d): +917 ★/day · Trend: steady

What it does

This repo sets up a single-GPU LLM training loop and hands the keyboard to an AI agent. You write instructions in program.md; the agent edits train.py, runs a 5-minute experiment, checks whether validation loss improved, and repeats. The goal is to wake up to a log of ~100 overnight experiments and, with luck, a better model.
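The edit, run, compare, repeat cycle described above can be sketched as a simple keep-if-better loop. This is a toy illustration, not the repo's code: `run_experiments`, `propose`, and `evaluate` are hypothetical names, the "edit" is a random tweak to a learning rate rather than an agent rewriting train.py, and the "run" is a cheap stand-in function rather than a 5-minute training job. Discarding changes that do not improve the metric is also an assumption about how the loop behaves.

```python
import random
from typing import Callable


def run_experiments(
    propose: Callable[[dict], dict],
    evaluate: Callable[[dict], float],
    baseline: dict,
    n_experiments: int,
) -> tuple[dict, float, list[tuple[int, float, bool]]]:
    """Keep a candidate only if it lowers the validation metric."""
    best_cfg = dict(baseline)
    best_loss = evaluate(best_cfg)
    log = []
    for i in range(n_experiments):
        candidate = propose(best_cfg)   # stand-in for the agent editing train.py
        loss = evaluate(candidate)      # stand-in for one fixed-length training run
        improved = loss < best_loss
        if improved:
            best_cfg, best_loss = candidate, loss
        log.append((i, loss, improved))
    return best_cfg, best_loss, log


# Toy stand-ins (hypothetical): the "model" is a single learning rate and the
# "validation loss" is a bowl-shaped function with its minimum at lr = 3e-3.
rng = random.Random(0)


def propose(cfg: dict) -> dict:
    return {"lr": cfg["lr"] * rng.uniform(0.5, 2.0)}


def evaluate(cfg: dict) -> float:
    return 1.0 + (cfg["lr"] - 3e-3) ** 2 * 1e4


baseline = {"lr": 1e-2}
best_cfg, best_loss, log = run_experiments(propose, evaluate, baseline, 20)
```

At 5 minutes per run, 100 experiments amount to about 500 minutes of training, roughly 8 hours and 20 minutes before any overhead, which is where the "overnight" framing comes from.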

The interesting bit

The human doesn’t touch Python. You program the organization — the program.md “skill” that tells the agent how to experiment — while the agent programs the model. It’s a deliberate inversion: the researcher becomes a meta-researcher, tuning the research process rather than the hyperparameters.

Key highlights

  • Three files, total: immutable prepare.py, agent-editable train.py, human-editable program.md
  • Fixed 5-minute wall-clock runs make experiments comparable regardless of what the agent changes (architecture, batch size, model depth)
  • Metric is val_bpb (validation bits per byte), so vocabulary changes don’t skew comparisons
  • Built on a simplified single-GPU nanochat stack; no distributed training, no config sprawl
  • Community forks already exist for macOS, Windows, AMD, and MLX
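To see why a per-byte metric is insensitive to vocabulary changes, here is a minimal sketch of bits per byte under its usual definition: total cross-entropy over the validation text, converted from nats to bits, divided by the number of UTF-8 bytes in that text. The function name `bits_per_byte` and the numbers are illustrative assumptions; the repo's own val_bpb computation may differ in detail.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical example: the same 1000-byte validation text under two tokenizers.
# Tokenizer A: 250 tokens at a mean loss of 2.0 nats per token.
# Tokenizer B: 500 tokens at a mean loss of 1.0 nats per token.
text_bytes = 1000
bpb_a = bits_per_byte(250 * 2.0, text_bytes)
bpb_b = bits_per_byte(500 * 1.0, text_bytes)
```

The per-token losses differ by a factor of two, yet both models spend the same total number of nats on the same bytes, so `bpb_a` and `bpb_b` are equal. Comparing per-token loss here would wrongly favor tokenizer B.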

Caveats

  • Requires an NVIDIA GPU; Karpathy is explicitly unsure about taking on CPU/MPS support himself
  • Results aren’t comparable across different hardware platforms: with a fixed 5-minute budget, a faster GPU completes more training steps than a slower one in the same run
  • The default program.md is intentionally bare-bones — you’ll need to iterate on it yourself
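The hardware caveat follows directly from budgeting by wall-clock time rather than by step count. A minimal sketch, assuming a hypothetical `train_for_budget` helper and a fake millisecond clock so the example is deterministic (none of these names come from the repo):

```python
import time
from typing import Callable


def train_for_budget(
    step_fn: Callable[[], None],
    budget: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run step_fn until the budget (in the clock's units) is used up."""
    deadline = clock() + budget
    steps = 0
    while clock() < deadline:
        step_fn()
        steps += 1
    return steps


class FakeDevice:
    """Simulated hardware: each training step costs a fixed number of milliseconds."""

    def __init__(self, ms_per_step: int) -> None:
        self.ms_per_step = ms_per_step
        self.now_ms = 0

    def step(self) -> None:
        self.now_ms += self.ms_per_step

    def clock(self) -> int:
        return self.now_ms


five_minutes_ms = 5 * 60 * 1000
fast = FakeDevice(ms_per_step=100)
slow = FakeDevice(ms_per_step=500)
fast_steps = train_for_budget(fast.step, five_minutes_ms, fast.clock)
slow_steps = train_for_budget(slow.step, five_minutes_ms, slow.clock)
```

In this simulation the faster device fits five times as many steps into the same five minutes, so the two final losses reflect different amounts of training and should not be compared directly.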

Verdict

Worth a look if you’re curious about automated experimentation and have a GPU to burn overnight. Skip it if you want production training infrastructure or need to run on non-NVIDIA hardware without community forks.
