An agent that runs the experiment instead of hallucinating the result
One-shot agents forget every failure. Arbor instead grows a living hypothesis tree, runs real experiments for each hypothesis, and keeps only what survives validation.

What it does
Arbor is a Python-based autonomous research agent that takes a benchmark directory and a goal, then iteratively tries to improve results on that benchmark. A Coordinator agent maintains an Idea Tree of hypotheses, while an Executor agent implements each one in an isolated git worktree, tests it against a dev split, and validates it on held-out data. The system merges only gains that clear a configurable margin, leaving main untouched until you approve.
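The guarded merge described above can be sketched as a simple gate. This is a minimal illustration, not Arbor's actual code: the `EvalResult` type, the `should_merge` function, the default margin, and the choice to require the margin on both splits are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Scores for one version of the code; higher is better (hypothetical type)."""
    dev_score: float
    heldout_score: float


def should_merge(baseline: EvalResult, candidate: EvalResult, margin: float = 0.01) -> bool:
    """Return True only if the candidate beats the baseline by at least
    `margin` on both the dev split and the held-out split.

    Assumption: the real tool may apply its margin differently.
    """
    dev_gain = candidate.dev_score - baseline.dev_score
    heldout_gain = candidate.heldout_score - baseline.heldout_score
    return dev_gain >= margin and heldout_gain >= margin


baseline = EvalResult(dev_score=0.70, heldout_score=0.68)
overfit = EvalResult(dev_score=0.75, heldout_score=0.69)   # gain on dev only
robust = EvalResult(dev_score=0.75, heldout_score=0.73)    # gain on both splits

print(should_merge(baseline, overfit, margin=0.02))
print(should_merge(baseline, robust, margin=0.02))
```

Requiring the gain on held-out data as well as on the dev split is what stops a change that merely overfits the dev split from being merged.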
The interesting bit
The framework treats research as tree search rather than prompt engineering: failed branches are pruned, successful ones are harvested, and insights propagate upward so the Coordinator’s next ideas inherit the context of everything that came before. It also checks idea novelty against alphaXiv prior art before burning compute. It can run fully inside Claude Code or Codex without its own API key, because in that mode it acts as a deterministic tool suite rather than an LLM client.
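One way to picture the prune, harvest, and propagate cycle is a small tree structure like the one below. It is a sketch under stated assumptions: `IdeaNode`, its fields, and its status labels are hypothetical names chosen for illustration and do not come from Arbor's codebase.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IdeaNode:
    """One hypothesis in the tree (hypothetical structure)."""
    hypothesis: str
    parent: Optional[IdeaNode] = field(default=None, repr=False, compare=False)
    children: List[IdeaNode] = field(default_factory=list)
    status: str = "open"  # "open", "pruned", or "harvested"
    insights: List[str] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> IdeaNode:
        """Attach a refinement of this hypothesis as a child node."""
        child = IdeaNode(hypothesis, parent=self)
        self.children.append(child)
        return child

    def record(self, succeeded: bool, insight: str) -> None:
        """Mark the outcome, then copy the insight to this node and every ancestor."""
        self.status = "harvested" if succeeded else "pruned"
        node: Optional[IdeaNode] = self
        while node is not None:
            node.insights.append(insight)
            node = node.parent

    def open_leaves(self) -> List[IdeaNode]:
        """Untested hypotheses that do not sit under a pruned branch."""
        if self.status == "pruned":
            return []
        if not self.children:
            return [self] if self.status == "open" else []
        leaves: List[IdeaNode] = []
        for child in self.children:
            leaves.extend(child.open_leaves())
        return leaves


root = IdeaNode("improve benchmark score")
first = root.add_child("idea A")
second = root.add_child("idea B")
first.record(succeeded=False, insight="idea A regressed on the dev split")

print(root.insights)                                # the failure is visible at the root
print([n.hypothesis for n in root.open_leaves()])   # only untested ideas remain
```

Because every recorded insight is appended along the path to the root, whatever generates the next hypothesis can read the root's `insights` list and see what has already failed.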
Key highlights
- Hypothesis-tree refinement: maintains cumulative state across long-horizon tasks instead of one-shot attempts
- Real experiment discipline: isolated git worktrees, dev/test splits, and guarded merges
- Literature-grounded ideation: keyless alphaXiv search vets novelty before execution
- Flexible backends: works with Anthropic, OpenAI, LiteLLM-compatible models, or as a keyless skill inside existing coding agents
- Live dashboard and optional human-in-the-loop review for steerable autonomy
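The isolated-worktree discipline in the list above can be approximated with plain git commands. The sketch below is illustrative only: the function names, branch name, and paths are hypothetical, and nothing here reflects how Arbor itself invokes git. It relies only on the standard `git worktree add -b` and `git worktree remove` subcommands.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def worktree_add_cmd(repo: Path, branch: str, path: Path) -> list[str]:
    """Build the command that creates a new branch checked out in its own directory."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def worktree_remove_cmd(repo: Path, path: Path) -> list[str]:
    """Build the command that deletes the worktree directory, even if it has changes."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)]


def run_in_worktree(repo: Path, branch: str, path: Path, experiment_cmd: list[str]) -> int:
    """Run one experiment in a throwaway worktree and return its exit code.

    The main checkout is never modified; the worktree directory is removed
    afterwards, while the branch itself is kept for later inspection.
    """
    subprocess.run(worktree_add_cmd(repo, branch, path), check=True)
    try:
        return subprocess.run(experiment_cmd, cwd=path, check=False).returncode
    finally:
        subprocess.run(worktree_remove_cmd(repo, path), check=True)


# Show the commands without executing them (paths are placeholders).
print(worktree_add_cmd(Path("my-repo"), "idea-1", Path("scratch/idea-1")))
```

Calling `run_in_worktree` requires a real git repository; the command builders are kept separate so they can be inspected or logged without touching the filesystem.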
Caveats
- The README promises “general-purpose optimization” but never states hard limits or failure modes, so it is unclear what the tool cannot do
- Full tree/eval/merge discipline and checkpointing require the native CLI runtime; the keyless harness integration offers a narrower tool suite
Verdict
Try it if you have a long-horizon optimization problem with a clear metric and want an agent that remembers what failed. Look elsewhere if you need a solution that requires no API access or host agent at all, since even the keyless mode leans on Claude Code or Codex.