RUC-NLPIR/Arbor

An agent that runs the experiment instead of hallucinating the result

Because one-shot agents forget every failure, Arbor grows a living hypothesis tree that runs real experiments and only keeps what survives validation.

★ 628 stars · Python · Agents · LLMOps · Eval

What it does

Arbor is a Python-based autonomous research agent that takes a benchmark directory and a goal, then iteratively works to improve performance on that benchmark. A Coordinator agent maintains an Idea Tree of hypotheses, while an Executor agent implements each one in an isolated git worktree, tests it against a dev split, and validates it on held-out data. The system merges only gains that clear a configurable margin, and it leaves the main branch untouched until you approve.
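
The merge guard described above can be pictured with a minimal sketch. This is not Arbor's code: the `Result` class, the `should_merge` function, the default margin, and the choice to require the margin on both the dev split and the held-out data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Scores for one experiment (hypothetical structure, higher is better)."""
    dev_score: float
    test_score: float


def should_merge(baseline: Result, candidate: Result, margin: float = 0.01) -> bool:
    """Accept a candidate only if its gain clears the margin on both splits.

    Assumption: the margin applies to the dev split and the held-out data alike.
    """
    dev_gain = candidate.dev_score - baseline.dev_score
    test_gain = candidate.test_score - baseline.test_score
    return dev_gain >= margin and test_gain >= margin


baseline = Result(dev_score=0.70, test_score=0.68)
solid = Result(dev_score=0.75, test_score=0.72)       # improves on both splits
overfit = Result(dev_score=0.80, test_score=0.685)    # dev gain does not hold up

print(should_merge(baseline, solid, margin=0.02))
print(should_merge(baseline, overfit, margin=0.02))
```

The second candidate shows why the held-out check matters: a large dev gain alone would not be merged under this rule.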

The interesting bit

The framework treats research as tree search rather than prompt engineering: failed branches are pruned, successful ones are harvested, and insights propagate upward so the Coordinator’s next ideas inherit the context of everything that came before. It also checks idea novelty against alphaXiv prior art before burning compute. It can run fully inside Claude Code or Codex without its own API key, because in that mode it acts as a deterministic tool suite rather than an LLM client.
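
As a rough illustration of the prune, harvest, and propagate idea, the sketch below models a hypothesis tree in plain Python. The `IdeaNode` class, its field names, and the status labels are hypothetical and say nothing about how Arbor actually stores its Idea Tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class IdeaNode:
    """One hypothesis in the tree (illustrative structure, not Arbor's)."""
    hypothesis: str
    parent: Optional["IdeaNode"] = field(default=None, repr=False)
    children: List["IdeaNode"] = field(default_factory=list)
    status: str = "open"  # "open", "pruned", or "harvested"
    insights: List[str] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "IdeaNode":
        """Attach a refinement of this hypothesis as a new branch."""
        child = IdeaNode(hypothesis, parent=self)
        self.children.append(child)
        return child

    def record(self, succeeded: bool, insight: str) -> None:
        """Mark the outcome and copy the insight to this node and every ancestor."""
        self.status = "harvested" if succeeded else "pruned"
        node: Optional[IdeaNode] = self
        while node is not None:
            node.insights.append(insight)
            node = node.parent


root = IdeaNode("improve retrieval accuracy")
a = root.add_child("larger chunk size")
b = root.add_child("rerank top-k results")
a.record(False, "larger chunks hurt recall")
b.record(True, "reranking helps on long queries")
print(root.insights)  # context available when proposing the next idea
```

Because failed branches also leave an insight at the root, the next idea is proposed with knowledge of what did not work, which is the property one-shot attempts lack.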

Key highlights

Hypothesis-tree refinement: maintains cumulative state across long-horizon tasks instead of one-shot attempts
Real experiment discipline: isolated git worktrees, dev/test splits, and guarded merges
Literature-grounded ideation: keyless alphaXiv search vets novelty before execution
Flexible backends: works with Anthropic, OpenAI, LiteLLM-compatible models, or as a keyless skill inside existing coding agents
Live dashboard and optional human-in-the-loop review for steerable autonomy
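
The worktree isolation mentioned in the highlights maps onto standard git commands. The sketch below only builds those commands; the helper names, branch naming, and paths are hypothetical, and nothing here is taken from Arbor's implementation.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_cmd(repo: Path, branch: str, path: Path) -> List[str]:
    """Build the command that creates a new branch checked out in its own directory."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def worktree_remove_cmd(repo: Path, path: Path) -> List[str]:
    """Build the command that discards a worktree, e.g. after a failed experiment."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)]


def run(cmd: List[str]) -> None:
    """Execute a git command and raise if it fails."""
    subprocess.run(cmd, check=True)


# Hypothetical usage: one branch and one directory per hypothesis.
add = worktree_add_cmd(Path("repo"), "idea-1", Path("wt-idea-1"))
print(" ".join(add))
```

Each experiment edits files only inside its own directory, so a failed hypothesis can be dropped without touching the main checkout.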

Caveats

The README promises “general-purpose optimization” but never states hard limits or failure modes, so it is unclear what Arbor cannot do
Full tree/eval/merge discipline and checkpointing require the native CLI runtime; the keyless harness integration offers a narrower tool suite

Verdict

Try it if you have a long-horizon optimization problem with a clear metric and want an agent that remembers what failed. Look elsewhere if you need a solution that requires no API access or host agent at all, since even the keyless mode leans on Claude Code or Codex.