mims-harvard/AutoScientists

AI agents that argue before burning GPU hours

A decentralized team of Claude subagents peer-reviews hypotheses, self-organizes around promising ideas, and shares failures so the whole system avoids redundant science.

559 stars · Python · Agents · Domain Apps
Velocity (7d): +32 ★ / day · Trend: steady

What it does

AutoScientists runs long computational experiments using multiple Claude Code subagents that coordinate through a local Node.js server called ClawInstitute. The orchestrator never trains models itself; it launches agents, lets them propose and critique experiments, and harvests the results. Agents form teams around promising hypotheses, reject weak ones before any compute is spent, and log what worked and what didn’t so other agents don’t repeat dead ends.
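
The "log what worked and what didn’t" idea can be pictured with a toy sketch. This is not AutoScientists code: `SharedLog`, `run_round`, and `fake_experiment` are hypothetical names invented here, and the real system coordinates through the ClawInstitute server rather than an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class SharedLog:
    """Hypothetical stand-in for the shared record kept by the coordination server."""
    outcomes: dict = field(default_factory=dict)  # hypothesis -> "worked" or "failed"

    def already_tried(self, hypothesis: str) -> bool:
        return hypothesis in self.outcomes

    def record(self, hypothesis: str, worked: bool) -> None:
        self.outcomes[hypothesis] = "worked" if worked else "failed"


def run_round(proposals, run_experiment, log):
    """Run each proposal that is not already logged; return the ones actually executed."""
    executed = []
    for hypothesis in proposals:
        if log.already_tried(hypothesis):
            continue  # someone already logged this outcome, so skip the rerun
        log.record(hypothesis, run_experiment(hypothesis))
        executed.append(hypothesis)
    return executed


def fake_experiment(hypothesis):
    """Toy outcome rule for illustration only; a real run would train a model."""
    return "warmup" in hypothesis


log = SharedLog()
first = run_round(["longer warmup", "bigger batch"], fake_experiment, log)
second = run_round(["bigger batch", "cosine decay"], fake_experiment, log)
print(first, second, log.outcomes)
```

In the second round "bigger batch" is skipped because its failure is already on record, which is the redundancy the shared log is meant to prevent.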

The interesting bit

The peer-review layer is the twist: agents critique each other’s proposals before any GPU fires up. That guards against the classic failure mode of single-agent systems, which is enthusiastically burning hours on a bad idea nobody questioned. The “self-organizing” part means teams form dynamically around hypotheses rather than following a rigid central plan.
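
A review gate of this kind might look like the sketch below. Everything in it is an assumption for illustration: the `review_gate` function, the approval threshold, and the three rule-based reviewers are invented here, whereas in AutoScientists the reviewers are Claude subagents and the README excerpt above does not specify the acceptance rule.

```python
def review_gate(proposal, reviewers, min_approvals):
    """Return True only if at least `min_approvals` reviewers approve the proposal."""
    approvals = sum(1 for review in reviewers if review(proposal))
    return approvals >= min_approvals


# Hypothetical rule-based reviewers standing in for LLM critics.
def has_baseline(proposal):
    return proposal.get("baseline") is not None


def within_budget(proposal):
    return proposal["gpu_hours"] <= 8


def states_metric(proposal):
    return bool(proposal.get("metric"))


reviewers = [has_baseline, within_budget, states_metric]

good = {"idea": "tune learning-rate warmup", "baseline": "default config",
        "gpu_hours": 4, "metric": "val_loss"}
weak = {"idea": "train 10x longer", "baseline": None,
        "gpu_hours": 40, "metric": "val_loss"}

# Only proposals that pass review would go on to consume compute.
approved = [p["idea"] for p in (good, weak)
            if review_gate(p, reviewers, min_approvals=2)]
print(approved)
```

The weak proposal collects a single approval and is dropped before any experiment runs; the cost of the gate is a few cheap critiques instead of a wasted training job.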

Key highlights

  • BioML-Bench: 74.4% mean leaderboard percentile across 24 biomedical ML tasks, +8.33% over prior best agent
  • nanoGPT optimization: 1.9× faster to target validation metric; 7 accepted improvements vs. 0 for single-agent baseline
  • ProteinGym: +12.5% on ACE2-Spike binding assay, +6.5% averaged across all 217 assays
  • Three bundled task families: open-ended LLM training optimization, 24 biomedical benchmarks, protein fitness prediction
  • New tasks added via two markdown files (TASK.md + LAUNCH.md) with 13 configurable hooks
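
For readers unfamiliar with the headline metric, here is one plausible way a "mean leaderboard percentile" can be computed. The definition used (share of leaderboard entries an agent matches or beats, averaged over tasks) is an assumption, not taken from the project, and the scores are made-up placeholders rather than benchmark results.

```python
def leaderboard_percentile(score, leaderboard, higher_is_better=True):
    """Percent of leaderboard entries that `score` matches or beats (assumed definition)."""
    if not higher_is_better:
        # Flip signs so that "larger is better" holds for the comparison below.
        score, leaderboard = -score, [-s for s in leaderboard]
    beaten = sum(1 for s in leaderboard if score >= s)
    return 100.0 * beaten / len(leaderboard)


# (agent score, existing leaderboard scores, higher_is_better); illustrative numbers only.
tasks = [
    (0.90, [0.50, 0.70, 0.85, 0.95], True),   # e.g. an accuracy-style metric
    (0.20, [0.10, 0.15, 0.40, 0.50], False),  # e.g. an error-style metric
]

percentiles = [leaderboard_percentile(*task) for task in tasks]
mean_percentile = sum(percentiles) / len(percentiles)
print(percentiles, mean_percentile)
```

Averaging percentiles rather than raw scores lets tasks with different metrics and scales contribute equally to a single number.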

Caveats

  • Requires Claude Code CLI (paid Anthropic product) plus Node.js 22+ and Python 3.9+
  • The ClawInstitute coordination server is an npm package with unclear provenance; README doesn’t explain why a Harvard project routes through a personal npm namespace
  • Hardware requirements vary per task and live in scattered per-task READMEs
  • The citation gives the year “2026” and arXiv ID “2605.28655” (under arXiv’s YYMM numbering, May 2026); the README doesn’t say whether this is a typo or a paper that hasn’t been posted yet

Verdict

Worth a look if you’re running multi-day computational experiments where exploration efficiency matters and you already pay for Claude Code. Skip it if you need reproducible science without proprietary LLM dependencies, or if your experiments finish in hours rather than days.
