An AI that peer-reviewed its own workshop paper — sort of
Sakana's second-generation system automates hypothesis generation, experimentation, and manuscript writing through tree search, no human templates required.

What it does
The AI Scientist-v2 is an end-to-end autonomous research agent. You feed it a topic description; it brainstorms ideas with an LLM, checks their novelty against Semantic Scholar, then runs experiments through a best-first tree search. If a branch fails, the system debugs it or abandons it. Surviving branches get analyzed, plotted, and written up into a full PDF manuscript with citations.
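To make the search loop described above concrete, here is a minimal sketch of a best-first tree search with a debug-or-abandon step. It is not the project's implementation: the names `run`, `expand`, `debug`, `max_steps`, and `max_debug` are hypothetical stand-ins, and in the real system an LLM would propose, execute, and repair experiment code rather than the plain callables assumed here.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    plan: Any                      # whatever describes one experiment attempt
    score: Optional[float] = None  # None means the attempt failed to run
    debug_attempts: int = 0


def best_first_search(root_plan, run, expand, debug, max_steps=20, max_debug=2):
    """Hypothetical callables (assumptions, not the project's API):
    run(plan)    -> score, or None if the attempt crashed
    expand(plan) -> list of child plans to try next
    debug(plan)  -> a repaired plan after a crash
    """
    tiebreak = itertools.count()   # keeps heap entries comparable on equal scores
    frontier = []                  # min-heap of (-score, tiebreak, node)
    best = None

    def evaluate(node):
        nonlocal best
        node.score = run(node.plan)
        # Debug a failing node a bounded number of times.
        while node.score is None and node.debug_attempts < max_debug:
            node.plan = debug(node.plan)
            node.debug_attempts += 1
            node.score = run(node.plan)
        if node.score is None:
            return                 # still failing: abandon this branch
        if best is None or node.score > best.score:
            best = node
        heapq.heappush(frontier, (-node.score, next(tiebreak), node))

    evaluate(Node(root_plan))
    for _ in range(max_steps):
        if not frontier:
            break
        # Always expand the highest-scoring surviving node first.
        _, _, node = heapq.heappop(frontier)
        for child_plan in expand(node.plan):
            evaluate(Node(child_plan))
    return best
```

The priority queue is what makes the search "best-first": each step expands the most promising surviving node, and branches that still fail after their debug budget never enter the queue.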
The interesting bit
Version 2 deliberately shed the human-authored templates that made v1 reliable. The trade-off: lower success rate, but broader exploration across ML domains. The system also produced what its creators claim is the first workshop paper written entirely by AI and accepted through peer review, though the README is honest that v2 “doesn’t necessarily produce better papers than v1” when you already have a strong template.
Key highlights
- Template-free ideation: perform_ideation_temp_free.py generates structured research ideas from a Markdown topic file, with novelty checking via Semantic Scholar.
- Agentic tree search: Configurable parallel workers and debug retries in bfts_config.yaml; defaults to Claude 3.5 Sonnet for experimentation.
- Multi-model pipeline: Different LLMs handle experiments, write-up, citation, review, and plot aggregation; mix OpenAI, Gemini, or AWS Bedrock models.
- Tangible output: Produces timestamped experiment logs with an interactive tree visualization (unified_tree_viz.html) and a final PDF.
- Documented costs: Roughly $15–$20 per experiment run plus ~$5 for writing, with ideation at “a few dollars.”
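As a rough budgeting aid, the documented cost figures can be combined into a low/high estimate. This is a sketch under stated assumptions: the ideation cost is only described as “a few dollars,” so the $3 default is a placeholder, and the function assumes one ideation pass followed by one write-up per experiment run. `estimate_cost` is a hypothetical helper, not part of the project.

```python
def estimate_cost(n_runs, ideation_usd=3.0, experiment_usd=(15.0, 20.0), writeup_usd=5.0):
    """Return a (low, high) USD estimate for n_runs experiment runs.

    Assumptions (not from the README): ideation is paid once and defaults
    to a placeholder of $3; every run gets its own ~$5 write-up.
    """
    low = ideation_usd + n_runs * (experiment_usd[0] + writeup_usd)
    high = ideation_usd + n_runs * (experiment_usd[1] + writeup_usd)
    return low, high


low, high = estimate_cost(3)
print(f"3 runs: ${low:.0f} to ${high:.0f}")
```

Under these assumptions, the per-run experiment cost dominates, so the total scales almost linearly with the number of ideas you take through to a manuscript.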
Caveats
- Safety warning is not boilerplate: The README explicitly warns that the system executes LLM-written code with “uncontrolled web access” and potential for “dangerous packages” — Docker sandboxing is strongly advised.
- Linux + NVIDIA only: Requires CUDA, PyTorch, and a conda stack; installation is estimated at up to an hour.
- Success is model-dependent: The FAQ notes failed runs are common, especially with weaker foundation models, and some parameters in bfts_config.yaml (like k_fold_validation) are currently unused.
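The sandboxing advice above could look something like the following sketch, which only builds a `docker run` argument list from Python and does not execute it. The image name, mounted directory, and inner command are hypothetical placeholders, not taken from the project. The sketch also leaves container networking at its default, on the assumption that the pipeline still needs to reach LLM APIs and Semantic Scholar, so it limits what generated code can do to the host rather than what it can reach on the web.

```python
from pathlib import Path
from typing import List, Sequence


def sandboxed_command(
    image: str,
    workdir: Path,
    inner_cmd: Sequence[str],
    memory: str = "32g",
    use_gpu: bool = True,
) -> List[str]:
    """Build (but do not run) a `docker run` argument list."""
    host_dir = str(Path(workdir).resolve())
    cmd = [
        "docker", "run",
        "--rm",                                  # discard the container afterwards
        "--cap-drop", "ALL",                     # drop Linux capabilities
        "--security-opt", "no-new-privileges",   # block privilege escalation
        "--memory", memory,                      # cap RAM used by generated code
        "--volume", f"{host_dir}:/workspace",    # expose only one host directory
        "--workdir", "/workspace",
    ]
    if use_gpu:
        cmd += ["--gpus", "all"]                 # requires the NVIDIA container toolkit
    cmd.append(image)
    cmd.extend(inner_cmd)
    return cmd


# Hypothetical image and script names, for illustration only.
print(" ".join(sandboxed_command("my-sandbox:local", Path("runs"), ["python", "run_pipeline.py"])))
```

Passing the resulting list to `subprocess.run` would launch the container; whether these limits are sufficient depends on your threat model, and tighter network controls would need to be added separately.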
Verdict
Worth a look if you’re researching autonomous agents or want to stress-test LLM-driven experimental design. Skip it if you need reproducible, high-yield results on a known problem; the authors themselves recommend v1 for that.