AI Scientist-v2: The Robot That Writes Its Own Papers

5 min read

⚡ TLDR

  • Writing a research paper takes months. This system does it overnight, autonomously.
  • Without it, you need a grad student, a GPU cluster, a Semantic Scholar account, and three rounds of LaTeX edits at 2am.
  • It’s best for ML researchers who want to explore open-ended ideas without manual experiment iteration.
  • Unlike v1, it drops human-authored templates entirely and uses best-first tree search to branch through experimental space.
  • You describe a topic in a Markdown file. It generates hypotheses, runs GPU experiments, writes the paper, generates citations. You get a PDF.

Part 1: The Lab on a Cron Job

My advisor told me once that the real bottleneck in research isn’t analysis. It’s iteration. You write a hypothesis. You run it. The results are ambiguous. You tweak. You run again. Three weeks later you’re not sure if you’re testing the hypothesis or the learning rate.

Most of that iteration is mechanical. Any sufficiently motivated process could do it.

AI Scientist-v2 is SakanaAI’s answer to what happens when you make that process autonomous.

Concretely, it is a Python codebase. You give it a Markdown file describing a research topic. It runs two stages: an ideation script that generates and scores research hypotheses against the Semantic Scholar literature, and a main pipeline that drives a best-first tree search (BFTS) across experimental branches. At the end, it produces a LaTeX-formatted PDF.

Not a summary. Not a literature review. A full paper, with a methods section, results, citations, and plots.

A PhD student who only sleeps when the GPU runs out of memory.

Part 2: How the Tree Grows

The core mechanism isn’t LLM prompting. It’s tree search.

When the pipeline starts, it spawns a set of “root” nodes, each representing an initial experimental approach. From each node, it expands outward: the agent modifies the code, runs it, evaluates the results, and decides which branch is worth exploring further. It uses Claude 3.5 Sonnet by default to drive this experimentation.

# Step 1: generate and score research ideas from your topic file
python ai_scientist/perform_ideation_temp_free.py \
  --workshop-file "ai_scientist/ideas/my_topic.md" \
  --model gpt-4o-2024-05-13 \
  --max-num-generations 20 \
  --num-reflections 5

That generates a JSON file of ranked research ideas. Then:

# Step 2: run the full pipeline on the ranked ideas
python launch_scientist_bfts.py \
  --load_ideas "ai_scientist/ideas/my_topic.json" \
  --load_code \
  --model_writeup o1-preview-2024-09-12 \
  --model_review gpt-4o-2024-11-20 \
  --num_cite_rounds 20

The tree search parameters live in bfts_config.yaml. You set num_workers (how many branches to explore in parallel) and steps (how many nodes to expand in total). With three workers and twenty-one steps, a run takes several hours. At the end, you get a timestamped PDF.
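A minimal excerpt of that file, as a sketch rather than the exact schema (the key names follow the repo's config, but verify the nesting in bfts_config.yaml before editing):

agent:
  num_workers: 3   # branches explored in parallel
  steps: 21        # total nodes expanded before the search stops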

The structure it searches:

Research Idea
├── Approach A (root node)
│   ├── Variant A1 → score 0.72
│   ├── Variant A2 → score 0.68 (pruned)
│   └── Variant A3 → score 0.81 (best)
├── Approach B (root node)
│   └── Variant B1 → score 0.59 (pruned)
└── Approach C (root node)
    └── Variant C1 → score 0.77

The experiment manager agent decides what to expand next. It isn’t random. It’s a greedy selection on estimated value.
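To make "greedy selection on estimated value" concrete, here is a minimal Python sketch of best-first expansion. It is not the repo's implementation; expand and evaluate are stand-ins for the agent editing and running code and for the evaluator scoring the result:

import heapq

def best_first_search(root_nodes, expand, evaluate, max_steps=21):
    # Keep a frontier ordered by estimated value (max-heap via negated scores).
    frontier = [(-evaluate(n), i, n) for i, n in enumerate(root_nodes)]
    heapq.heapify(frontier)
    counter = len(root_nodes)
    best = None
    for _ in range(max_steps):
        if not frontier:
            break
        neg_score, _, node = heapq.heappop(frontier)   # most promising node so far
        if best is None or -neg_score > best[0]:
            best = (-neg_score, node)
        for child in expand(node):                     # agent tweaks code, reruns experiment
            heapq.heappush(frontier, (-evaluate(child), counter, child))
            counter += 1
    return best

The pruning in the diagram above falls out of this loop naturally: low-scoring branches simply never make it back to the top of the frontier.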

Part 3: Comparing the Lab Assistants

AI Scientist-v2 is the second version. The first used fixed, human-authored templates per domain. That made v1 reliable but narrow. You could only run it in domains that had templates.

v2 removes the templates. It works across arbitrary ML domains. The tradeoff is explicit: lower success rate, higher flexibility.

| | v1 (AI Scientist) | v2 (AI Scientist-v2) |
|---|---|---|
| Templates | Required per domain | None |
| Flexibility | Narrow, task-specific | Broad, open-ended |
| Success rate | Higher | Lower |
| Use case | Clear objective, defined scope | Exploratory discovery |
| Tree search | No | Best-first (BFTS) |

The paper it produced in its first public run was accepted at an ICLR 2025 workshop. Not as a curiosity. Through actual peer review.

Part 4: What It Costs and Where It Breaks

The setup requires Linux with NVIDIA GPUs, CUDA, and PyTorch. There is no CPU-only path. The ideation step costs a few dollars in API calls. The main pipeline, using Claude 3.5 Sonnet for experiments, costs roughly $15-20 per run. Add ~$5 for the writeup phase with o1.

If the topic is too vague, the experiments meander. If the model can’t debug its own failing code, the branch gets abandoned (controlled by max_debug_depth and debug_prob in the config). Success is not guaranteed. The FAQ says so plainly.
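Those knobs live in the same config file. Roughly (again a sketch; check the exact names and nesting in bfts_config.yaml):

search:
  max_debug_depth: 3   # consecutive fix attempts before a failing branch is abandoned
  debug_prob: 0.5      # chance the agent tries to debug a buggy node instead of moving on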

It also executes LLM-written code on your machine. The README warns you to run this in a Docker container. That’s not boilerplate caution. The system can spawn processes, access the web, and install packages as part of its experimental loop. Run it in a sandbox.
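A sandboxed invocation might look like the following; the image name ai-scientist-v2 and the environment variables are illustrative (build the image yourself from the repo's dependencies), and --gpus all assumes the NVIDIA Container Toolkit is installed:

docker run --rm --gpus all \
  -v "$PWD":/workspace -w /workspace \
  -e OPENAI_API_KEY -e ANTHROPIC_API_KEY -e S2_API_KEY \
  ai-scientist-v2 \
  python launch_scientist_bfts.py \
    --load_ideas "ai_scientist/ideas/my_topic.json" \
    --load_code \
    --model_writeup o1-preview-2024-09-12 \
    --model_review gpt-4o-2024-11-20 \
    --num_cite_rounds 20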

The license is a custom “AI Scientist Source Code License.” If you publish anything generated by this system, you are legally required to disclose AI authorship. Not optional.

The real limitation isn’t technical. It’s epistemic. The system can iterate on experiments, but it cannot tell you if the research question was worth asking in the first place. That part is still yours.

My advisor was right about the bottleneck being iteration. He just never anticipated that we’d automate the iteration and find ourselves back at the same problem one layer up: figuring out what to iterate on.

Hoang Yell

A software developer and technical storyteller. I spend my time exploring the most interesting open-source repositories on GitHub and presenting them as accessible stories for everyone.