Projects catalog

This catalog evaluates agentic-research systems against the standard rubric. Vocabularies for stages, architectural features, focus, and disciplinary scope are defined in projects/VOCABULARY.md.

The matrix and per-project pages below are auto-generated from projects/*.yml and projects/landscape/*.yml by scripts/build_indexes.py. Do not edit by hand — edit the YAML sources.
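As a rough illustration of what the generation step does, the sketch below turns one already-parsed project record into the cells of a matrix row. It is not the actual scripts/build_indexes.py: the record layout (the keys `name`, `type`, `focus`, `scores`, `discipline`) and the helper `matrix_row` are assumptions made for this example, and the real field names are whatever projects/schema.md defines.

```python
# Hypothetical sketch, not the real scripts/build_indexes.py.
# The record keys below are assumed for illustration; see projects/schema.md.

SCORE_COLUMNS = ["LC", "AUT", "ARC", "IN", "OUT", "EVAL",
                 "OPEN", "MAT", "XF", "RUN", "PORT"]


def matrix_row(record: dict) -> list[str]:
    """Flatten one parsed project record into the cells of a matrix row."""
    scores = record["scores"]
    return (
        [record["name"], record["type"], record["focus"]]
        + [str(scores[col]) for col in SCORE_COLUMNS]
        + [record["discipline"]]
    )


# Example record mirroring the Agent Laboratory row of the matrix.
example = {
    "name": "Agent Laboratory",
    "type": "external",
    "focus": "end-to-end",
    "scores": dict(zip(SCORE_COLUMNS, [3, 2, 3, 2, 2, 2, 3, 3, 0, 1, 1])),
    "discipline": "computer-science",
}

row = matrix_row(example)
print(row)
```

Keeping the column order in a single list means the header and every row are built from the same source, so a column cannot silently drift out of position.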

Comparison matrix

| Project | Type | Focus | LC | AUT | ARC | IN | OUT | EVAL | OPEN | MAT | XF | RUN | PORT | Discipline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E2ER — End-to-End Research | owned | end-to-end | 3 | 2 | 2 | 3 | 2 | 1 | 2 | 1 | 0 | 2 | 1 | economics |
| Academic Research Skills (ARS) | external | end-to-end | 2 | 1 | 3 | 3 | 3 | 3 | 2 | 3 | 1 | 3 | 3 | general |
| Agent Laboratory | external | end-to-end | 3 | 2 | 3 | 2 | 2 | 2 | 3 | 3 | 0 | 1 | 1 | computer-science |
| AlphaEvolve (Google DeepMind) | external | end-to-end | 1 | 3 | 1 | 1 | 2 | 3 | 0 | 2 | 0 | 3 | 0 | mathematics |
| Project APE | external | end-to-end | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 1 | 1 | 2 | 1 | economics |
| ARIS (Auto-Research-In-Sleep) | external | end-to-end | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 2 | 3 | 3 | computer-science |
| AstaBench (AI2) | external | end-to-end | 0 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 1 | 1 | 2 | general |
| AutoResearchClaw | external | end-to-end | 3 | 2 | 2 | 3 | 2 | 2 | 3 | 3 | 1 | 3 | 3 | general |
| AutoSurvey | external | literature | 1 | 3 | 2 | 1 | 2 | 3 | 1 | 1 | 0 | 1 | 1 | general |
| Aviary (FutureHouse) | external | end-to-end | 0 | 2 | 3 | 2 | 3 | 2 | 3 | 2 | 1 | 1 | 2 | general |
| Clo-Author | external | end-to-end | 3 | 2 | 3 | 2 | 2 | 2 | 1 | 1 | 0 | 2 | 1 | economics |
| Coarse (coarse.ink) | external | review | 0 | 2 | 2 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 2 | general |
| CORAL | external | end-to-end | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 1 | 2 | 2 | general |
| data-to-paper | external | end-to-end | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 2 | 0 | 2 | 1 | general |
| DeepResearcher (GAIR-NLP) | external | literature | 1 | 3 | 3 | 2 | 2 | 3 | 3 | 2 | 0 | 2 | 1 | general |
| EvoScientist | external | end-to-end | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 1 | 2 | 3 | general |
| GPT Researcher | external | literature | 1 | 3 | 3 | 2 | 2 | 1 | 3 | 3 | 0 | 1 | 2 | general |
| Kosmos (jimmc414 implementation) | external | end-to-end | 2 | 3 | 3 | 2 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | general |
| MARG (Multi-Agent Review Generation) | external | review | 0 | 2 | 3 | 1 | 3 | 2 | 3 | 1 | 0 | 1 | 0 | general |
| MLGym (Meta) | external | end-to-end | 0 | 2 | 3 | 2 | 3 | 2 | 2 | 2 | 0 | 1 | 1 | computer-science |
| Open CoScientist Agents | external | ideation | 1 | 3 | 3 | 2 | 1 | 1 | 3 | 1 | 3 | 2 | 1 | general |
| OpenScholar (AI2) | external | literature | 0 | 2 | 3 | 2 | 2 | 3 | 3 | 2 | 1 | 1 | 1 | general |
| PaperQA2 (FutureHouse) | external | literature | 0 | 2 | 3 | 2 | 2 | 3 | 3 | 3 | 1 | 3 | 3 | general |
| PaperCoder (Paper2Code) | external | replication | 1 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 0 | 2 | 2 | computer-science |
| RECAST (Replication and Extension with Causal AI Statistical Toolkit) | external | replication | 2 | 2 | 3 | 2 | 2 | 2 | 3 | 1 | 0 | 3 | 1 | econometrics |
| Refine (refine.ink) | external | review | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 0 | 1 | 0 | general |
| ResearchTown | external | ideation | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 0 | 1 | 1 | general |
| ResearchAgent (NAACL 2025) | external | ideation | 1 | 2 | 3 | 2 | 2 | 2 | 1 | 1 | 0 | 2 | 0 | general |
| Reviewer (Ingar30) | external | review | 0 | 2 | 3 | 1 | 2 | 1 | 3 | 1 | 0 | 2 | 0 | economics |
| Robin (FutureHouse) | external | end-to-end | 2 | 2 | 3 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | biomedical |
| Sakana AI Scientist v2 | external | end-to-end | 2 | 3 | 3 | 1 | 2 | 2 | 3 | 2 | 0 | 1 | 0 | computer-science |
| Sakana AI Scientist (v1) | external | end-to-end | 2 | 3 | 3 | 1 | 2 | 2 | 2 | 3 | 0 | 1 | 0 | computer-science |
| Social Science Replicability Infrastructure | external | replication | 1 | 2 | 2 | 2 | 2 | 1 | 3 | 1 | 0 | 2 | 1 | social-sciences |
| STORM / Co-STORM | external | literature | 1 | 2 | 3 | 2 | 2 | 2 | 3 | 3 | 0 | 1 | 2 | general |
| SurveyX | external | literature | 1 | 3 | 2 | 1 | 1 | 2 | 1 | 2 | 0 | 1 | 1 | general |
| Tongyi DeepResearch | external | literature | 1 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 0 | 1 | 2 | general |
| ToolUniverse | external | end-to-end | 0 | 2 | 3 | 3 | 2 | 2 | 3 | 2 | 1 | 2 | 2 | biomedical |
| zeropaper (Auto AI Research Template) | external | end-to-end | 3 | 3 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 3 | 1 | finance |
| Zochi (Intology) | external | end-to-end | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 2 | 0 | 2 | 1 | computer-science |

Score columns: LC = lifecycle coverage, AUT = autonomy, ARC = architectural transparency, IN = inputs supported, OUT = outputs/reproducibility, EVAL = internal evaluation, OPEN = openness, MAT = maturity/traction, XF = cross-family policy, RUN = runtime assurance, PORT = cross-platform portability. Scale 0–3. See the evaluation rubric.
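The column list and 0–3 scale above lend themselves to a mechanical check. The sketch below is one way such a check could look. It assumes scores arrive as a mapping from column abbreviation to integer, which is a guess about the data layout rather than something the rubric states, and `score_problems` is a hypothetical helper name.

```python
# Hypothetical helper: checks a score mapping against the column list and
# the 0-3 scale described above. The dict layout is an assumption.

SCORE_COLUMNS = ("LC", "AUT", "ARC", "IN", "OUT", "EVAL",
                 "OPEN", "MAT", "XF", "RUN", "PORT")


def score_problems(scores: dict) -> list[str]:
    """Return human-readable problems; an empty list means the scores pass."""
    problems = []
    for col in SCORE_COLUMNS:
        if col not in scores:
            problems.append(f"{col}: missing")
            continue
        value = scores[col]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{col}: expected an integer, got {value!r}")
        elif not 0 <= value <= 3:
            problems.append(f"{col}: {value} is outside the 0-3 scale")
    for col in scores:
        if col not in SCORE_COLUMNS:
            problems.append(f"{col}: unknown column")
    return problems


# Scores copied from the AstaBench (AI2) row of the matrix.
good = dict(zip(SCORE_COLUMNS, [0, 2, 3, 3, 3, 2, 3, 2, 1, 1, 2]))

# A deliberately broken copy: one out-of-range value, one missing column.
bad = {**good, "XF": 4}
del bad["PORT"]

print(score_problems(good))  # []
print(score_problems(bad))   # ['XF: 4 is outside the 0-3 scale', 'PORT: missing']
```

Returning a list of problems instead of raising on the first one lets a contributor see every issue with a new entry in a single pass.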

One-line summaries

  • E2ER — End-to-End Research — E2ER is a strategist-driven agentic research pipeline that takes a research idea (human- or agent-supplied) and carries it through literature synthesis, identification, data acquisition, analysis, and paper drafting.
  • Academic Research Skills (ARS) — A comprehensive Claude Code plugin suite (v3.9.0 at scoring date) for the academic research pipeline: literature → write → review → revise → finalize.
  • Agent Laboratory — An end-to-end autonomous research workflow (arXiv:2501.04227) that guides a research idea through three phases — literature review, experimentation, and report writing — with specialized LLM-driven agents and external tools (arXiv, Hugging Face, Python, LaTeX).
  • AlphaEvolve (Google DeepMind) — A Gemini-powered evolutionary coding agent that combines LLM generative capabilities with automated evaluators in an iterative propose-test-refine loop.
  • Project APE — An autonomous system that generates empirical economic policy research papers end-to-end from publicly available data, then scores them via a TrueSkill tournament in which AI-generated papers compete head-to-head against peer-reviewed human benchmarks from AER and AEJ:Policy (judged by Gemini 3.1 Flash Lite).
  • ARIS (Auto-Research-In-Sleep) — An open-source research harness for autonomous ML research (arXiv:2605.03042) built around cross-model adversarial collaboration: an executor model drives forward progress while a reviewer from a different model family critiques intermediate artifacts and requests revisions.
  • AstaBench (AI2) — An evaluation framework from AI2 for measuring scientific-research abilities of AI agents.
  • AutoResearchClaw — An autonomous research pipeline taking a chat-level idea to a full paper via ACP-compatible agent back-ends (Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI).
  • AutoSurvey — A NeurIPS 2024 framework (arXiv:2406.10252) for automatically generating comprehensive literature surveys from a topic and a paper database.
  • Aviary (FutureHouse) — A gymnasium for defining custom language-agent environments (arXiv:2412.21154), with pre-built environments for math, general knowledge, biological sequences, scientific literature search, and protein stability.
  • Clo-Author — A Claude Code scaffold for empirical economics research, spanning literature review through journal submission.
  • Coarse (coarse.ink) — A web-based AI peer-review service: users upload academic papers (up to 50 MB) and receive AI-generated referee reports with 20+ detailed comments.
  • CORAL — Infrastructure (arXiv:2604.01658) for multi-agent autonomous self-evolution — organizations of AI agents that run experiments, share knowledge through persistent stores, and continuously improve solutions against a user-supplied grading script.
  • data-to-paper — An end-to-end framework that takes annotated data and produces backward-traceable scientific manuscripts: every numeric value in the output can be click-traced to the specific code line that generated it.
  • DeepResearcher (GAIR-NLP) — An end-to-end RL-trained deep-research agent (arXiv:2504.03160) that learns to plan, retrieve, cross-validate, and self-reflect via reinforcement learning in real-world web environments rather than in simulated retrieval.
  • EvoScientist — A self-evolving AI scientist system (arXiv:2603.08127) built on the DeepAgents framework.
  • GPT Researcher — An autonomous "deep research" agent that produces long-form, cited reports on any topic from web and local sources.
  • Kosmos (jimmc414 implementation) — An open-source implementation of the Kosmos AI scientist architecture (Lu et al., arXiv:2511.02824), adapted to run via Claude Code or the Anthropic / OpenAI APIs.
  • MARG (Multi-Agent Review Generation) — A research artifact (arXiv:2401.04259) and reusable demo for generating peer reviews of scientific papers using multiple specialized agents.
  • MLGym (Meta) — A gym-style framework and benchmark (MLGym-Bench, arXiv:2502.14499) for advancing AI research agents on 13 diverse ML research tasks (CV, NLP, RL, game theory).
  • Open CoScientist Agents — An open-source implementation of Google DeepMind's AI co-scientist (arXiv:2502.18864), built on LangGraph and GPT Researcher.
  • OpenScholar (AI2) — A retrieval-augmented LM designed to answer scientific queries by searching the literature and generating responses grounded in sources.
  • PaperQA2 (FutureHouse) — A high-accuracy retrieval-augmented generation package focused on scientific PDFs (and Office docs, source code).
  • PaperCoder (Paper2Code) — An ICLR 2026 multi-agent system (arXiv:2504.17192) that transforms a machine-learning paper into a working code repository via a three-stage pipeline (planning, analysis, code generation) with specialized agents per stage.
  • RECAST (Replication and Extension with Causal AI Statistical Toolkit) — An end-to-end autonomous pipeline for the replication + extension + peer-review arc of the RISE concept diagram.
  • Refine (refine.ink) — A commercial AI peer-review service that returns reviewer-grade feedback on academic papers in ~20–40 minutes of wall-clock time; each review runs as parallel jobs that together consume ~2+ hours of compute.
  • ResearchTown — An ICML 2025 multi-agent platform for community-level automatic research simulation.
  • ResearchAgent (NAACL 2025) — The NAACL 2025 reference implementation (arXiv:2404.07738) of iterative research idea generation over scientific literature.
  • Reviewer (Ingar30) — A reproducible multi-agent reviewer for academic economics papers.
  • Robin (FutureHouse) — A multi-agent system for automating scientific discovery (arXiv:2505.13400), with explicit support for hypothesis generation, experiment design, and data analysis.
  • Sakana AI Scientist v2 — An autonomous "AI scientist" pipeline that ideates, runs experiments (primarily ML), drafts a paper, and self-reviews.
  • Sakana AI Scientist (v1) — The original AI Scientist release (arXiv:2408.06292): an end-to-end agentic pipeline that ideates, runs experiments, and writes a paper with self-review on a fixed set of CS templates (NanoGPT, 2D Diffusion, Grokking).
  • Social Science Replicability Infrastructure — Infrastructure aimed at the replication stage of the RISE pipeline: given a published paper, attempt to reproduce its empirical results in an automated or semi-automated fashion.
  • STORM / Co-STORM — An LLM-powered knowledge-curation system that writes Wikipedia-style long-form articles from web search.
  • SurveyX — An academic survey-automation system (arXiv:2502.14776) that generates domain-specific surveys from a paper title plus retrieval keywords.
  • Tongyi DeepResearch — An agentic large language model purpose-built for long-horizon deep-information-seeking tasks (arXiv:2510.24701), shipped both as open weights (30.5B total / 3.3B active) and as inference code with ReAct and 'Heavy' (IterResearch) modes.
  • ToolUniverse — A curated tool registry and MCP server (arXiv:2509.23426) that packages biomedical, chemical, and general scientific APIs into a uniform agent-callable surface.
  • zeropaper (Auto AI Research Template) — An autonomous research-paper pipeline that uses Claude Code, Codex, or Gemini CLI as the subagent dispatcher.
  • Zochi (Intology) — An end-to-end "artificial scientist" system from Intology, claimed to span hypothesis generation through to peer-reviewed publication.

How to add a project

  1. Copy projects/landscape/sakana-ai-scientist.yml as a template.
  2. Fill in fields per projects/schema.md.
  3. Score it against projects/EVALUATION.md.
  4. Open a pull request.
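
Step 1 above can be scripted. The sketch below copies the template to a new file next to it. It assumes new entries live in projects/landscape/ and are named after a short slug; both the naming convention and the helper `start_project_file` are assumptions for illustration, not documented repository behavior.

```python
import shutil
from pathlib import Path


def start_project_file(repo_root: Path, slug: str) -> Path:
    """Copy the template YAML to projects/landscape/<slug>.yml and return the new path.

    The target location and slug-based file name are assumed conventions.
    """
    landscape = repo_root / "projects" / "landscape"
    template = landscape / "sakana-ai-scientist.yml"
    target = landscape / f"{slug}.yml"
    if target.exists():
        # Refuse to overwrite an existing entry.
        raise FileExistsError(f"{target} already exists")
    shutil.copyfile(template, target)
    return target
```

Called from the repository root as `start_project_file(Path("."), "my-project")` (with `my-project` standing in for a real slug), this leaves a copy of the template ready to be filled in per projects/schema.md and scored against projects/EVALUATION.md.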