Projects catalog
This catalog evaluates agentic-research systems against the
standard rubric.
Vocabularies for stages, architectural features, focus, and
disciplinary scope are defined in
projects/VOCABULARY.md.
The matrix and per-project pages below are auto-generated from
projects/*.yml and projects/landscape/*.yml by
scripts/build_indexes.py. Do not edit by hand — edit the YAML
sources.
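As a rough illustration of what generating a matrix row from a project record involves, the sketch below renders one record as a single-line Markdown table row. It is not the code in scripts/build_indexes.py: the function `render_row`, the record layout, and the lowercase score keys are all assumptions made for this example, and the record is a plain dict standing in for a parsed YAML file.

```python
# Score keys in matrix column order. These lowercase names are hypothetical;
# the real field names are defined by the project's schema.
SCORE_KEYS = ["lc", "aut", "arc", "in", "out", "eval", "open", "mat", "xf", "run", "port"]


def render_row(record: dict) -> str:
    """Render one project record as a single Markdown table row."""
    cells = [record["name"], record["type"], record["focus"]]
    cells += [str(record["scores"][key]) for key in SCORE_KEYS]
    cells.append(record["discipline"])
    return "| " + " | ".join(cells) + " |"


# Hypothetical record mirroring the Agent Laboratory row of the matrix.
example = {
    "name": "Agent Laboratory",
    "type": "external",
    "focus": "end-to-end",
    "scores": dict(zip(SCORE_KEYS, [3, 2, 3, 2, 2, 2, 3, 3, 0, 1, 1])),
    "discipline": "computer-science",
}

print(render_row(example))
```

Keeping the column order in one list means the header and every row can be derived from the same source, which is one way to keep the matrix and the YAML records from drifting apart.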
Comparison matrix
| Project | Type | Focus | LC | AUT | ARC | IN | OUT | EVAL | OPEN | MAT | XF | RUN | PORT | Discipline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E2ER — End-to-End Research | owned | end-to-end | 3 | 2 | 2 | 3 | 2 | 1 | 2 | 1 | 0 | 2 | 1 | economics |
| Academic Research Skills (ARS) | external | end-to-end | 2 | 1 | 3 | 3 | 3 | 3 | 2 | 3 | 1 | 3 | 3 | general |
| Agent Laboratory | external | end-to-end | 3 | 2 | 3 | 2 | 2 | 2 | 3 | 3 | 0 | 1 | 1 | computer-science |
| AlphaEvolve (Google DeepMind) | external | end-to-end | 1 | 3 | 1 | 1 | 2 | 3 | 0 | 2 | 0 | 3 | 0 | mathematics |
| Project APE | external | end-to-end | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 1 | 1 | 2 | 1 | economics |
| ARIS (Auto-Research-In-Sleep) | external | end-to-end | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 2 | 3 | 3 | computer-science |
| AstaBench (AI2) | external | end-to-end | 0 | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 1 | 1 | 2 | general |
| AutoResearchClaw | external | end-to-end | 3 | 2 | 2 | 3 | 2 | 2 | 3 | 3 | 1 | 3 | 3 | general |
| AutoSurvey | external | literature | 1 | 3 | 2 | 1 | 2 | 3 | 1 | 1 | 0 | 1 | 1 | general |
| Aviary (FutureHouse) | external | end-to-end | 0 | 2 | 3 | 2 | 3 | 2 | 3 | 2 | 1 | 1 | 2 | general |
| Clo-Author | external | end-to-end | 3 | 2 | 3 | 2 | 2 | 2 | 1 | 1 | 0 | 2 | 1 | economics |
| Coarse (coarse.ink) | external | review | 0 | 2 | 2 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 2 | general |
| CORAL | external | end-to-end | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 1 | 2 | 2 | general |
| data-to-paper | external | end-to-end | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 2 | 0 | 2 | 1 | general |
| DeepResearcher (GAIR-NLP) | external | literature | 1 | 3 | 3 | 2 | 2 | 3 | 3 | 2 | 0 | 2 | 1 | general |
| EvoScientist | external | end-to-end | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 1 | 2 | 3 | general |
| GPT Researcher | external | literature | 1 | 3 | 3 | 2 | 2 | 1 | 3 | 3 | 0 | 1 | 2 | general |
| Kosmos (jimmc414 implementation) | external | end-to-end | 2 | 3 | 3 | 2 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | general |
| MARG (Multi-Agent Review Generation) | external | review | 0 | 2 | 3 | 1 | 3 | 2 | 3 | 1 | 0 | 1 | 0 | general |
| MLGym (Meta) | external | end-to-end | 0 | 2 | 3 | 2 | 3 | 2 | 2 | 2 | 0 | 1 | 1 | computer-science |
| Open CoScientist Agents | external | ideation | 1 | 3 | 3 | 2 | 1 | 1 | 3 | 1 | 3 | 2 | 1 | general |
| OpenScholar (AI2) | external | literature | 0 | 2 | 3 | 2 | 2 | 3 | 3 | 2 | 1 | 1 | 1 | general |
| PaperQA2 (FutureHouse) | external | literature | 0 | 2 | 3 | 2 | 2 | 3 | 3 | 3 | 1 | 3 | 3 | general |
| PaperCoder (Paper2Code) | external | replication | 1 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 0 | 2 | 2 | computer-science |
| RECAST (Replication and Extension with Causal AI Statistical Toolkit) | external | replication | 2 | 2 | 3 | 2 | 2 | 2 | 3 | 1 | 0 | 3 | 1 | econometrics |
| Refine (refine.ink) | external | review | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 0 | 1 | 0 | general |
| ResearchTown | external | ideation | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 0 | 1 | 1 | general |
| ResearchAgent (NAACL 2025) | external | ideation | 1 | 2 | 3 | 2 | 2 | 2 | 1 | 1 | 0 | 2 | 0 | general |
| Reviewer (Ingar30) | external | review | 0 | 2 | 3 | 1 | 2 | 1 | 3 | 1 | 0 | 2 | 0 | economics |
| Robin (FutureHouse) | external | end-to-end | 2 | 2 | 3 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | biomedical |
| Sakana AI Scientist v2 | external | end-to-end | 2 | 3 | 3 | 1 | 2 | 2 | 3 | 2 | 0 | 1 | 0 | computer-science |
| Sakana AI Scientist (v1) | external | end-to-end | 2 | 3 | 3 | 1 | 2 | 2 | 2 | 3 | 0 | 1 | 0 | computer-science |
| Social Science Replicability Infrastructure | external | replication | 1 | 2 | 2 | 2 | 2 | 1 | 3 | 1 | 0 | 2 | 1 | social-sciences |
| STORM / Co-STORM | external | literature | 1 | 2 | 3 | 2 | 2 | 2 | 3 | 3 | 0 | 1 | 2 | general |
| SurveyX | external | literature | 1 | 3 | 2 | 1 | 1 | 2 | 1 | 2 | 0 | 1 | 1 | general |
| Tongyi DeepResearch | external | literature | 1 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 0 | 1 | 2 | general |
| ToolUniverse | external | end-to-end | 0 | 2 | 3 | 3 | 2 | 2 | 3 | 2 | 1 | 2 | 2 | biomedical |
| zeropaper (Auto AI Research Template) | external | end-to-end | 3 | 3 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 3 | 1 | finance |
| Zochi (Intology) | external | end-to-end | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 2 | 0 | 2 | 1 | computer-science |
Score columns: LC = lifecycle coverage, AUT = autonomy, ARC = architectural transparency, IN = inputs supported, OUT = outputs/reproducibility, EVAL = internal evaluation, OPEN = openness, MAT = maturity/traction, XF = cross-family policy, RUN = runtime assurance, PORT = cross-platform portability. Scale 0–3. See the evaluation rubric.
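The column list and the 0–3 scale above lend themselves to a simple sanity check before a record is merged. The following is a minimal sketch, assuming scores arrive as a dict keyed by the column abbreviations; the function name `check_scores` and that dict layout are hypothetical and not taken from the repository.

```python
# The eleven score columns, in the order used by the comparison matrix.
SCORE_COLUMNS = ("LC", "AUT", "ARC", "IN", "OUT", "EVAL", "OPEN", "MAT", "XF", "RUN", "PORT")


def check_scores(scores: dict) -> list:
    """Return a list of problems; an empty list means the scores look valid."""
    problems = []
    for column in SCORE_COLUMNS:
        if column not in scores:
            problems.append(f"missing score: {column}")
            continue
        value = scores[column]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 3:
            problems.append(f"{column} must be an integer from 0 to 3, got {value!r}")
    for column in scores:
        if column not in SCORE_COLUMNS:
            problems.append(f"unknown score column: {column}")
    return problems


# Scores copied from the E2ER row of the matrix.
valid = dict(zip(SCORE_COLUMNS, [3, 2, 2, 3, 2, 1, 2, 1, 0, 2, 1]))
print(check_scores(valid))
```

Returning a list of problems rather than raising on the first one lets a contributor see every issue with a record in a single pass.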
One-line summaries
- E2ER — End-to-End Research — E2ER is a strategist-driven agentic research pipeline that takes a research idea (human- or agent-supplied) and carries it through literature synthesis, identification, data acquisition, analysis, and paper drafting.
- Academic Research Skills (ARS) — A comprehensive Claude Code plugin suite (v3.9.0 at scoring date) for the academic research pipeline: literature → write → review → revise → finalize.
- Agent Laboratory — An end-to-end autonomous research workflow (arXiv:2501.04227) that guides a research idea through three phases — literature review, experimentation, and report writing — with specialized LLM-driven agents and external tools (arXiv, Hugging Face, Python, LaTeX).
- AlphaEvolve (Google DeepMind) — A Gemini-powered evolutionary coding agent that combines LLM generative capabilities with automated evaluators in an iterative propose-test-refine loop.
- Project APE — An autonomous system that generates empirical economic policy research papers end-to-end from publicly available data, then scores them via a TrueSkill tournament in which AI-generated papers compete head-to-head against peer-reviewed human benchmarks from AER and AEJ:Policy (judged by Gemini 3.1 Flash Lite).
- ARIS (Auto-Research-In-Sleep) — An open-source research harness for autonomous ML research (arXiv:2605.03042) built around cross-model adversarial collaboration: an executor model drives forward progress while a reviewer from a different model family critiques intermediate artifacts and requests revisions.
- AstaBench (AI2) — An evaluation framework from AI2 for measuring scientific-research abilities of AI agents.
- AutoResearchClaw — An autonomous research pipeline taking a chat-level idea to a full paper via ACP-compatible agent back-ends (Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI).
- AutoSurvey — A NeurIPS 2024 framework (arXiv:2406.10252) for automatically generating comprehensive literature surveys from a topic and a paper database.
- Aviary (FutureHouse) — A gymnasium for defining custom language-agent environments (arXiv:2412.21154), with pre-built environments for math, general knowledge, biological sequences, scientific literature search, and protein stability.
- Clo-Author — A Claude Code scaffold for empirical economics research, spanning literature review through journal submission.
- Coarse (coarse.ink) — A web-based AI peer-review service: users upload academic papers (up to 50 MB) and receive AI-generated referee reports with 20+ detailed comments.
- CORAL — Infrastructure (arXiv:2604.01658) for multi-agent autonomous self-evolution — organizations of AI agents that run experiments, share knowledge through persistent stores, and continuously improve solutions against a user-supplied grading script.
- data-to-paper — An end-to-end framework that takes annotated data and produces backward-traceable scientific manuscripts: every numeric value in the output can be click-traced to the specific code line that generated it.
- DeepResearcher (GAIR-NLP) — An end-to-end RL-trained deep-research agent (arXiv:2504.03160) that learns to plan, retrieve, cross-validate, and self-reflect via reinforcement learning in real-world web environments rather than in simulated retrieval.
- EvoScientist — A self-evolving AI scientist system (arXiv:2603.08127) built on the DeepAgents framework.
- GPT Researcher — An autonomous "deep research" agent that produces long-form, cited reports on any topic from web and local sources.
- Kosmos (jimmc414 implementation) — An open-source implementation of the Kosmos AI scientist architecture (Lu et al., arXiv:2511.02824), adapted to run via Claude Code or the Anthropic / OpenAI APIs.
- MARG (Multi-Agent Review Generation) — A research artifact (arXiv:2401.04259) and reusable demo for generating peer reviews of scientific papers using multiple specialized agents.
- MLGym (Meta) — A gym-style framework and benchmark (MLGym-Bench, arXiv:2502.14499) for advancing AI research agents on 13 diverse ML research tasks (CV, NLP, RL, game theory).
- Open CoScientist Agents — An open-source implementation of Google DeepMind's AI co-scientist (arXiv:2502.18864), built on LangGraph and GPT Researcher.
- OpenScholar (AI2) — A retrieval-augmented LM designed to answer scientific queries by searching the literature and generating responses grounded in sources.
- PaperQA2 (FutureHouse) — A high-accuracy retrieval-augmented generation package focused on scientific PDFs (and Office docs, source code).
- PaperCoder (Paper2Code) — An ICLR 2026 multi-agent system (arXiv:2504.17192) that transforms a machine-learning paper into a working code repository via a three-stage pipeline (planning, analysis, code generation) with specialized agents per stage.
- RECAST (Replication and Extension with Causal AI Statistical Toolkit) — An end-to-end autonomous pipeline for the replication + extension + peer-review arc of the RISE concept diagram.
- Refine (refine.ink) — A commercial AI peer-review service that returns reviewer-grade feedback on academic papers within ~20–40 minutes by running its compute jobs in parallel (~2+ hours of total compute per review).
- ResearchTown — An ICML 2025 multi-agent platform for community-level automatic research simulation.
- ResearchAgent (NAACL 2025) — The NAACL 2025 reference implementation (arXiv:2404.07738) of iterative research idea generation over scientific literature.
- Reviewer (Ingar30) — A reproducible multi-agent reviewer for academic economics papers.
- Robin (FutureHouse) — A multi-agent system for automating scientific discovery (arXiv:2505.13400), with explicit support for hypothesis generation, experiment design, and data analysis.
- Sakana AI Scientist v2 — An autonomous "AI scientist" pipeline that ideates, runs experiments (primarily ML), drafts a paper, and self-reviews.
- Sakana AI Scientist (v1) — The original AI Scientist release (arXiv:2408.06292): an end-to-end agentic pipeline that ideates, runs experiments, and writes a paper with self-review on a fixed set of CS templates (NanoGPT, 2D Diffusion, Grokking).
- Social Science Replicability Infrastructure — Infrastructure aimed at the replication stage of the RISE pipeline: given a published paper, attempt to reproduce its empirical results in an automated or semi-automated fashion.
- STORM / Co-STORM — An LLM-powered knowledge-curation system that writes Wikipedia-style long-form articles from web search.
- SurveyX — An academic survey-automation system (arXiv:2502.14776) that generates domain-specific surveys from a paper title plus retrieval keywords.
- Tongyi DeepResearch — An agentic large language model purpose-built for long-horizon deep-information-seeking tasks (arXiv:2510.24701), shipped both as open weights (30.5B total / 3.3B active) and as inference code with ReAct and 'Heavy' (IterResearch) modes.
- ToolUniverse — A curated tool registry and MCP server (arXiv:2509.23426) that packages biomedical, chemical, and general scientific APIs into a uniform agent-callable surface.
- zeropaper (Auto AI Research Template) — An autonomous research-paper pipeline that uses Claude Code, Codex, or Gemini CLI as the subagent dispatcher.
- Zochi (Intology) — An end-to-end "artificial scientist" system from Intology, claimed to span hypothesis generation through to peer-reviewed publication.
How to add a project
- Copy projects/landscape/sakana-ai-scientist.yml as a template.
- Fill in fields per projects/schema.md.
- Score it against projects/EVALUATION.md.
- Open a pull request.
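The first step above can be scripted. This is a small sketch, assuming the new record lives next to the template in projects/landscape/ and is named after a short slug; the helper `start_project_file` and the `<slug>.yml` naming are illustrative assumptions, so check projects/schema.md for the actual conventions.

```python
from pathlib import Path
import shutil


def start_project_file(repo_root: Path, slug: str) -> Path:
    """Copy the template record to projects/landscape/<slug>.yml and return the new path."""
    landscape = repo_root / "projects" / "landscape"
    template = landscape / "sakana-ai-scientist.yml"
    target = landscape / f"{slug}.yml"  # hypothetical naming convention
    if target.exists():
        # Refuse to overwrite an existing record.
        raise FileExistsError(f"{target} already exists")
    shutil.copyfile(template, target)
    return target
```

Calling `start_project_file(Path("."), "my-project")` from the repository root would create projects/landscape/my-project.yml, ready to be filled in and scored.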