AstaBench (AI2)

external · status: active · focus: end-to-end · discipline: general · started: 2025

Project page: https://github.com/allenai/asta-bench

Source: projects/landscape/asta-bench.yml

Positioning

An evaluation framework from AI2 for measuring the scientific-research abilities of AI agents. It comprises 2,400+ examples across 11 benchmarks covering literature search, code execution, data analysis, and end-to-end discovery, and is built on the InspectAI framework. It sits in the RISE evaluation infrastructure layer alongside Aviary and MLGym.

Distinctive contribution

The most tightly scoped benchmark suite for scholarly-research agent abilities. It is neither a generic agent benchmark nor a set of domain-specific science tasks (cf. BixBench), but a curated spectrum of research skills with standardized tools and execution environments, so that runs can be compared fairly on both quality and efficiency.
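
One way to read "efficiency-comparable" is that agents are judged on score and cost together rather than on score alone. The sketch below is a minimal illustration of that idea and is not taken from AstaBench itself: the `Run` record, the `pareto_frontier` helper, the agent names, and all numbers are hypothetical. It keeps only the runs that no other run beats on both score and cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One hypothetical agent run: a name, a score in [0, 1], and a cost in USD."""
    agent: str
    score: float
    cost_usd: float


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that are not dominated by any other run.

    A run is dominated if another run scores at least as well, costs at most
    as much, and is strictly better on at least one of the two.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.score >= r.score
            and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    # Cheapest first, so the frontier reads as "what each extra dollar buys".
    return sorted(frontier, key=lambda r: r.cost_usd)


# Illustrative numbers only; these are NOT AstaBench results.
runs = [
    Run("agent-a", score=0.40, cost_usd=0.10),
    Run("agent-b", score=0.55, cost_usd=0.80),
    Run("agent-c", score=0.50, cost_usd=1.20),  # dominated by agent-b
    Run("agent-d", score=0.70, cost_usd=3.00),
]

frontier_names = [r.agent for r in pareto_frontier(runs)]
print(frontier_names)
```

A comparison like this only makes sense when every agent runs against the same tools and environment, which is the role the standardized setup plays.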

Evaluation scores

Dimension	Score (0–3)	Note
Lifecycle coverage	0	Evaluation infrastructure; does not produce scholarship itself.
Autonomy level	2	Supervised: user submits an agent; AstaBench scores it across the 11 benchmarks.
Architectural transparency	3	Open under Apache-2.0; AstaBench paper at allenai.org; built on documented InspectAI framework.
Inputs supported	3	11 benchmarks spanning research skills; standardized agent interface; leaderboard submission supported.
Outputs / reproducibility	3	Docker-based execution; standardized scoring; decoupled solve/score for cross-version comparison.
Internal evaluation	2	Self-application: AI2 uses AstaBench to evaluate its own agents; broader cross-system results on leaderboard.
Openness	3	Apache-2.0; AI2 institutional backing; public leaderboard.
Maturity / traction	2	104 GitHub stars; active development; AI2 institutional backing; recent (2025–2026).
Cross-family policy	1	InspectAI framework allows cross-family agent submissions; not a policy on the system itself.
Runtime assurance	1	Per-task scoring against rubrics; not a runtime audit during pipeline execution.
Cross-platform portability	2	InspectAI compatibility + Docker + decoupled solve/score paths.

Scored on 2026-05-18. See the evaluation rubric.
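
For a quick profile of the scores above, the per-dimension values can be tallied. The rubric is not reproduced here, so the unweighted sum and mean in this sketch are an assumption made for illustration, not an official aggregate. The variable names are likewise illustrative.

```python
# Scores transcribed from the table above (each dimension is rated 0-3).
scores = {
    "Lifecycle coverage": 0,
    "Autonomy level": 2,
    "Architectural transparency": 3,
    "Inputs supported": 3,
    "Outputs / reproducibility": 3,
    "Internal evaluation": 2,
    "Openness": 3,
    "Maturity / traction": 2,
    "Cross-family policy": 1,
    "Runtime assurance": 1,
    "Cross-platform portability": 2,
}

MAX_PER_DIMENSION = 3

# Assumption: all dimensions are weighted equally.
total = sum(scores.values())
maximum = MAX_PER_DIMENSION * len(scores)
mean = total / len(scores)

# Dimensions at the top and bottom of the scale.
strongest = [name for name, s in scores.items() if s == MAX_PER_DIMENSION]
weakest = [name for name, s in scores.items() if s <= 1]

print(f"total: {total}/{maximum}, mean: {mean:.2f}")
print("strongest:", ", ".join(strongest))
print("weakest:", ", ".join(weakest))
```

Under that equal-weight assumption the total is 22 of 33. The lowest scores fall on lifecycle coverage, cross-family policy, and runtime assurance, consistent with the notes describing AstaBench as evaluation infrastructure rather than a scholarship-producing system.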

Tags

Pipeline stages: literature-discovery data-analysis code-generation

Architectural features: tool-use artifact-versioning

Inputs: task-specification agent-implementation

Outputs: agent-trajectories leaderboard-submissions efficiency-metrics

Data sources: benchmark-tasks

Knowledge sources: paper-corpus

Limitations

As evaluation infrastructure, its value depends on the downstream agent systems being benchmarked.
Requires Docker for the recommended path.
Sub-tasks vary in maturity; the 11-benchmark scope is broad but not exhaustive.

Papers describing this project

Bragg, J., D'Arcy, M., Balepur, N., Bareket, D., Dalvi, B., Feldman, S., et al. (2025). AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite. arXiv:2510.21652

Wu, J. et al. (2025). Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools.