Project APE

external · status: active · focus: end-to-end · discipline: economics · started: 2026

Project page: https://ape.socialcatalystlab.org/

Source: projects/landscape/ape.yml

Positioning

An autonomous system that generates empirical economic policy research papers end-to-end from publicly available data. It then scores them in a TrueSkill tournament in which AI-generated papers compete head-to-head against peer-reviewed human benchmarks from AER and AEJ:Policy, with Gemini 3.1 Flash Lite as the judge. The motivating problem: "Most policies — probably millions of them globally — are never rigorously evaluated."
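To make the tournament scoring concrete, here is a minimal sketch of a two-player TrueSkill update applied to head-to-head paper comparisons. It is not APE's code. It assumes the commonly used default parameters (mu = 25, sigma = 25/3, beta = 25/6, tau = 25/300), no draws, and a judge verdict that has already been reduced to a (winner, loser) pair. The paper identifiers and the `run_tournament` helper are hypothetical.

```python
import math
from dataclasses import dataclass

# Commonly used TrueSkill defaults (assumed here, not taken from APE).
MU0 = 25.0
SIGMA0 = MU0 / 3    # prior uncertainty
BETA = MU0 / 6      # per-comparison performance noise
TAU = MU0 / 300     # small drift added before every update


@dataclass(frozen=True)
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    # erfc keeps precision for strongly negative x
    return 0.5 * math.erfc(-x / math.sqrt(2))


def update(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Two-player TrueSkill update for a decisive result (draws not modeled)."""
    var_w = winner.sigma ** 2 + TAU ** 2
    var_l = loser.sigma ** 2 + TAU ** 2
    c = math.sqrt(2 * BETA ** 2 + var_w + var_l)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # size of the mean shift
    w = v * (v + t)         # size of the variance shrink, in (0, 1)
    new_winner = Rating(winner.mu + var_w / c * v,
                        math.sqrt(var_w * (1 - var_w / c ** 2 * w)))
    new_loser = Rating(loser.mu - var_l / c * v,
                       math.sqrt(var_l * (1 - var_l / c ** 2 * w)))
    return new_winner, new_loser


def conservative(r: Rating) -> float:
    """mu - 3*sigma, a common convention for leaderboard ordering."""
    return r.mu - 3 * r.sigma


def run_tournament(matches: list[tuple[str, str]]) -> dict[str, Rating]:
    """Apply (winner, loser) verdicts in order, starting from default ratings."""
    ratings: dict[str, Rating] = {}
    for win_id, lose_id in matches:
        w, l = update(ratings.get(win_id, Rating()), ratings.get(lose_id, Rating()))
        ratings[win_id], ratings[lose_id] = w, l
    return ratings


# Hypothetical verdicts: each tuple is (winner, loser) as decided by a judge.
matches = [
    ("ape_paper_1", "aer_benchmark_1"),
    ("aer_benchmark_1", "ape_paper_2"),
    ("aer_benchmark_1", "ape_paper_1"),
]
ratings = run_tournament(matches)
for paper_id, r in sorted(ratings.items(), key=lambda kv: -conservative(kv[1])):
    print(f"{paper_id}: mu={r.mu:.2f} sigma={r.sigma:.2f}")
```

Ordering by mu - 3*sigma rather than mu alone keeps papers with few comparisons from ranking highly on thin evidence. Whether APE orders its leaderboard this way is not stated in the source.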

Distinctive contribution

Combines autonomous research generation with an explicit human-benchmark tournament — most catalog projects do one or the other. The TrueSkill ranking + AER/AEJ:Policy benchmark set is a reusable evaluation harness; everything (papers, code, data, failures) is published transparently. The project explicitly disclaims policy use: "None of the generated papers have been peer-reviewed and should not be used for evidence-based policy making."

Evaluation scores

Dimension	Score (0–3)	Note
Lifecycle coverage	3	Eight stages from hypothesis through replication-verification; full end-to-end loop.
Autonomy level	3	Generates papers autonomously without per-step human approval.
Architectural transparency	3	Stated commitment: 'everything is public — papers, code, data, failures.' TrueSkill scoring + benchmark set are open.
Inputs supported	2	Policy-question inputs; integrates public-data sources; uses AER / AEJ:Policy corpus as benchmark.
Outputs / reproducibility	3	Replication-verification step: 'Checks whether code executes and outputs match.' Full artifact transparency by design.
Internal evaluation	3	Tournament evaluation against peer-reviewed human-benchmark papers from top economics journals — strongest internal-eval design in the catalog.
Openness	3	Fully open: papers, code, data, failures.
Maturity / traction	1	Active hosted project in 2026; adoption / external use signals not yet clear.
Cross-family policy	1	Judge is Gemini 3.1 Flash Lite, distinct from execution back-end — implicit cross-family setup.
Runtime assurance	2	TrueSkill tournament vs human benchmarks + replication-verification step (code-execution + output-match check).
Cross-platform portability	1	Hosted service; back-end choice is the maintainers', not user-facing.

Scored on 2026-05-18. See the evaluation rubric.
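The replication-verification step quoted in the table ("Checks whether code executes and outputs match.") can be sketched as follows. This is a minimal illustration, not APE's implementation. It assumes a replication package is a single Python script that writes its output files next to itself, and that "match" means byte-identical files compared by SHA-256. The function names, file names, and demo script are hypothetical.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_replication(script: Path, expected: dict[str, str],
                       timeout: float = 600.0) -> dict:
    """Run `script` in its own directory, then compare output-file hashes.

    `expected` maps an output file name to its expected SHA-256 digest.
    """
    proc = subprocess.run(
        [sys.executable, script.name],
        cwd=script.parent,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    executed = proc.returncode == 0
    mismatched = []
    if executed:
        for name, digest in expected.items():
            out = script.parent / name
            if not out.is_file() or sha256_of(out) != digest:
                mismatched.append(name)
    return {
        "executed": executed,
        "mismatched": mismatched,
        "ok": executed and not mismatched,
    }


def demo() -> dict:
    """Build a throwaway replication package and verify it."""
    content = "coef,se\n0.12,0.03\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "analysis.py"
        # newline="" avoids platform-specific newline translation
        script.write_text(
            "with open('results.csv', 'w', newline='') as f:\n"
            f"    f.write({content!r})\n"
        )
        expected = {"results.csv": hashlib.sha256(content.encode()).hexdigest()}
        return verify_replication(script, expected)


print(demo())
```

Exact hash matching is the strictest possible reading of "outputs match". Numerical outputs that vary across platforms or library versions would need a tolerance-based comparison instead.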

Tags

Pipeline stages: hypothesis-generation research-design data-acquisition data-analysis code-generation paper-drafting referee-simulation replication

Architectural features: multi-agent tool-use artifact-versioning debate-consensus

Inputs: policy-question

Outputs: paper code data tournament-ranking

Data sources: public-policy-data

Knowledge sources: aer aej-policy

Limitations

Explicit author disclaimer that outputs should not be used for evidence-based policy making.
Tournament judge (Gemini 3.1 Flash Lite) may itself be biased; how reliable the judge is remains an open meta-evaluation question.
Economics-policy scope by design; portability to other empirical fields untested.
Maintainer identity (Social Catalyst Lab) and funding model not publicly detailed.
