Project APE
external · status: active · focus: end-to-end · discipline: economics · started: 2026
Project page: https://ape.socialcatalystlab.org/
Source: projects/landscape/ape.yml
Positioning
An autonomous system that generates empirical economic policy research papers end-to-end from publicly available data, then ranks them in a TrueSkill tournament: AI-generated papers compete head-to-head against peer-reviewed human benchmarks from AER and AEJ:Policy, with Gemini 3.1 Flash Lite as the judge. The motivating problem: "Most policies — probably millions of them globally — are never rigorously evaluated."
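To make the tournament mechanics concrete, here is a minimal sketch of a two-player TrueSkill update for a single decisive comparison (no draws, no skill-drift term). It is not taken from the project's code: the `Rating` class, the `update` function, and the prior constants (the conventional TrueSkill defaults of 25, 25/3, and 25/6) are assumptions for illustration, and the project's actual parameters may differ.

```python
import math
from dataclasses import dataclass

# Conventional TrueSkill defaults; assumed here, not confirmed by the project.
MU0 = 25.0          # prior mean
SIGMA0 = MU0 / 3    # prior standard deviation
BETA = MU0 / 6      # performance-noise scale


@dataclass(frozen=True)
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal CDF, via erfc for better accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def update(winner: Rating, loser: Rating, beta: float = BETA) -> tuple[Rating, Rating]:
    """Return updated (winner, loser) ratings after one decisive comparison."""
    c2 = 2.0 * beta ** 2 + winner.sigma ** 2 + loser.sigma ** 2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # mean-shift factor: large when the result is an upset
    w = v * (v + t)         # variance-shrink factor, between 0 and 1

    def shift(r: Rating, sign: float) -> Rating:
        mu = r.mu + sign * (r.sigma ** 2 / c) * v
        var = r.sigma ** 2 * (1.0 - (r.sigma ** 2 / c2) * w)
        return Rating(mu, math.sqrt(var))

    return shift(winner, +1.0), shift(loser, -1.0)


# Usage: a benchmark paper beats an AI-generated paper, both starting from the prior.
benchmark, ai_paper = update(winner=Rating(), loser=Rating())
```

Starting from equal priors, the winner's mean rises and the loser's falls by the same amount, and both uncertainties shrink. An upset (a lower-rated paper beating a higher-rated one) moves the means further than an expected result does.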
Distinctive contribution
Combines autonomous research generation with an explicit human-benchmark tournament; most catalog projects do one or the other. The TrueSkill ranking combined with the AER/AEJ:Policy benchmark set is a reusable evaluation harness, and everything (papers, code, data, failures) is published transparently. The project explicitly disclaims policy use: "None of the generated papers have been peer-reviewed and should not be used for evidence-based policy making."
Evaluation scores
| Dimension | Score (0–3) | Note |
|---|---|---|
| Lifecycle coverage | 3 | Eight stages from hypothesis through replication-verification; full end-to-end loop. |
| Autonomy level | 3 | Generates papers autonomously without per-step human approval. |
| Architectural transparency | 3 | Stated commitment: 'everything is public — papers, code, data, failures.' TrueSkill scoring + benchmark set are open. |
| Inputs supported | 2 | Policy-question inputs; integrates public-data sources; uses AER / AEJ:Policy corpus as benchmark. |
| Outputs / reproducibility | 3 | Replication-verification step: 'Checks whether code executes and outputs match.' Full artifact transparency by design. |
| Internal evaluation | 3 | Tournament evaluation against peer-reviewed human-benchmark papers from top economics journals — strongest internal-eval design in the catalog. |
| Openness | 3 | Fully open: papers, code, data, failures. |
| Maturity / traction | 1 | Active hosted project in 2026; adoption / external use signals not yet clear. |
| Cross-family policy | 1 | Judge is Gemini 3.1 Flash Lite, distinct from execution back-end — implicit cross-family setup. |
| Runtime assurance | 2 | TrueSkill tournament vs human benchmarks + replication-verification step (code-execution + output-match check). |
| Cross-platform portability | 1 | Hosted service; back-end choice is the maintainers', not user-facing. |
Scored on 2026-05-18. See the evaluation rubric.
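The replication-verification step scored above ("Checks whether code executes and outputs match") can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function name `verify_replication`, the use of SHA-256 digests as the matching criterion, and the file layout are all hypothetical.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_replication(
    script: Path,
    expected: dict[str, str],
    workdir: Path,
    timeout_s: int = 600,
) -> dict[str, bool]:
    """Run `script` inside `workdir`, then compare each output file to its expected digest.

    `expected` maps output file names (relative to `workdir`) to SHA-256 digests.
    The returned report has an "executes" entry plus one entry per expected file.
    """
    try:
        result = subprocess.run(
            [sys.executable, str(script.resolve())],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        executes = result.returncode == 0
    except subprocess.TimeoutExpired:
        executes = False  # treat a hung analysis as a failed run

    report = {"executes": executes}
    for name, digest in expected.items():
        out = workdir / name
        report[name] = out.is_file() and sha256_of(out) == digest
    return report
```

Byte-exact hashing is a deliberately strict choice for this sketch. Numerical output can differ in the last digits across platforms or library versions, so a practical check might instead compare parsed values within a tolerance.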
Tags
Pipeline stages: hypothesis-generation research-design data-acquisition data-analysis code-generation paper-drafting referee-simulation replication
Architectural features: multi-agent tool-use artifact-versioning debate-consensus
Inputs: policy-question
Outputs: paper code data tournament-ranking
Data sources: public-policy-data
Knowledge sources: aer aej-policy
Limitations
- Explicit author disclaimer that outputs should not be used for evidence-based policy making.
- Tournament judge (Gemini 3.1 Flash Lite) may itself be biased; validating the judge is an open meta-evaluation question.
- Economics-policy scope by design; portability to other empirical fields untested.
- Maintainer identity (Social Catalyst Lab) and funding model not publicly detailed.
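One way to probe the judge-bias limitation is to test a single, narrow kind of bias: sensitivity to the order in which two papers are presented. The sketch below asks a judge about each pair twice with the order swapped and reports how often the verdict flips. Nothing here reflects how the project evaluates its judge. The `Judge` signature, the "first"/"second" return convention, and both function names are hypothetical, and the judge is passed in as a plain callable so no model API is assumed.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge interface: given (first_text, second_text),
# return "first" or "second" to name the better paper.
Judge = Callable[[str, str], str]


def consistent_winner(judge: Judge, paper_a: str, paper_b: str) -> Optional[str]:
    """Judge the pair in both orders; return "a" or "b" only if the verdicts agree."""
    a_wins_forward = judge(paper_a, paper_b) == "first"     # paper_a shown first
    a_wins_backward = judge(paper_b, paper_a) == "second"   # paper_a shown second
    if a_wins_forward != a_wins_backward:
        return None  # the verdict changed with presentation order
    return "a" if a_wins_forward else "b"


def flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict flips when the order is swapped."""
    if not pairs:
        return 0.0
    flips = sum(consistent_winner(judge, a, b) is None for a, b in pairs)
    return flips / len(pairs)


# Usage with stub judges standing in for a real model call.
def always_first(first: str, second: str) -> str:
    return "first"  # maximally order-biased


def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"  # order-independent here


pairs = [("short draft", "a much longer benchmark paper"), ("abc", "de")]
order_biased_rate = flip_rate(always_first, pairs)
content_based_rate = flip_rate(prefers_longer, pairs)
```

A low flip rate rules out only order sensitivity; it says nothing about other possible biases, such as a preference for particular writing styles.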
Related projects in this catalog
Related references (literature catalog)
- Wu, J. et al. (2025). Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools [wu2025agenticreasoning]
- Filimonovic, D. et al. (2025). Can GenAI Improve Academic Performance? Evidence from the Social and Behavioral Sciences [filimonovic2025genai]