Evaluating RISE systems

How do we tell a good RISE system from a bad one? This is an open methodological question, and the literature reviewed in this knowledge base does not yet provide a settled answer.

This page is not the project-comparison rubric used to score catalog entries — that rubric is at projects/EVALUATION.md. Rather, this page sketches the broader methodological question of evaluating RISE systems as scholarly artifacts, of which the catalog's rubric is one concrete operationalization.

What is being evaluated?

A RISE system has at least three evaluable surfaces, and treatments of "AI scientist" evaluation in the literature often blur them:

(1) The pipeline itself, as an information system. Is it well-designed, modular, secure, reproducible, documented?
(2) The artifacts it produces: papers, code, figures, reviews. Are they correct, novel, well-argued, methodologically sound?
(3) The research it enables, that is, its counterfactual impact on the scholarly community. Does deploying the system shift what gets asked, what gets answered, what gets cited?

The catalog's rubric scores (1) directly (architectural transparency, openness, reproducibility) and (2) indirectly (via internal_evaluation). Surface (3), field-level effects, is currently out of scope for the rubric but is the subject of growing empirical attention (¹, ²).

Output-level evaluation

The classical evaluation surface: given a paper, code release, or review produced by a RISE pipeline, how good is it?

Sub-dimensions that recur in the literature:

Faithfulness — does the artifact's argument match what its cited evidence supports? See ³, ⁴.
Factual accuracy — are the empirical claims correct? See the hallucination survey ⁵.
Citation grounding — are cited references real, relevant, and correctly attributed? A core focus of paper-qa and open-scholar.
Methodological soundness — does the artifact respect the norms of its discipline (identification, pre-registration, ethical clearance)?
Novelty — is the contribution incremental, derivative, or genuinely new? A notoriously hard target — see the novelty skill family across the catalog.
Peer-review readiness — would the artifact survive review at a venue appropriate to its claims?
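
As one illustration of how a sub-dimension above might be operationalized, the sketch below covers only the narrowest part of citation grounding: whether every citation key used in an artifact resolves to an entry in a verified bibliography. It is a minimal sketch under stated assumptions, not the method of any catalog project. The `Reference` type, the `unresolved_citations` function, and the sample keys are all hypothetical, and the check says nothing about whether a resolved reference is relevant or correctly attributed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical record for one verified bibliography entry."""
    key: str
    title: str


def unresolved_citations(cited_keys: list[str], bibliography: list[Reference]) -> list[str]:
    """Return cited keys that have no matching bibliography entry.

    Keys are reported once each, in order of first appearance.
    """
    known = {ref.key for ref in bibliography}
    seen: set[str] = set()
    missing: list[str] = []
    for key in cited_keys:
        if key not in known and key not in seen:
            missing.append(key)
            seen.add(key)
    return missing


# Illustrative data only; these keys and titles are invented for the example.
bibliography = [
    Reference(key="ref-a", title="An example verified source"),
    Reference(key="ref-b", title="Another example verified source"),
]
cited = ["ref-a", "ref-x", "ref-b", "ref-x"]
missing = unresolved_citations(cited, bibliography)
print(missing)
```

A check like this can flag fabricated or mistyped references cheaply, but relevance and attribution still need a judgment about content, which is where the human-expert problem discussed next arises.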

A persistent open problem: the most reliable evaluator of these properties remains a human expert, which is exactly what RISE systems aim to economize on. The catalog's review-focused projects (ape, reviewer, marg) attempt automated approximations; their adequacy is itself a live empirical question.

Process-level evaluation

Distinct from the artifact, the process that produced it can be evaluated:

Determinism — given the same inputs and model, does the pipeline produce the same output?
Auditability — can intermediate artifacts (prompts, tool calls, decisions) be inspected post-hoc?
Reproducibility — can a third party re-run the pipeline and recover the published artifacts?
Failure modes — does the pipeline fail loudly (visible error) or silently (plausible-but-wrong output)? Silent failure is the greater concern for RISE systems, because a model's stated reasoning does not always reflect how it reached its output; see ⁶ on reasoning faithfulness.
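
The determinism property above lends itself to a simple automated check. The following is a minimal sketch, assuming the pipeline can be wrapped as a Python callable that maps an input dictionary to a text output. The names `check_determinism`, `digest`, and `toy_pipeline` are hypothetical and stand in for a real RISE pipeline. Comparing SHA-256 digests tests for byte-identical output, which is a strict criterion; a real evaluation might prefer a semantic comparison.

```python
import hashlib
import json
from typing import Callable


def digest(output: str) -> str:
    """Return the SHA-256 hex digest of a pipeline output."""
    return hashlib.sha256(output.encode("utf-8")).hexdigest()


def check_determinism(pipeline: Callable[[dict], str], inputs: dict, runs: int = 3) -> dict:
    """Run `pipeline` several times on identical inputs and compare digests.

    The returned record doubles as a small audit entry: it stores the
    serialized inputs and the digest of every run.
    """
    digests = [digest(pipeline(inputs)) for _ in range(runs)]
    return {
        "inputs": json.dumps(inputs, sort_keys=True),
        "digests": digests,
        "deterministic": len(set(digests)) == 1,
    }


def toy_pipeline(inputs: dict) -> str:
    # Hypothetical stand-in for a real pipeline; it is a pure function
    # of its inputs, so repeated runs give identical output.
    return f"Summary of {inputs['topic']} with seed {inputs['seed']}"


report = check_determinism(toy_pipeline, {"topic": "peer review", "seed": 0})
print(report["deterministic"])
```

Storing the per-run digests rather than only the boolean keeps the check auditable: a third party can compare their own digests against the recorded ones when attempting a reproduction.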

Field-level evaluation

The most ambitious — and most under-developed — evaluation surface. If RISE systems are deployed at scale, what are the field-level consequences?

Composition shifts. Does the kind of research that gets done change (¹, ⁷)?
Quality-quantity tradeoffs. Does more output mean lower marginal quality? Gartenberg et al. ² develop this argument for peer review specifically; the parallel question for publication is open.
Discipline effects. Does the IS or economics literature reorganize around what RISE makes cheap? Cf. ⁸, ⁹.
Epistemic trust. How do readers and reviewers calibrate confidence in artifacts of unclear human/agentic provenance? Related: ¹⁰, ¹¹.

Benchmarks vs. case studies

A current methodological tension: benchmark-driven evaluation (numerical scores on fixed tasks) is tractable and comparable across systems but under-captures what matters for scholarship (novelty, methodological soundness, peer-review readiness). Case studies (a system deployed on a real research project) demonstrate end-to-end fitness but resist generalization.

Two compromises are visible in the catalog:

Replication-task evaluations — using the reproduction of a published paper as a stand-in for end-to-end fitness. The social-science-replicability project pursues this directly.
Environment-based evaluations — aviary and similar provide standardized scientific-task environments that bridge benchmark comparability and task realism.
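
To make the replication-task idea concrete, the sketch below scores a reproduction by comparing reproduced estimates against published ones within a relative tolerance. It is an illustrative assumption about how such a comparison could be set up, not the procedure used by social-science-replicability or any other catalog project. The function name `compare_estimates`, the 1% default tolerance, and all numeric values are hypothetical.

```python
import math


def compare_estimates(
    published: dict[str, float],
    reproduced: dict[str, float],
    rel_tol: float = 0.01,
) -> dict[str, str]:
    """Label each published estimate as 'match', 'mismatch', or 'missing'.

    An estimate matches when the reproduced value is within `rel_tol`
    relative tolerance of the published value.
    """
    results: dict[str, str] = {}
    for name, target in published.items():
        if name not in reproduced:
            results[name] = "missing"
        elif math.isclose(reproduced[name], target, rel_tol=rel_tol):
            results[name] = "match"
        else:
            results[name] = "mismatch"
    return results


# Invented values for illustration; they come from no real paper.
published = {"beta": 0.50, "se": 0.10, "n": 1200.0}
reproduced = {"beta": 0.502, "se": 0.13}
results = compare_estimates(published, reproduced)
print(results)
```

The choice of tolerance is itself a methodological decision: a tight tolerance rewards computational reproduction, while a looser one admits replications that recover the same qualitative finding from different code or data.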

Neither is settled. A robust evaluation methodology for RISE remains one of the field's open problems — and a natural locus for the discipline's next contributions.

1. Filimonovic, D., Rutzer, C., & Wunsch, C. (2025). Can GenAI improve academic performance? Evidence from the social and behavioral sciences. https://arxiv.org/abs/2510.02408
2. Gartenberg, C., Murray, F., Hasan, S., & Pierce, L. (2026). More versus better: Artificial intelligence, incentives, and the emerging crisis in peer review. Organization Science, 37(3). https://doi.org/10.1287/orsc.2026.ed.v37.n3
3. Matton, K. et al. (2025). Walk the talk? Measuring the faithfulness of large language model explanations. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2504.14150
4. Maynez, J. et al. (2020). On faithfulness and factuality in abstractive summarization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.173
5. Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730
6. Chen, Y. et al. (2025). Reasoning models don’t always say what they think. https://arxiv.org/abs/2505.05410
7. Brodeur, A., Sung, S. Y., et al. (2025). Assessing reproducibility in economics using standardized crowd-sourced analysis (NBER Working Paper w33753). National Bureau of Economic Research. https://www.nber.org/papers/w33753
8. Gopal, R. D. et al. (2025). Inventing with machines: Generative AI and the evolving landscape of IS research. Information Systems Research, 36(4), 1949–1967. https://doi.org/10.1287/isre.2025.editorial.v36.n4
9. Abbasi, A. et al. (2026). ISR special issue: Generative AI and new methods of inquiry in information systems research. INFORMS Information Systems Research, Call for Papers. https://pubsonline.informs.org/page/isre/calls-for-papers
10. Peter, S., Riemer, K., & West, J. D. (2025). The benefits and dangers of anthropomorphic conversational agents. Proceedings of the National Academy of Sciences, 122(22), e2415898122. https://doi.org/10.1073/pnas.2415898122
11. Riemer, K., & Peter, S. (2024). Conceptualizing generative AI as style engines: Application archetypes and implications. International Journal of Information Management, 79, 102824. https://doi.org/10.1016/j.ijinfomgt.2024.102824