Human-in-the-Loop AI Reviewing: Feasibility, Opportunities, and Risks

Summary

The authors explore the feasibility, opportunities, and risks of using large language models to review academic submissions while keeping a human in the loop. They experiment with GPT-4 as a reviewer following a conference review form covering contribution, soundness, and presentation, and compare the LLM reviews with human reviews. They conclude that current AI-augmented reviewing is accurate enough to ease the reviewing burden, though only partially and not in all cases. They also identify risks including bias, value misalignment, and misuse.

Contribution

Demonstrates feasibility of LLM-augmented reviewing via a GPT-4 experiment, enumerates opportunities and open questions, identifies key risks (bias, value misalignment, misuse), and offers recommendations for managing those risks.

Method

Opinion piece with a small-scale demonstration experiment: GPT-4 used as a reviewer against a structured conference review form, with LLM reviews compared to human reviews.

Relevance to RISE

Designs an explicit human-in-the-loop architecture for AI-assisted peer review and analyzes its feasibility. Direct architectural reference for the catalog's review-focused projects, especially those carrying human-in-the-loop architectural tags.

Critique / open questions

The abstract reports a demonstration with GPT-4 only, without specifying sample size, venue diversity, or a quantitative agreement metric; conclusions about accuracy are necessarily limited.
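
To make the missing "quantitative agreement metric" concrete, here is a minimal sketch of one common choice, Cohen's kappa, computed over paired categorical scores. This is an illustration only: the abstract does not say which metric, if any, the authors used, and the function name `cohen_kappa` and the example scores below are hypothetical.

```python
from collections import Counter


def cohen_kappa(human_scores, llm_scores):
    """Chance-corrected agreement between two raters on the same items.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(human_scores) != len(llm_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human_scores)

    # Observed agreement: fraction of items where both raters gave the same score.
    observed = sum(h == m for h, m in zip(human_scores, llm_scores)) / n

    # Expected agreement by chance, from each rater's marginal score frequencies.
    human_counts = Counter(human_scores)
    llm_counts = Counter(llm_scores)
    labels = human_counts.keys() | llm_counts.keys()
    expected = sum(human_counts[k] * llm_counts[k] for k in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up soundness scores on a 1-4 scale for six submissions (not real data).
human = [3, 3, 2, 4, 1, 2]
llm = [3, 2, 2, 4, 1, 3]
kappa = cohen_kappa(human, llm)
```

Kappa treats scores as unordered categories, so a one-point disagreement counts the same as a three-point one. For ordinal review scores, a weighted kappa or a rank correlation may be a better fit, and any such metric is only informative when reported alongside the sample size and venue mix.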