result-to-claim¶
referee-simulationResult-to-Claim Gate¶
Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a Codex judgment, then auto-route based on the verdict.
Context: $ARGUMENTS¶
When to Use¶
- After a set of experiments completes (main results, not just sanity checks)
- Before committing to claims in a paper or review response
- When results are ambiguous and you need an objective second opinion
Workflow¶
Step 1: Collect Results¶
Gather experiment data from whatever sources are available in the project:
- W&B (preferred):
wandb.Api().run("<entity>/<project>/<run_id>").history()— metrics, training curves, comparisons - EXPERIMENT_LOG.md: full results table with baselines and verdicts
- EXPERIMENT_TRACKER.md: check which experiments are DONE vs still running
- Log files:
ssh server "tail -100 /path/to/training.log"if no other source - docs/research_contract.md: intended claims and experiment design
Assemble the key information: - What experiments were run (method, dataset, config) - Main metrics and baseline comparisons (deltas) - The intended claim these experiments were designed to test - Any known confounds or caveats
Step 2: Codex Judgment¶
Send the collected results to Codex for objective evaluation:
mcp__codex__codex:
config: {"model_reasoning_effort": "xhigh"}
prompt: |
RESULT-TO-CLAIM EVALUATION
I need you to judge whether experimental results support the intended claim.
Intended claim: [the claim these experiments test]
Experiments run:
[list experiments with method, dataset, metrics]
Results:
[paste key numbers, comparison deltas, significance]
Baselines:
[baseline numbers and sources — reproduced or from paper]
Known caveats:
[any confounding factors, limited datasets, missing comparisons]
Please evaluate:
1. claim_supported: yes | partial | no
2. what_results_support: what the data actually shows
3. what_results_dont_support: where the data falls short of the claim
4. missing_evidence: specific evidence gaps
5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
6. next_experiments_needed: specific experiments to fill gaps (if any)
7. confidence: high | medium | low
Be honest. Do not inflate claims beyond what the data supports.
A single positive result on one dataset does not support a general claim.
Step 3: Parse and Normalize¶
Extract structured fields from Codex response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
Step 3.5: Check Experiment Integrity (if audit exists)¶
Skip this step if EXPERIMENT_AUDIT.json does not exist.
if EXPERIMENT_AUDIT.json exists:
read integrity_status from file
attach to verdict output:
integrity_status: pass | warn | fail
if integrity_status == "fail":
append to verdict: "[INTEGRITY CONCERN] — audit found issues, see EXPERIMENT_AUDIT.md"
downgrade confidence to "low" regardless of Codex judgment
if integrity_status == "warn":
append to verdict: "[INTEGRITY: WARN] — audit flagged potential issues"
else:
integrity_status = "unavailable"
verdict is labeled "provisional — no integrity audit run"
(this does NOT block anything — pipeline continues normally)
See shared-references/experiment-integrity.md for the full integrity protocol.
Step 4: Route Based on Verdict¶
no — Claim not supported¶
- Record postmortem in findings.md (Research Findings section):
- What was tested, what failed, hypotheses for why
- Constraints for future attempts (what NOT to try again)
- Update CLAUDE.md Pipeline Status
- Decide whether to pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach
partial — Claim partially supported¶
- Update the working claim to reflect what IS supported
- Record the gap in findings.md
- Design and run supplementary experiments to fill evidence gaps
- Re-run result-to-claim after supplementary experiments complete
- Multiple rounds of
partialon the same claim → record analysis in findings.md, consider whether to narrow the claim scope or switch ideas
yes — Claim supported¶
- Record confirmed claim in project notes
- If ablation studies are incomplete → trigger
/ablation-planner - If all evidence is in → ready for paper writing
Step 5: Update Research Wiki (if active)¶
Skip this step entirely if research-wiki/ does not exist.
If research-wiki/ exists, resolve $WIKI_SCRIPT per the canonical
chain documented in
shared-references/wiki-helper-resolution.md
(Variant B — warn-and-skip for caller skills). The verdict / claim
status / idea-outcome page edits below run on raw markdown and don't
need the helper, but edges, query-pack rebuild, and the log line do.
cd "$(git rev-parse --show-toplevel 2>/dev/null || pwd)" || exit 1
ARIS_REPO="${ARIS_REPO:-$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills.txt 2>/dev/null)}"
WIKI_SCRIPT=".aris/tools/research_wiki.py"
[ -f "$WIKI_SCRIPT" ] || WIKI_SCRIPT="tools/research_wiki.py"
[ -f "$WIKI_SCRIPT" ] || { [ -n "${ARIS_REPO:-}" ] && WIKI_SCRIPT="$ARIS_REPO/tools/research_wiki.py"; }
[ -f "$WIKI_SCRIPT" ] || {
echo "WARN: research_wiki.py not found; verdict will be reported but wiki edges/query-pack/log will be skipped. Fix: bash tools/install_aris.sh, export ARIS_REPO, or cp <ARIS-repo>/tools/research_wiki.py tools/." >&2
WIKI_SCRIPT=""
}
if research-wiki/ exists:
# 1. Create experiment page
Create research-wiki/experiments/<exp_id>.md with:
- node_id: exp:<id>
- idea_id: idea:<active_idea>
- date, hardware, duration, metrics
- verdict, confidence, reasoning summary
# 2. Update claim status (page edits run unconditionally; edges only if $WIKI_SCRIPT resolved)
for each claim resolved by this verdict:
if verdict == "yes":
Update claim page: status → supported
[ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type supports --evidence "<metric>"
elif verdict == "partial":
Update claim page: status → partial
[ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type supports --evidence "partial"
else:
Update claim page: status → invalidated
[ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type invalidates --evidence "<why>"
# 3. Update idea outcome (raw markdown, helper-free)
Update research-wiki/ideas/<idea_id>.md:
- outcome: positive | mixed | negative
- If negative: fill "Failure / Risk Notes" and "Lessons Learned"
- If positive: fill "Actual Outcome" and "Reusable Components"
# 4. Rebuild + log (only if $WIKI_SCRIPT resolved)
[ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" rebuild_query_pack research-wiki/
[ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" log research-wiki/ "result-to-claim: exp:<id> verdict=<verdict> for idea:<idea_id>"
# 5. Re-ideation suggestion
Count failed/partial ideas since last /idea-creator run.
If >= 3: print "💡 3+ ideas tested since last ideation. Consider re-running /idea-creator — the wiki now knows what doesn't work."
Rules¶
- Codex is the judge, not CC. CC collects evidence and routes; Codex evaluates. This prevents post-hoc rationalization.
- Do not inflate claims beyond what the data supports. If Codex says "partial", do not round up to "yes".
- A single positive result on one dataset does not support a general claim. Be honest about scope.
- If
confidenceis low, treat the judgment as inconclusive and add experiments rather than committing to a claim. - If Codex MCP is unavailable (call fails), CC makes its own judgment and marks it
[pending Codex review]— do not block the pipeline. - Always record the verdict and reasoning in findings.md, regardless of outcome.
Review Tracing¶
After each mcp__codex__codex or mcp__codex__codex-reply reviewer call, save the trace following shared-references/review-tracing.md (Policy C — forensic; never silently skip). Use save_trace.sh (resolved per the chain in shared-references/integration-contract.md §2) or write files directly to .aris/traces/<skill>/<date>_run<NN>/. Respect the --- trace: parameter (default: full).