`result-to-claim`¶

Pack: ARIS skills

Category: audit

Field: —

License: MIT

Updated: 2026-05-18

Stages: referee-simulation


Result-to-Claim Gate¶

Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a Codex judgment, then auto-route based on the verdict.

Context: $ARGUMENTS¶

When to Use¶

After a set of experiments completes (main results, not just sanity checks)
Before committing to claims in a paper or review response
When results are ambiguous and you need an objective second opinion

Workflow¶

Step 1: Collect Results¶

Gather experiment data from whatever sources are available in the project:

W&B (preferred): wandb.Api().run("<entity>/<project>/<run_id>").history() — metrics, training curves, comparisons
EXPERIMENT_LOG.md: full results table with baselines and verdicts
EXPERIMENT_TRACKER.md: check which experiments are DONE vs still running
Log files: ssh server "tail -100 /path/to/training.log" if no other source
docs/research_contract.md: intended claims and experiment design

Assemble the key information:

- What experiments were run (method, dataset, config)
- Main metrics and baseline comparisons (deltas)
- The intended claim these experiments were designed to test
- Any known confounds or caveats
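The baseline comparison (deltas) can be computed mechanically before anything is sent for judgment. The following is a minimal sketch, not part of the skill itself: `delta_lines` and the plain metric dictionaries are hypothetical names, and it assumes each metric is a single float keyed by name.

```python
def delta_lines(results: dict[str, float], baselines: dict[str, float]) -> list[str]:
    """Format each metric next to its baseline, with the signed difference.

    `results` and `baselines` are hypothetical inputs: metric name -> value.
    """
    lines = []
    for name, value in sorted(results.items()):
        if name in baselines:
            base = baselines[name]
            lines.append(
                f"{name}: {value:.4f} vs baseline {base:.4f} (delta {value - base:+.4f})"
            )
        else:
            # A metric with no baseline is an evidence gap; surface it as a caveat.
            lines.append(f"{name}: {value:.4f} (no baseline available)")
    return lines


if __name__ == "__main__":
    for line in delta_lines({"acc": 0.82, "f1": 0.75}, {"acc": 0.80}):
        print(line)
```

Metrics that lack a baseline are kept in the output rather than dropped, so they can be listed under the known caveats.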

Step 2: Codex Judgment¶

Send the collected results to Codex for objective evaluation:

Text Only

mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources — reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.
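Filling the bracketed slots of the template above can be sketched as a small helper. This is an illustration under assumptions: `fill_evidence` is a hypothetical name, each slot is assumed to be a list of short strings, and the "Please evaluate" list and closing instructions are expected to be appended unchanged from the template.

```python
def fill_evidence(
    claim: str,
    experiments: list[str],
    results: list[str],
    baselines: list[str],
    caveats: list[str],
) -> str:
    """Render the evidence half of the RESULT-TO-CLAIM prompt."""

    def section(title: str, items: list[str]) -> str:
        # An empty section is stated explicitly rather than silently omitted.
        body = "\n".join(f"- {item}" for item in items) or "- (none reported)"
        return f"{title}:\n{body}"

    return "\n\n".join(
        [
            "RESULT-TO-CLAIM EVALUATION",
            "I need you to judge whether experimental results support the intended claim.",
            f"Intended claim: {claim}",
            section("Experiments run", experiments),
            section("Results", results),
            section("Baselines", baselines),
            section("Known caveats", caveats),
        ]
    )
```

Writing "(none reported)" for an empty section keeps a missing baseline or caveat visible to the judge instead of hiding it.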

Step 3: Parse and Normalize¶

Extract structured fields from Codex response:

Markdown

- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
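One way to extract and normalize these fields is a line-oriented parser. The sketch below assumes Codex answers with one `field: value` line per item (optionally numbered, bulleted, or bolded); the actual response format is not specified here, and `parse_verdict` is a hypothetical helper. Raising on an unrecognized verdict is a design choice of this sketch, so that an unparseable answer is never rounded up to "yes".

```python
import re

FIELDS = (
    "claim_supported",
    "what_results_support",
    "what_results_dont_support",
    "missing_evidence",
    "suggested_claim_revision",
    "next_experiments_needed",
    "confidence",
)
ENUMS = {
    "claim_supported": {"yes", "partial", "no"},
    "confidence": {"high", "medium", "low"},
}
# Optional list marker ("- ", "3. "), optional bold markers, field name, colon, value.
LINE = re.compile(
    r"^\s*(?:[-*]\s+|\d+[.)]\s*)?\**(" + "|".join(FIELDS) + r")\**\s*:\s*(.*)$",
    re.IGNORECASE,
)


def parse_verdict(text: str) -> dict[str, str]:
    """Return all seven fields; enum fields are lowercased and validated."""
    parsed = {name: "" for name in FIELDS}
    current = None
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            current = match.group(1).lower()
            parsed[current] = match.group(2)
        elif current and line.strip():
            # Continuation of a multi-line answer for the current field.
            parsed[current] += " " + line.strip()
    # Trim whitespace and stray quote, bold, or backtick characters.
    parsed = {k: v.strip().strip('"*`').strip() for k, v in parsed.items()}
    for name, allowed in ENUMS.items():
        token = re.match(r"[a-z]+", parsed[name].lower())
        if not token or token.group(0) not in allowed:
            raise ValueError(f"{name} missing or not one of {sorted(allowed)}")
        parsed[name] = token.group(0)
    return parsed
```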

Step 3.5: Check Experiment Integrity (if audit exists)¶

If EXPERIMENT_AUDIT.json does not exist, there is nothing to check: set integrity_status to "unavailable", label the verdict provisional, and continue.

Text Only

if EXPERIMENT_AUDIT.json exists:
    read integrity_status from file
    attach to verdict output:
        integrity_status: pass | warn | fail

    if integrity_status == "fail":
        append to verdict: "[INTEGRITY CONCERN] — audit found issues, see EXPERIMENT_AUDIT.md"
        downgrade confidence to "low" regardless of Codex judgment

    if integrity_status == "warn":
        append to verdict: "[INTEGRITY: WARN] — audit flagged potential issues"
else:
    integrity_status = "unavailable"
    verdict is labeled "provisional — no integrity audit run"
    (this does NOT block anything — pipeline continues normally)

See shared-references/experiment-integrity.md for the full integrity protocol.
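The integrity check above can be sketched in Python as follows. Two assumptions are made and should be verified against the integrity protocol: that `integrity_status` is a top-level key of EXPERIMENT_AUDIT.json, and that annotations are collected in a `flags` list (a hypothetical field; the flag strings are shortened from the messages above).

```python
import json
from pathlib import Path
from typing import Optional


def load_audit(path: str = "EXPERIMENT_AUDIT.json") -> Optional[dict]:
    """Return the parsed audit file, or None when no audit has been run."""
    file = Path(path)
    if not file.is_file():
        return None
    return json.loads(file.read_text(encoding="utf-8"))


def apply_integrity(verdict: dict, audit: Optional[dict]) -> dict:
    """Attach integrity_status to a copy of the verdict; never blocks the pipeline."""
    out = {**verdict, "flags": list(verdict.get("flags", []))}
    if audit is None:
        out["integrity_status"] = "unavailable"
        out["flags"].append("provisional: no integrity audit run")
        return out
    # Assumption: the status lives under a top-level "integrity_status" key.
    status = audit.get("integrity_status")
    out["integrity_status"] = status
    if status == "fail":
        out["flags"].append("[INTEGRITY CONCERN]")
        out["confidence"] = "low"  # downgrade regardless of the Codex judgment
    elif status == "warn":
        out["flags"].append("[INTEGRITY: WARN]")
    return out
```

The function returns a modified copy, so the original Codex judgment stays available for the record in findings.md.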

Step 4: Route Based on Verdict¶

`no` — Claim not supported¶

Record postmortem in findings.md (Research Findings section): what was tested, what failed, hypotheses for why, and constraints for future attempts (what NOT to try again)
Update CLAUDE.md Pipeline Status
Decide whether to pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach

`partial` — Claim partially supported¶

Update the working claim to reflect what IS supported
Record the gap in findings.md
Design and run supplementary experiments to fill evidence gaps
Re-run result-to-claim after supplementary experiments complete
Multiple rounds of partial on the same claim → record analysis in findings.md, consider whether to narrow the claim scope or switch ideas

`yes` — Claim supported¶

Record confirmed claim in project notes
If ablation studies are incomplete → trigger /ablation-planner
If all evidence is in → ready for paper writing
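The three routes can be summarized as a dispatch function. This is a sketch only: `route` and its `ablations_complete` flag are hypothetical, and the returned strings are shorthand for the actions listed above. It also folds in the rule that a low-confidence judgment is treated as inconclusive.

```python
def route(claim_supported: str, confidence: str, ablations_complete: bool = False) -> list[str]:
    """Map a normalized verdict to the follow-up actions for this gate."""
    if claim_supported not in {"yes", "partial", "no"}:
        raise ValueError(f"unknown verdict: {claim_supported!r}")
    actions = ["record verdict and reasoning in findings.md"]
    if confidence == "low":
        # Low confidence: add experiments rather than committing to a claim.
        return actions + ["design additional experiments", "re-run result-to-claim"]
    if claim_supported == "no":
        return actions + [
            "record postmortem in findings.md",
            "update CLAUDE.md Pipeline Status",
            "decide: pivot to next idea or try an alternative approach",
        ]
    if claim_supported == "partial":
        return actions + [
            "update working claim to what is supported",
            "record the gap in findings.md",
            "run supplementary experiments",
            "re-run result-to-claim",
        ]
    # claim_supported == "yes"
    follow_up = "ready for paper writing" if ablations_complete else "trigger /ablation-planner"
    return actions + ["record confirmed claim in project notes", follow_up]
```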

Step 5: Update Research Wiki (if active)¶

Skip this step entirely if research-wiki/ does not exist.

If research-wiki/ exists, resolve $WIKI_SCRIPT per the canonical chain documented in shared-references/wiki-helper-resolution.md (Variant B — warn-and-skip for caller skills). The verdict / claim status / idea-outcome page edits below run on raw markdown and don't need the helper, but edges, query-pack rebuild, and the log line do.

Bash

cd "$(git rev-parse --show-toplevel 2>/dev/null || pwd)" || exit 1
ARIS_REPO="${ARIS_REPO:-$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills.txt 2>/dev/null)}"
WIKI_SCRIPT=".aris/tools/research_wiki.py"
[ -f "$WIKI_SCRIPT" ] || WIKI_SCRIPT="tools/research_wiki.py"
[ -f "$WIKI_SCRIPT" ] || { [ -n "${ARIS_REPO:-}" ] && WIKI_SCRIPT="$ARIS_REPO/tools/research_wiki.py"; }
[ -f "$WIKI_SCRIPT" ] || {
  echo "WARN: research_wiki.py not found; verdict will be reported but wiki edges/query-pack/log will be skipped. Fix: bash tools/install_aris.sh, export ARIS_REPO, or cp <ARIS-repo>/tools/research_wiki.py tools/." >&2
  WIKI_SCRIPT=""
}

Text Only

if research-wiki/ exists:
    # 1. Create experiment page
    Create research-wiki/experiments/<exp_id>.md with:
      - node_id: exp:<id>
      - idea_id: idea:<active_idea>
      - date, hardware, duration, metrics
      - verdict, confidence, reasoning summary

    # 2. Update claim status (page edits run unconditionally; edges only if $WIKI_SCRIPT resolved)
    for each claim resolved by this verdict:
        if verdict == "yes":
            Update claim page: status → supported
            [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type supports --evidence "<metric>"
        elif verdict == "partial":
            Update claim page: status → partial
            [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type supports --evidence "partial"
        else:
            Update claim page: status → invalidated
            [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type invalidates --evidence "<why>"

    # 3. Update idea outcome (raw markdown, helper-free)
    Update research-wiki/ideas/<idea_id>.md:
      - outcome: positive | mixed | negative
      - If negative: fill "Failure / Risk Notes" and "Lessons Learned"
      - If positive: fill "Actual Outcome" and "Reusable Components"

    # 4. Rebuild + log (only if $WIKI_SCRIPT resolved)
    [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" rebuild_query_pack research-wiki/
    [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" log research-wiki/ "result-to-claim: exp:<id> verdict=<verdict> for idea:<idea_id>"

    # 5. Re-ideation suggestion
    Count failed/partial ideas since last /idea-creator run.
    If >= 3: print "💡 3+ ideas tested since last ideation. Consider re-running /idea-creator — the wiki now knows what doesn't work."
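The claim-status branch in part 2 above can be expressed as a lookup that yields the page status plus the `add_edge` command line. A minimal sketch: `claim_update` is a hypothetical helper, the flags are copied from the pseudocode above rather than checked against research_wiki.py, and unlike the pseudocode's catch-all `else` it rejects unknown verdicts so a typo cannot invalidate a claim.

```python
from typing import Optional

# verdict -> (claim page status, edge type), as in the pseudocode above
STATUS_AND_EDGE = {
    "yes": ("supported", "supports"),
    "partial": ("partial", "supports"),
    "no": ("invalidated", "invalidates"),
}


def claim_update(
    verdict: str, exp_id: str, claim_id: str, evidence: str, wiki_script: str
) -> tuple[str, Optional[list[str]]]:
    """Return the new claim status and the add_edge argv (None if no helper)."""
    if verdict not in STATUS_AND_EDGE:
        raise ValueError(f"unknown verdict: {verdict!r}")
    status, edge_type = STATUS_AND_EDGE[verdict]
    if not wiki_script:
        # Helper unresolved: the page edit still happens, the edge is skipped.
        return status, None
    if verdict == "partial":
        evidence = "partial"  # the pseudocode uses a fixed evidence string here
    argv = [
        "python3", wiki_script, "add_edge", "research-wiki/",
        "--from", f"exp:{exp_id}",
        "--to", f"claim:{claim_id}",
        "--type", edge_type,
        "--evidence", evidence,
    ]
    return status, argv
```

Returning an argument list rather than a shell string means the evidence text needs no quoting when the list is passed to `subprocess.run`.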

Rules¶

Codex is the judge, not CC (Claude Code). CC collects evidence and routes; Codex evaluates. This prevents post-hoc rationalization.
Do not inflate claims beyond what the data supports. If Codex says "partial", do not round up to "yes".
A single positive result on one dataset does not support a general claim. Be honest about scope.
If confidence is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
If Codex MCP is unavailable (call fails), CC makes its own judgment and marks it [pending Codex review] — do not block the pipeline.
Always record the verdict and reasoning in findings.md, regardless of outcome.
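The fallback rule for an unavailable Codex MCP can be sketched as a wrapper. Both callables are hypothetical stand-ins: `call_codex` for whatever issues the mcp__codex__codex call, and `local_judge` for the caller's own judgment.

```python
from typing import Callable


def judge_with_fallback(
    prompt: str,
    call_codex: Callable[[str], str],
    local_judge: Callable[[str], str],
) -> dict[str, str]:
    """Prefer the Codex judgment; fall back locally without blocking the pipeline."""
    try:
        return {"judged_by": "codex", "response": call_codex(prompt), "tag": ""}
    except Exception as exc:  # any call failure counts as "unavailable"
        return {
            "judged_by": "cc",
            "response": local_judge(prompt),
            "tag": "[pending Codex review]",
            "error": str(exc),
        }
```

Keeping the tag and the error text in the result lets the fallback verdict be recorded in findings.md and revisited once Codex is reachable again.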

Review Tracing¶

After each mcp__codex__codex or mcp__codex__codex-reply reviewer call, save the trace following shared-references/review-tracing.md (Policy C — forensic; never silently skip). Use save_trace.sh (resolved per the chain in shared-references/integration-contract.md §2) or write files directly to .aris/traces/<skill>/<date>_run<NN>/. Respect the --- trace: parameter (default: full).
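When writing trace files directly, the run directory has to be chosen so that earlier runs are not overwritten. The sketch below picks the next `<date>_run<NN>` directory; it assumes an ISO-style date string, two-digit zero-padded run numbers starting at 01, and a hypothetical helper name `next_trace_dir`. The authoritative layout is the one in shared-references/review-tracing.md.

```python
import re
from pathlib import Path


def next_trace_dir(skill: str, date: str, root: str = ".aris/traces") -> Path:
    """Return the first unused <root>/<skill>/<date>_run<NN> path (not created)."""
    base = Path(root) / skill
    pattern = re.compile(rf"{re.escape(date)}_run(\d+)")
    used = []
    if base.is_dir():
        for entry in base.iterdir():
            match = pattern.fullmatch(entry.name)
            if match:
                used.append(int(match.group(1)))
    # Number after the highest existing run so gaps never cause a collision.
    run = max(used, default=0) + 1
    return base / f"{date}_run{run:02d}"


if __name__ == "__main__":
    print(next_trace_dir("result-to-claim", "2026-05-18"))
```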
