/audit-reproducibility
Tags: replication, MIT
Audit Reproducibility
Compare numeric claims in a manuscript (point estimates, standard errors, p-values, counts) against the actual outputs produced by the analysis pipeline. Report PASS / FAIL per claim against the tolerance thresholds defined in .claude/rules/replication-protocol.md.
Core principle: If the paper says ATT = -1.632 (0.584) and the code produces -1.628 (0.591), we verify — numerically — that the difference is within the documented tolerance. No more "looks close enough" eyeballing.
When to use
- Before submission. Catches the "I updated the analysis but forgot to update Table 2" bug.
- Before releasing a replication package. Verifies the code actually reproduces the paper.
- After a major revision. Ensures the paper still matches the latest code.
- Quality gate in `/commit`. Pair with a pre-commit invocation on manuscript + analysis changes.
Inputs
- `$0`: path to the manuscript (`.tex`, `.qmd`, `.md`, `.pdf`). Required.
- `$1`: path to the outputs directory. Defaults to `scripts/R/_outputs/`. Can be `_targets/objects/`, a Stata `.do`-file log directory, etc.
Workflow
Phase 0: Pre-flight
- Read `replication-protocol.md` for the tolerance thresholds currently in effect.
- Verify the outputs directory exists and is non-empty. If it is empty or stale (older than the manuscript), prompt the user to re-run their pipeline (e.g., `Rscript scripts/R/00_run_all.R`) before auditing.
- Ensure a `sessionInfo.txt` or equivalent environment capture exists in the outputs directory.
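The pre-flight checks above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `preflight` is hypothetical, and "stale" is assumed to mean that the newest file in the outputs directory is older than the manuscript.

```python
from pathlib import Path


def preflight(manuscript: Path, outputs_dir: Path) -> list[str]:
    """Return a list of pre-flight problems; an empty list means OK."""
    problems = []
    # Collect every regular file under the outputs directory (if it exists).
    files = [p for p in outputs_dir.rglob("*") if p.is_file()] if outputs_dir.is_dir() else []
    if not files:
        problems.append("outputs directory is missing or empty")
        return problems
    # Assumption: outputs are stale when even the newest one predates the manuscript.
    newest_output = max(p.stat().st_mtime for p in files)
    if newest_output < manuscript.stat().st_mtime:
        problems.append("outputs are older than the manuscript (stale)")
    if not (outputs_dir / "sessionInfo.txt").exists():
        problems.append("no sessionInfo.txt environment capture found")
    return problems
```

Modification times are only a heuristic; a pipeline tool that tracks dependencies would give a stronger staleness signal.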
Phase 1: Extract claims from the manuscript
Parse the manuscript for numeric claims. Patterns to match:
- Point estimate + SE: `ATT = -1.632 (0.584)`, `$\beta = 0.342$ (0.091)`, `\hat{\tau} = 1.28**` (with starred significance)
- Table cells: `& -1.632$^{***}$ & 0.584 &` in LaTeX table environments
- Counts: `our sample of 2,847 firms`, `$N = 2{,}847$`
- Summary stats: `mean = 0.423`, `SD = 0.087`
- P-values: `p < 0.01`, `$p = 0.003$`
Record each claim as a structured record:
{
claim_id: "Table2_col3_ATT",
location: "Table 2, Column 3, row 'Treatment'",
kind: "point_estimate" | "standard_error" | "p_value" | "count" | "percentage",
reported_value: -1.632,
uncertainty: 0.584, # only for point estimates
significance_stars: 3, # 0-3 or None
raw_context: "the ATT estimate of -1.632 (0.584) indicates..."
}
Write the extracted claims to quality_reports/reproducibility_claims_[manuscript-name].json so the user can review the extraction before audit.
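As an illustration of the simplest Phase 1 pattern, the sketch below matches only the plain-text `label = estimate (SE)` form. The regular expression, the function name `extract_estimates`, and the `label` field are assumptions made for this example; LaTeX math, starred values, and prose forms such as "the ATT estimate of -1.632 (0.584)" would each need their own pattern.

```python
import re

# Matches e.g. "ATT = -1.632 (0.584)": a label, "=", an estimate, and an SE in parentheses.
ESTIMATE_SE = re.compile(
    r"(?P<label>[A-Za-z_]\w*)\s*=\s*"
    r"(?P<estimate>-?\d+(?:\.\d+)?)\s*"
    r"\((?P<se>\d+(?:\.\d+)?)\)"
)


def extract_estimates(text: str) -> list[dict]:
    """Return one claim record per 'label = estimate (SE)' occurrence."""
    claims = []
    for m in ESTIMATE_SE.finditer(text):
        start = max(m.start() - 40, 0)  # keep a little leading context
        claims.append({
            "kind": "point_estimate",
            "label": m.group("label"),  # hypothetical field, used later for matching
            "reported_value": float(m.group("estimate")),
            "uncertainty": float(m.group("se")),
            "raw_context": text[start:m.end()],
        })
    return claims
```

Calling `extract_estimates("We find ATT = -1.632 (0.584) in the main sample.")` yields a single record with `reported_value` -1.632 and `uncertainty` 0.584.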
Phase 2: Extract results from outputs
Scan `$1` for corresponding values. Priority order:
1. `.rds` files: `readRDS(path)$coef[["treatment"]]`-style lookups. Can use `Rscript -e "saveRDS(summary(readRDS(...)), '/tmp/audit.rds')"` to extract.
2. `.tex` tables: parse LaTeX table cells directly; match on column headers + row labels.
3. `.csv` summary files: pandas/readr parse, key-value lookup.
4. `.out` / `.log` files (Stata, regress output): regex extraction.
5. `.json`: direct key lookup.
Record each extracted result:
{
source: "scripts/R/_outputs/results.rds",
lookup_key: "fit_main$coefficients['treated']",
value: -1.628,
uncertainty: 0.591,
p_value: 0.005
}
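For the `.tex` table source in the priority list, cell parsing might look like the sketch below. It is a simplified illustration: the function name `parse_latex_row` is hypothetical, it assumes significance stars are written as `$^{***}$`, and it splits on every `&`, so rows containing an escaped `\&` would need extra handling.

```python
import re

# Significance stars written as a LaTeX superscript, e.g. $^{***}$ (assumed convention).
STARS = re.compile(r"\$\^\{(\*+)\}\$")


def parse_latex_row(row: str) -> list[dict]:
    """Split one LaTeX table row into numeric cells with their star counts."""
    row = row.strip().removesuffix(r"\\")  # drop the trailing row terminator
    cells = []
    for raw in row.split("&"):
        raw = raw.strip()
        star_match = STARS.search(raw)
        stars = len(star_match.group(1)) if star_match else 0
        # Remove the star markup and any parentheses around standard errors.
        number = STARS.sub("", raw).strip("() \t")
        try:
            value = float(number.replace("{,}", "").replace(",", ""))
        except ValueError:
            continue  # row label or empty cell, not a number
        cells.append({"value": value, "significance_stars": stars})
    return cells
```

Matching each parsed cell back to its column header and row label is left out here; that step depends on the table layout.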
Phase 3: Match claims to results
Use fuzzy heuristics when exact labels don't match:
- Name similarity (`"treatment effect"` ~ `"ATT"` ~ `"treated"`)
- Magnitude similarity (if two candidates have values within 10% of the reported value, prefer the one with the closer SE)
- Context hints from the claim's `raw_context` field (table number, row label, description)
For every claim, produce a match candidate with a confidence score. Claims below 0.7 confidence get flagged as "UNMATCHED — manual review needed" rather than silently passing.
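One way to turn these heuristics into a confidence score is sketched below. Everything in it is an assumption for illustration: the synonym table, the 0.7 / 0.3 weighting, the `label` keys, and the linear magnitude term are not specified by the skill, and the SE tie-break and context hints are omitted.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups; a real audit would let the user extend these.
SYNONYMS = [{"att", "treatment effect", "treated", "treatment"}]


def name_similarity(a: str, b: str) -> float:
    """Return 1.0 for known synonyms, else a string-similarity ratio in [0, 1]."""
    a, b = a.lower().strip(), b.lower().strip()
    if any(a in group and b in group for group in SYNONYMS):
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def match_confidence(claim: dict, result: dict) -> float:
    """Blend name similarity with closeness of value; both terms lie in [0, 1]."""
    name = name_similarity(claim["label"], result["label"])
    scale = max(abs(claim["reported_value"]), 1e-12)
    rel_diff = abs(claim["reported_value"] - result["value"]) / scale
    magnitude = max(0.0, 1.0 - rel_diff / 0.10)  # reaches 0 at a 10% difference
    return 0.7 * name + 0.3 * magnitude


def best_match(claim: dict, results: list[dict], threshold: float = 0.7):
    """Return (result, score), or (None, score) when the claim is UNMATCHED."""
    if not results:
        return None, 0.0
    score, result = max(
        ((match_confidence(claim, r), r) for r in results),
        key=lambda pair: pair[0],
    )
    return (result, score) if score >= threshold else (None, score)
```

The name term is weighted so that an exact label or synonym match reaches the threshold on its own. Otherwise a claim whose value has drifted far from the computed one would be reported as UNMATCHED instead of reaching the tolerance check and failing there.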
Phase 4: Tolerance check
For each matched claim, apply the thresholds from replication-protocol.md:
| Kind | Tolerance | Example |
|---|---|---|
| Integers (N, counts) | Exact | 2,847 must equal 2,847 |
| Point estimates | abs(reported - computed) < 0.01 | -1.632 vs -1.628 → diff = 0.004 → PASS |
| Standard errors | abs(reported - computed) < 0.05 | 0.584 vs 0.591 → diff = 0.007 → PASS |
| P-values | Same significance level | p<0.01 and p<0.01 → PASS; p<0.01 and p=0.03 → FAIL |
| Percentages | ±0.1pp | 42.3% vs 42.35% → PASS |
Respect any tolerance overrides the user has written into their replication-protocol.md fork (they may loosen for MC noise or tighten for administrative data).
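The default thresholds in the table can be expressed as a small check function. This is a sketch under stated assumptions: the names `check` and `star_level` are hypothetical, the significance cutoffs are assumed to be 0.10 / 0.05 / 0.01, p-values are passed as numbers (a reported bound such as "p < 0.01" would first have to be mapped to its level), and user overrides from replication-protocol.md are not handled.

```python
def star_level(p: float) -> int:
    """Map a p-value to 0-3 stars (assumed cutoffs: 0.10, 0.05, 0.01)."""
    return sum(p < cutoff for cutoff in (0.10, 0.05, 0.01))


def check(kind: str, reported: float, computed: float) -> str:
    """Return 'PASS' or 'FAIL' for one matched claim, using the default tolerances."""
    if kind == "count":
        ok = int(reported) == int(computed)  # integers must match exactly
    elif kind == "point_estimate":
        ok = abs(reported - computed) < 0.01
    elif kind == "standard_error":
        ok = abs(reported - computed) < 0.05
    elif kind == "p_value":
        ok = star_level(reported) == star_level(computed)
    elif kind == "percentage":
        ok = abs(reported - computed) <= 0.1  # values in percentage points
    else:
        raise ValueError(f"unknown claim kind: {kind}")
    return "PASS" if ok else "FAIL"
```

With the document's running example, `check("point_estimate", -1.632, -1.628)` and `check("standard_error", 0.584, 0.591)` both return `"PASS"`. Differences that land exactly on a threshold are sensitive to floating-point rounding, so a stricter implementation might compare `decimal.Decimal` values parsed from the original strings.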
Phase 5: Report
Write quality_reports/reproducibility_audit_[manuscript-name].md:
## Reproducibility Audit: [Manuscript Title]
**Date:** [YYYY-MM-DD]
**Manuscript:** [path]
**Outputs directory:** [path]
**Tolerance source:** .claude/rules/replication-protocol.md
### Summary
| Status | Count |
|---|---|
| PASS | N |
| FAIL (diff > tolerance) | M |
| UNMATCHED (manual review) | K |
| **Overall verdict** | **PASS / FAIL** |
### PASS (all within tolerance)
| Claim | Reported | Computed | Diff | Tolerance |
|---|---|---|---|---|
| Table2_col3_ATT | -1.632 (0.584) | -1.628 (0.591) | 0.004 / 0.007 | 0.01 / 0.05 |
### FAIL (outside tolerance — BLOCKER)
| Claim | Reported | Computed | Diff | Tolerance | Location in paper |
|---|---|---|---|---|---|
### UNMATCHED (manual review)
| Claim | Raw context | Candidate sources |
|---|---|---|
### Environment
[sessionInfo excerpt]
### Next steps
1. Fix any FAIL rows — either update the manuscript or rerun analysis.
2. Review UNMATCHED rows — add explicit lookup keys or widen the search scope.
3. After zero FAILs, the paper is replication-ready.
Exit behavior
- All PASS: exit 0, summary printed.
- Any FAIL: exit 1, summary printed to stderr. This makes the skill usable as a `/commit` pre-commit gate; see `replication-protocol.md` for the enforcement pattern.
- UNMATCHED > 0 (with 0 FAIL): exit 0 with a warning; the user must manually review.
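The exit rules above amount to a short mapping from per-claim statuses to a process exit code. The sketch below is illustrative: the function name `exit_code` and the summary format are assumptions, and the warning is sent to stdout here only because the stream for warnings is not specified.

```python
import sys


def exit_code(statuses: list[str]) -> int:
    """Map per-claim statuses ('PASS', 'FAIL', 'UNMATCHED') to an exit code."""
    counts = {s: statuses.count(s) for s in ("PASS", "FAIL", "UNMATCHED")}
    summary = ", ".join(f"{k}={v}" for k, v in counts.items())
    if counts["FAIL"] > 0:
        print(summary, file=sys.stderr)  # any FAIL blocks: exit 1, summary on stderr
        return 1
    if counts["UNMATCHED"] > 0:
        print(f"WARNING: manual review needed ({summary})")
        return 0
    print(summary)
    return 0
```

A wrapper script would end with `sys.exit(exit_code(statuses))` so that a pre-commit hook can act on the result.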
Cross-references
- `.claude/rules/replication-protocol.md`: the tolerance contract.
- `.claude/skills/review-r/SKILL.md`: catches code-style issues; this skill catches NUMERICAL reproducibility.
- `.claude/skills/review-paper/SKILL.md`: content review; pair with this skill for a full pre-submission audit.
What this skill does NOT do
- Re-run your analysis. The skill compares CURRENT outputs against manuscript claims. If the outputs are stale, re-run your pipeline first (the pre-flight phase will warn).
- Catch wrong specifications. A regression that runs cleanly and produces a reproducible `-1.632` is reproducible. Whether `-1.632` is the RIGHT estimand is a `review-paper` / domain-reviewer question.
- Check external package versions. The `sessionInfo.txt` capture lets a reviewer see the environment; pinning versions is on the user (via `renv.lock` or a `DESCRIPTION` file).
Long batch reruns: use the Monitor tool (Apr 2026)
When /audit-reproducibility is asked to verify all numeric claims in a paper, the safest approach is to re-run the full pipeline (00_run_all.R or equivalent) and compare the regenerated outputs to the manuscript values. For pipelines that take more than a couple of minutes, background-launch the rerun and use Anthropic's Monitor tool (Apr 2026 Week 15) to stream stdout. The audit can react to errors mid-stream rather than waiting for the entire pipeline to finish before noticing a failed step.
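The idea of reacting to errors mid-stream can be illustrated in plain Python. This sketch is not the Monitor tool and does not use its interface: it only shows the general pattern with the standard `subprocess` module. The function names and the error markers (typical R error text) are assumptions.

```python
import subprocess
from typing import Iterable, Optional

# Assumed markers of a failed R step; adjust for Stata or other pipelines.
ERROR_MARKERS = ("Error", "Execution halted")


def first_error(lines: Iterable[str]) -> Optional[str]:
    """Return the first line that looks like an error, reading lazily."""
    for line in lines:
        if any(marker in line for marker in ERROR_MARKERS):
            return line.rstrip("\n")
    return None


def run_and_watch(cmd: list[str]) -> Optional[str]:
    """Launch the pipeline and stop as soon as an error line appears."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    assert proc.stdout is not None
    error = first_error(proc.stdout)  # consumes output line by line as it arrives
    if error is not None:
        proc.terminate()  # no need to wait for the remaining steps
    proc.wait()
    return error
```

For example, `run_and_watch(["Rscript", "scripts/R/00_run_all.R"])` would return `None` on a clean run, or the first error line otherwise.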