`comparison-evaluation`¶

Pack: 100xOS shared skills

Category: replication

Field: economics

License: private (curator-owned)

Updated: 2026-05-20

Stages: replication

Curator-private skill — copy text from 100xOS/shared/skills/replication/comparison-evaluation.md.

Skill: Comparison & Evaluation¶

You are evaluating whether a replication successfully reproduces the original results.

Match Quality Categories¶

Direction and Significance Match¶

Same sign of coefficient AND same significance level
This is a "successful replication" even if magnitudes differ
Deviation in magnitude is expected with different data/period

Direction Only Match¶

Same sign but different significance (e.g., *** becomes * or n.s.)
This is a "partial replication" — the effect exists but is weaker
Investigate: smaller sample? less variation? different period?

No Match¶

Different sign OR completely insignificant when original was highly significant
This is a "failed replication" — requires careful analysis
Do NOT immediately blame the replication — the original might be fragile
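The three categories above can be sketched as a small classifier for one coefficient. This is a minimal illustration, not a prescribed implementation: the `Estimate` class, the function names, and the star cutoffs (1%, 5%, 10%) are assumptions that do not come from the skill text. The sketch follows the Direction Only wording, so a same-sign estimate that loses significance is labeled "direction only match"; deciding whether it is "completely insignificant" enough to count as No Match is left as a judgment call.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Hypothetical container for one reported coefficient."""
    coef: float     # point estimate
    p_value: float  # two-sided p-value


def stars(p_value: float) -> str:
    """Map a p-value to significance stars (assumed cutoffs: 1%, 5%, 10%)."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return "n.s."


def match_quality(original: Estimate, replication: Estimate) -> str:
    """Label a replicated coefficient with one of the three match categories."""
    # A zero coefficient has no sign, so it is treated as not matching.
    same_sign = original.coef * replication.coef > 0
    if not same_sign:
        return "no match"
    if stars(original.p_value) == stars(replication.p_value):
        return "direction and significance match"
    return "direction only match"
```

For example, an original estimate of 0.5 (p = 0.001) and a replication of 0.3 (p = 0.08) share a sign but not a significance level, so the sketch labels the pair "direction only match".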

Deviation Analysis Framework¶

Expected Deviations (planned)¶

Different sample period → expect magnitude differences
Different winsorization → affects outlier-sensitive estimates
Different data source → expect level differences, same patterns
Package differences → minor numerical differences (< 1%)

Unexpected Deviations (investigate)¶

Sign flips → check variable construction, coding errors
Large magnitude differences (> 50%) → check sample selection
Significance changes → check SE computation, clustering
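The magnitude thresholds above (under 1% for package differences, over 50% for a large difference) can be turned into a quick triage helper. The function name and the "moderate" bucket for everything between 1% and 50% are assumptions added for illustration; the skill text does not name that middle range. Relative difference is measured against the original estimate.

```python
def flag_deviation(original: float, replication: float) -> str:
    """Triage the gap between an original and a replicated coefficient."""
    if original == 0:
        raise ValueError("relative deviation is undefined for a zero original estimate")
    # Opposite signs are flagged first, regardless of magnitude.
    if original * replication < 0:
        return "sign flip: check variable construction and coding errors"
    rel_diff = abs(replication - original) / abs(original)
    if rel_diff > 0.50:
        return "large magnitude difference: check sample selection"
    if rel_diff < 0.01:
        return "minor numerical difference: consistent with package differences"
    # Assumed middle bucket: not named in the skill text.
    return "moderate difference: compare against the planned deviations"
```

Significance changes are not covered here because they depend on standard errors and clustering choices rather than on the point estimates alone.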

Assumption Testing¶

For each key assumption:
1. Can it be tested? (parallel trends → yes; exclusion restriction → usually no)
2. Was it tested in the original? How?
3. Does it hold in the replication data?
4. If violated, what does that imply for the results?
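As one illustration of step 3 for a testable assumption, the sketch below runs an informal pre-trend check for parallel trends. It is a descriptive diagnostic under stated assumptions, not a formal test: the function name and inputs are hypothetical, the inputs are group means of the outcome for equally spaced pre-treatment periods, and the check fits an ordinary least squares slope to the treated-minus-control gap. A slope near zero is consistent with parallel pre-trends; what counts as "near zero" is left to the evaluator.

```python
def pre_trend_gap_slope(treated_means: list[float], control_means: list[float]) -> float:
    """OLS slope of the treated-minus-control gap across pre-treatment periods."""
    if len(treated_means) != len(control_means):
        raise ValueError("both groups need the same number of pre-treatment periods")
    gaps = [t - c for t, c in zip(treated_means, control_means)]
    n = len(gaps)
    if n < 2:
        raise ValueError("need at least two pre-treatment periods")
    periods = list(range(n))  # assumes equally spaced periods 0, 1, 2, ...
    mean_period = sum(periods) / n
    mean_gap = sum(gaps) / n
    sxx = sum((p - mean_period) ** 2 for p in periods)
    sxy = sum((p - mean_period) * (g - mean_gap) for p, g in zip(periods, gaps))
    return sxy / sxx
```

A constant gap between the groups gives a slope of zero, while a gap that widens before treatment gives a positive slope and suggests the assumption deserves a closer look in the replication data.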

Overall Assessment Criteria¶

Successful: >80% of primary results replicate in direction and significance
Partial: 50-80% of primary results replicate, or all replicate in direction only
Failed: <50% of primary results replicate
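The criteria above can be applied mechanically once each primary result carries a match label. The sketch below is a minimal illustration with a hypothetical function name. It assumes the labels use the category names from Match Quality Categories, that exactly 80% counts as partial (the text says ">80%" for successful), and that "all replicate in direction only" means every primary result matches at least in direction.

```python
def overall_assessment(labels: list[str]) -> str:
    """Aggregate per-result match labels into an overall verdict."""
    if not labels:
        raise ValueError("need at least one primary result")
    n = len(labels)
    full = sum(label == "direction and significance match" for label in labels)
    direction_only = sum(label == "direction only match" for label in labels)
    # Integer arithmetic avoids floating-point surprises at the 80% and 50% boundaries.
    if 100 * full > 80 * n:
        return "successful"
    # Assumed reading: every result matches in direction, with or without significance.
    all_match_direction = full + direction_only == n
    if 100 * full >= 50 * n or all_match_direction:
        return "partial"
    return "failed"
```

With five primary results, four full matches give exactly 80% and therefore "partial" under these assumptions; a fifth full match is needed for "successful".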

Robustness Assessment of Original¶

After all comparisons:
- Are results robust to different data? → Strong external validity
- Are results robust to different specifications? → Strong internal validity
- Do results depend on specific coding choices? → Fragile
