`technical-review`¶

Pack: 100xOS shared skills

Category: review

Field: economics

License: private (curator-owned)

Updated: 2026-05-20

Stages: referee-simulation

Curator-private skill — copy text from 100xOS/shared/skills/review/technical-review.md.


Technical Review¶

Purpose¶

This skill describes how to conduct a technical/methodological review of a research paper, focusing on the internal coherence of the data-to-results pipeline. Unlike a referee report (which evaluates contribution and positioning), the technical review asks: does the implementation actually do what the paper claims it does?

Step 1: Data Pipeline Verification¶

Trace the data from source to estimation sample:

SQL / data construction¶

Read the data queries. Do they construct the sample described in the paper?
Are exclusion criteria in the code the same as those described in the text?
Are variable definitions in the code consistent with the data description?
Does the time period in the query match the stated time period?

Sample construction¶

How are missing values handled? Is it documented?
Are there implicit filters (e.g., inner joins that drop observations)?
Is the unit of observation what the paper claims? (e.g., firm-quarter, address-day, protocol-week)
Are there duplicates? Is deduplication documented?
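
The implicit-filter and duplicate checks above can be sketched in a few lines. This is a minimal illustration, assuming the data can be loaded as a list of dictionaries; the function names, the `firm`/`quarter` columns, and the sample rows are hypothetical and not part of any reviewed pipeline.

```python
from collections import Counter

def join_report(left, right, key):
    """Count how many rows of `left` an inner join on `key` would silently drop."""
    right_keys = {row[key] for row in right}
    dropped = sum(1 for row in left if row[key] not in right_keys)
    return {"left_rows": len(left), "dropped": dropped}

def duplicate_keys(rows, unit):
    """Return unit-of-observation keys (e.g. firm-quarter) that appear more than once."""
    counts = Counter(tuple(row[col] for col in unit) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical firm-quarter panel and a lookup table that is missing firm "C".
panel = [
    {"firm": "A", "quarter": "2020Q1"},
    {"firm": "A", "quarter": "2020Q1"},  # duplicated firm-quarter
    {"firm": "B", "quarter": "2020Q1"},
    {"firm": "C", "quarter": "2020Q1"},  # would vanish in an inner join
]
lookup = [{"firm": "A"}, {"firm": "B"}]

report = join_report(panel, lookup, "firm")
dups = duplicate_keys(panel, ["firm", "quarter"])
print(report)
print(dups)
```

If `dropped` is nonzero and the paper does not mention the restriction, or if `dups` is non-empty for the claimed unit of observation, note it in the Data Pipeline section of the review.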

Variable definitions¶

Are continuous variables winsorized or trimmed? At what level?
Are categorical variables grouped appropriately?
Are log transformations applied where claimed? What about zeros?
Do variable names in the code match the variable names in the paper?
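
To make the winsorization and log-of-zero questions concrete, here is a small sketch. It assumes one particular winsorizing convention (replace the `k = int(fraction * n)` most extreme values in each tail with the nearest retained value); the paper's code may use a different quantile rule, which is itself worth checking. All names and numbers are illustrative.

```python
import math

def winsorize(values, fraction=0.01):
    """Clamp the int(fraction * n) most extreme values in each tail.

    Assumed convention: tail values are replaced by the nearest retained value.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(fraction * n)
    low, high = ordered[k], ordered[n - 1 - k]
    return [min(max(v, low), high) for v in values]

def log_transform_audit(values):
    """Take logs of positive values and count the observations a plain log would lose."""
    non_positive = sum(1 for v in values if v <= 0)
    logged = [math.log(v) for v in values if v > 0]
    return logged, non_positive

sample = [-50, 1, 2, 3, 4, 5, 6, 7, 8, 1000]   # hypothetical raw variable
clamped = winsorize(sample, fraction=0.1)       # one value clamped per tail
logged, lost = log_transform_audit([0, 1, 10])  # one zero cannot be logged
print(clamped)
print(len(logged), lost)
```

A nonzero `lost` count means the code either drops observations or uses a transformation such as log(1 + x); whichever it is should match what the paper states.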

Red flags:
- Unexplained sample size differences between data description and estimation
- Variables defined differently in different scripts
- Hardcoded filter values without documentation (e.g., `WHERE value > 1000` without explaining why 1000)

Step 2: Estimation Code-to-Paper Alignment¶

Compare the estimation code against the stated econometric specification:

Specification match¶

Write out the estimating equation from the paper
Write out what the code actually estimates
Are they the same? Check:
Dependent variable
Independent variables (treatment, controls)
Fixed effects
Interaction terms
Sample restrictions
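
One way to make the comparison mechanical is to write both specifications in a common shorthand and diff the pieces. The sketch below assumes a hypothetical `y ~ x1 + x2 | fe1 + fe2` notation chosen for illustration; it is not the syntax of any particular estimation package, and the variable names are made up.

```python
def parse_spec(formula):
    """Split 'y ~ x1 + x2 | fe1 + fe2' into dependent variable, regressors, fixed effects."""
    lhs, rhs = formula.split("~")
    main, _, fixed = rhs.partition("|")

    def terms(text):
        return {t.strip() for t in text.split("+") if t.strip()}

    return {
        "dependent": {lhs.strip()},
        "regressors": terms(main),
        "fixed_effects": terms(fixed),
    }

def spec_diff(paper_formula, code_formula):
    """List terms that appear on only one side, grouped by part of the specification."""
    paper, code = parse_spec(paper_formula), parse_spec(code_formula)
    return {
        part: {
            "paper_only": sorted(paper[part] - code[part]),
            "code_only": sorted(code[part] - paper[part]),
        }
        for part in paper
        if paper[part] != code[part]
    }

# Hypothetical example: the code omits one control and one fixed effect.
paper_spec = "log_volume ~ treated + size + age | firm + quarter"
code_spec = "log_volume ~ treated + size | firm"
diff = spec_diff(paper_spec, code_spec)
print(diff)
```

An empty result means the two write-ups agree term by term; sample restrictions still need to be compared by hand because they live outside the formula.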

Standard errors¶

What level are standard errors clustered at in the code?
Is this the level described in the paper?
Is the number of clusters reported? Is it sufficient (as a rough rule of thumb, >= 30)?
If robust (HC) errors are used, which variant? (HC0, HC1, HC2, HC3)
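
The cluster count is easy to verify directly from the clustering variable. A minimal sketch, assuming the cluster identifiers of the estimation sample are available as a list; the 30-cluster default mirrors the rule of thumb above, and the identifiers are hypothetical.

```python
from collections import Counter

def cluster_summary(cluster_ids, min_clusters=30):
    """Summarize the number and size of clusters in the estimation sample."""
    sizes = Counter(cluster_ids)
    return {
        "n_clusters": len(sizes),
        "smallest": min(sizes.values()),
        "largest": max(sizes.values()),
        "few_clusters": len(sizes) < min_clusters,
    }

# Hypothetical sample with only three clusters of unequal size.
ids = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
summary = cluster_summary(ids)
print(summary)
```

When `few_clusters` is true, check whether the paper acknowledges it or uses an inference method designed for few clusters.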

Fixed effects¶

Which fixed effects are included in the code?
Do they match what the paper reports?
Are they collinear with the treatment variable? (This would absorb the effect of interest.)
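
A quick test for the collinearity question is whether the treatment varies within each fixed-effect group: if it never does, that fixed effect absorbs it. The sketch below assumes row-level data as dictionaries; the column names and rows are illustrative only.

```python
from collections import defaultdict

def within_group_variation(rows, group, treatment):
    """Check whether `treatment` takes more than one value inside any `group` cell."""
    values = defaultdict(set)
    for row in rows:
        values[row[group]].add(row[treatment])
    varying = sum(1 for seen in values.values() if len(seen) > 1)
    return {
        "groups": len(values),
        "groups_with_variation": varying,
        "absorbed": varying == 0,  # no within-group variation left to identify the effect
    }

# Hypothetical panel where treatment is fixed within firm but varies within quarter.
rows = [
    {"firm": "A", "quarter": "Q1", "treated": 1},
    {"firm": "A", "quarter": "Q2", "treated": 1},
    {"firm": "B", "quarter": "Q1", "treated": 0},
    {"firm": "B", "quarter": "Q2", "treated": 0},
]
by_firm = within_group_variation(rows, "firm", "treated")
by_quarter = within_group_variation(rows, "quarter", "treated")
print(by_firm)
print(by_quarter)
```

Here firm fixed effects would absorb the treatment entirely, while quarter fixed effects would not.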

Functional form¶

Level vs. log specification: does the code match the paper?
Are polynomials or splines used as described?
For binary outcomes: logit/probit vs. linear probability model — consistent?

Step 3: Results Interpretation Check¶

Effect size interpretation¶

Is the coefficient interpreted in the correct units?
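Log-level: a one-unit increase in X is associated with approximately a (β×100)% change in Y (exactly 100×(e^β − 1)%; the approximation is only accurate for small β)
Log-log: a 1% increase in X is associated with approximately a β% change in Y
Level-log: a 1% increase in X is associated with approximately a (β/100)-unit change in Y
Level-level: a one-unit increase in X is associated with a β-unit change in Y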
Are economic magnitudes computed from the right baseline?
"A one-SD increase in X leads to..." — is the SD from the estimation sample or the full sample?

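The interpretation checks above involve a little arithmetic that is worth redoing by hand. The sketch below compares the usual (β×100)% shortcut for a log-level coefficient with the exact percentage change, and shows how a "one-SD" effect depends on which sample the SD comes from. The coefficient values and samples are made up for illustration.

```python
import math
import statistics

def log_level_pct_change(beta, delta_x=1.0):
    """Return (approximate, exact) percent change in Y for a change of delta_x in X."""
    approx = 100 * beta * delta_x
    exact = 100 * (math.exp(beta * delta_x) - 1)
    return approx, exact

def one_sd_effect(beta, x_values):
    """Effect of a one-standard-deviation change in X for a level-level coefficient."""
    return beta * statistics.stdev(x_values)

small = log_level_pct_change(0.05)  # shortcut is close for small coefficients
large = log_level_pct_change(0.5)   # shortcut understates the exact change

# Hypothetical X: the full sample has an outlier that the estimation sample excludes.
full_sample = [1, 2, 3, 4, 5, 40]
estimation_sample = [1, 2, 3, 4, 5]
effect_full = one_sd_effect(0.2, full_sample)
effect_est = one_sd_effect(0.2, estimation_sample)
print(small, large)
print(effect_full, effect_est)
```

If the paper's stated magnitude matches the full-sample figure but the regression used the restricted sample, the economic magnitude is overstated or understated accordingly.
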
Significance and inference¶

Do the reported significance stars match the standard errors?
Check: |coefficient / SE| > 1.96 for 5% significance (two-sided test, large-sample normal approximation)
Are confidence intervals correctly computed? (coef ± 1.96 × SE for 95%, under the same approximation)
If multiple testing is an issue, is there a correction?
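
These checks can be recomputed from the reported coefficient and standard error. The sketch below uses the large-sample normal approximation and assumes the common star convention (*** 1%, ** 5%, * 10%); papers differ, so confirm the convention in the table notes. Bonferroni is shown as one simple multiple-testing correction, not the only option, and the numbers are hypothetical.

```python
from statistics import NormalDist

def two_sided_p(coef, se):
    """Two-sided p-value under a standard normal reference distribution."""
    z = abs(coef / se)
    return 2 * (1 - NormalDist().cdf(z))

def stars(p):
    """Assumed convention: *** p<0.01, ** p<0.05, * p<0.10."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

def conf_int(coef, se, level=0.95):
    """Normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * se, coef + z * se

def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical table entry: coefficient 0.10 with standard error 0.06.
p = two_sided_p(0.10, 0.06)
recomputed_stars = stars(p)
lower, upper = conf_int(0.10, 0.06)
adjusted = bonferroni([0.01, 0.04, 0.20])
print(round(p, 4), recomputed_stars, (round(lower, 4), round(upper, 4)))
print(adjusted)
```

If the table shows two stars for this entry, the stars and the standard error disagree and one of them is misreported.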

Robustness assessment¶

Are the robustness checks addressing the actual threats to identification listed in the paper?
Or are they "cosmetic" — e.g., adding an irrelevant control, changing a bandwidth that doesn't matter?
A good robustness check directly tests a specific alternative explanation. Does each reported check do this?

Step 4: Red Flag Detection¶

P-value patterns¶

Count the p-values in the paper. Is there unusual clustering just below conventional thresholds (0.05, 0.01, 0.10)?
This is not proof of p-hacking, but a pattern worth noting.
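
A simple way to do the count is to compare how many p-values fall in a narrow window just below a threshold with how many fall in an equally wide window just above it. The window width and the p-values below are hypothetical choices for illustration; this is a descriptive tally, not a formal test.

```python
def caliper_counts(p_values, threshold=0.05, width=0.01):
    """Count p-values just below and just above `threshold` within `width`."""
    below = sum(1 for p in p_values if threshold - width <= p < threshold)
    above = sum(1 for p in p_values if threshold <= p < threshold + width)
    return below, above

# Hypothetical p-values collected from a paper's tables.
collected = [0.003, 0.041, 0.044, 0.046, 0.048, 0.049, 0.052, 0.12, 0.30]
below, above = caliper_counts(collected)
print(below, above)
```

A large imbalance, such as five just below against one just above, is the kind of pattern to mention under Red Flags, with the caveat already noted that it is not proof of p-hacking.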

Specification searching¶

Are there signs that many specifications were tried but only favorable ones reported?
Does the paper report a "preferred specification" without justifying why it's preferred?
Are there large jumps in coefficients across specifications? These suggest the result is sensitive to modeling choices.
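
Coefficient stability across reported specifications can be tabulated quickly. In this sketch the first entry is treated as the baseline and the 50% tolerance is an arbitrary illustrative cutoff, not a standard; the specification labels and estimates are hypothetical.

```python
def coefficient_stability(estimates, tolerance=0.5):
    """Compare each specification's coefficient with the first (baseline) one.

    Assumes a nonzero baseline coefficient.
    """
    names = list(estimates)
    baseline = estimates[names[0]]
    report = {}
    for name in names[1:]:
        rel_change = (estimates[name] - baseline) / abs(baseline)
        sign_flip = estimates[name] * baseline < 0
        report[name] = {
            "relative_change": rel_change,
            "sign_flip": sign_flip,
            "flagged": sign_flip or abs(rel_change) > tolerance,
        }
    return report

# Hypothetical treatment coefficients from three columns of a results table.
columns = {"baseline": 0.20, "+controls": 0.18, "+firm FE": 0.05}
stability = coefficient_stability(columns)
flagged = [name for name, row in stability.items() if row["flagged"]]
print(flagged)
```

A flagged column is not an error in itself; the question for the review is whether the paper explains why the estimate moves.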

Selective reporting¶

Are there variables in the estimation code that don't appear in any table?
Are there specifications in the code that aren't reported?
Is there a "table graveyard" — tables mentioned in comments but not in the paper?

Magic numbers¶

Are there hardcoded values in the estimation or analysis code?
Examples: `if p_value < 0.05`, `threshold = 100`, `winsorize_level = 0.01`
These should be documented or parameterized.
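
For Python analysis code, a rough scan for magic numbers can be done with the standard `ast` module. The sketch below assumes one convention for what counts as "documented": a literal assigned to an UPPER_CASE name. That convention, the allowed values 0 and 1, and the sample script are all assumptions made for illustration.

```python
import ast

def find_magic_numbers(source, allowed=(0, 1)):
    """Return (line, value) for numeric literals not bound to an UPPER_CASE constant."""
    tree = ast.parse(source)
    documented = set()
    for node in ast.walk(tree):
        # Treat NAME = <literal> with an all-caps NAME as a documented parameter.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if all(isinstance(t, ast.Name) and t.id.isupper() for t in node.targets):
                documented.add(id(node.value))
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Constant):
            continue
        value = node.value
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if value in allowed or id(node) in documented:
            continue
        hits.append((node.lineno, value))
    return sorted(hits)

# Hypothetical analysis script; it is only parsed, never executed.
script = """SIGNIFICANCE_LEVEL = 0.05
if p_value < 0.05:
    keep = volume > 1000
winsorize_level = 0.01
"""
magic = find_magic_numbers(script)
print(magic)
```

Each hit is a prompt to ask where the value comes from and whether the paper mentions it; the scan cannot judge whether a number is justified.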

Suspiciously clean results¶

All coefficients significant with the same sign
R-squared suspiciously high for the type of regression
Perfect monotonicity in heterogeneity results
Zero standard errors or perfect prediction

Step 5: Internal Consistency Across Pipeline Stages¶

The most common source of technical problems is inconsistency between pipeline stages:

Check	What to verify
Data → Estimation	Does the estimation code read the dataset that the data stage produced? Same variables, same sample?
Estimation → Analysis	Does the analysis script use the estimation results correctly?
Analysis → Draft	Does the draft accurately report what the analysis found?
RSD → Estimation	Does the estimation implement the strategy described in the Research Strategy Document?
RSD → Draft	Does the paper claim to do what the RSD says?
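
A lightweight way to start the cross-stage check is to line up the observation count each stage reports and look for unexplained gaps. The stage names and counts below are hypothetical; in practice they would be read from logs, regression output, and the draft's tables.

```python
def reconcile_counts(stage_counts):
    """Return (earlier_stage, later_stage, difference) wherever adjacent counts disagree."""
    stages = list(stage_counts.items())
    gaps = []
    for (prev_name, prev_n), (name, n) in zip(stages, stages[1:]):
        if n != prev_n:
            gaps.append((prev_name, name, prev_n - n))
    return gaps

# Hypothetical observation counts, ordered from the data stage to the draft.
counts = {
    "data_output": 12500,
    "estimation_input": 12500,
    "regression_n": 11840,
    "table_in_draft": 11900,
}
gaps = reconcile_counts(counts)
print(gaps)
```

A positive difference means observations were lost between two stages; a negative one, as between the regression and the draft here, means the draft reports more observations than the code estimated on. Either should be traced to a documented step.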

Output Structure¶


## Technical Review

### Summary Assessment
[Pass / Concerns / Fail — with one-paragraph justification]

### Data Pipeline
[Issues found, or confirmation that pipeline is sound]

### Estimation Implementation
[Code-to-paper alignment issues, or confirmation of alignment]

### Results Reporting
[Interpretation issues, significance checks, robustness assessment]

### Red Flags
[Any red flags detected, or "None detected"]

### Strengths
[What the implementation does well — be specific]

### Recommendations
[Specific actions to address each issue, mapped to pipeline stages]
