Skip to content

technical-review

Category: review
Field: economics
License: private (curator-owned)
Updated: 2026-05-20
Stages: referee-simulation

Curator-private skill — copy text from 100xOS/shared/skills/review/technical-review.md.

Technical Review

Purpose

This skill describes how to conduct a technical/methodological review of a research paper, focusing on the internal coherence of the data-to-results pipeline. Unlike a referee report (which evaluates contribution and positioning), the technical review asks: does the implementation actually do what the paper claims it does?


Step 1: Data Pipeline Verification

Trace the data from source to estimation sample:

SQL / data construction

  • Read the data queries. Do they construct the sample described in the paper?
  • Are exclusion criteria in the code the same as those described in the text?
  • Are variable definitions in the code consistent with the data description?
  • Does the time period in the query match the stated time period?

Sample construction

  • How are missing values handled? Is it documented?
  • Are there implicit filters (e.g., inner joins that drop observations)?
  • Is the unit of observation what the paper claims? (e.g., firm-quarter, address-day, protocol-week)
  • Are there duplicates? Is deduplication documented?

Variable definitions

  • Are continuous variables winsorized or trimmed? At what level?
  • Are categorical variables grouped appropriately?
  • Are log transformations applied where claimed? What about zeros?
  • Do variable names in the code match the variable names in the paper?

Red flags: - Unexplained sample size differences between data description and estimation - Variables defined differently in different scripts - Hardcoded filter values without documentation (e.g., WHERE value > 1000 without explaining why 1000)


Step 2: Estimation Code-to-Paper Alignment

Compare the estimation code against the stated econometric specification:

Specification match

  • Write out the estimating equation from the paper
  • Write out what the code actually estimates
  • Are they the same? Check:
  • Dependent variable
  • Independent variables (treatment, controls)
  • Fixed effects
  • Interaction terms
  • Sample restrictions

Standard errors

  • What level are standard errors clustered at in the code?
  • Is this the level described in the paper?
  • Is the number of clusters reported? Is it sufficient (>= 30)?
  • If robust (HC) errors are used, which variant? (HC0, HC1, HC2, HC3)

Fixed effects

  • Which fixed effects are included in the code?
  • Do they match what the paper reports?
  • Are they collinear with the treatment variable? (This would absorb the effect of interest.)

Functional form

  • Level vs. log specification: does the code match the paper?
  • Are polynomials or splines used as described?
  • For binary outcomes: logit/probit vs. linear probability model — consistent?

Step 3: Results Interpretation Check

Effect size interpretation

  • Is the coefficient interpreted in the correct units?
  • Log-level: a one-unit increase in X is associated with a (β×100)% change in Y
  • Log-log: a 1% increase in X is associated with a β% change in Y
  • Level-level: a one-unit increase in X is associated with a β-unit change in Y
  • Are economic magnitudes computed from the right baseline?
  • "A one-SD increase in X leads to..." — is the SD from the estimation sample or the full sample?

Significance and inference

  • Do the reported significance stars match the standard errors?
  • Check: |coefficient / SE| > 1.96 for 5% significance
  • Are confidence intervals correctly computed? (coef ± 1.96 × SE for 95%)
  • If multiple testing is an issue, is there a correction?

Robustness assessment

  • Are the robustness checks addressing the actual threats to identification listed in the paper?
  • Or are they "cosmetic" — e.g., adding an irrelevant control, changing a bandwidth that doesn't matter?
  • A good robustness check directly tests a specific alternative explanation. Does each reported check do this?

Step 4: Red Flag Detection

P-value patterns

  • Count the p-values in the paper. Is there unusual clustering just below conventional thresholds (0.05, 0.01, 0.10)?
  • This is not proof of p-hacking, but a pattern worth noting.

Specification searching

  • Are there signs that many specifications were tried but only favorable ones reported?
  • Does the paper report a "preferred specification" without justifying why it's preferred?
  • Are there large jumps in coefficients across specifications? This suggests sensitivity to choices.

Selective reporting

  • Are there variables in the estimation code that don't appear in any table?
  • Are there specifications in the code that aren't reported?
  • Is there a "table graveyard" — tables mentioned in comments but not in the paper?

Magic numbers

  • Are there hardcoded values in the estimation or analysis code?
  • Examples: if p_value < 0.05, threshold = 100, winsorize_level = 0.01
  • These should be documented or parameterized.

Suspiciously clean results

  • All coefficients significant with the same sign
  • R-squared suspiciously high for the type of regression
  • Perfect monotonicity in heterogeneity results
  • Zero standard errors or perfect prediction

Step 5: Internal Consistency Across Pipeline Stages

The most common source of technical problems is inconsistency between pipeline stages:

Check What to verify
Data → Estimation Does the estimation code read the data the data stage produced? Same variables, same sample?
Estimation → Analysis Does the analysis script use the estimation results correctly?
Analysis → Draft Does the draft accurately report what the analysis found?
RSD → Estimation Does the estimation implement the strategy described in the Research Strategy Document?
RSD → Draft Does the paper claim to do what the RSD says?

Output Structure

Text Only
## Technical Review

### Summary Assessment
[Pass / Concerns / Fail — with one-paragraph justification]

### Data Pipeline
[Issues found, or confirmation that pipeline is sound]

### Estimation Implementation
[Code-to-paper alignment issues, or confirmation of alignment]

### Results Reporting
[Interpretation issues, significance checks, robustness assessment]

### Red Flags
[Any red flags detected, or "None detected"]

### Strengths
[What the implementation does well — be specific]

### Recommendations
[Specific actions to address each issue, mapped to pipeline stages]