Merged
10 changes: 5 additions & 5 deletions METHODOLOGY_REVIEW.md
@@ -588,14 +588,14 @@ and covariate-adjusted specifications.)

**Documentation in place:**
- REGISTRY.md section: `## TwoStageDiD` (Stage 1 unit+time FE on untreated, Stage 2 OLS on residualized outcomes, GMM sandwich variance per Newey-McFadden Theorem 6.1)
- Paper review: `docs/methodology/papers/gardner-2022-review.md` (PR-A — eq./section-numbered review of arXiv:2207.05943; corrected a fabricated Eq. 6 variance deviation, see "Documented alignment" below)
- Implementation: 76 unit tests in `tests/test_two_stage.py` (matches ImputationDiD point estimates, R `did2s` global `(D'D)^{-1}` variance, always-treated unit exclusion, multiplier bootstrap)
- Documented R alignment: uses global `(D'D)^{-1}` matching `did2s` (not paper Eq. 6)
- Documented alignment: variance = global `(D'D)^{-1}` GMM sandwich (Newey-McFadden Theorem 6.1, Gardner §3.3) — **faithful to both the paper and `did2s`**. Gardner eq. (6) is the *event-study regression spec*, not a variance formula; the earlier "matches `did2s`, not paper Eq. 6" / "Newey-McFadden sandwich vs paper's Eq. 6 deviation" framing was a misattribution, corrected in PR-A across `REGISTRY.md` + the paper review.
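The two-stage point estimate summarized in the `## TwoStageDiD` bullet (Stage 1 unit+time FE on untreated observations, Stage 2 OLS on residualized outcomes) can be illustrated with a minimal NumPy sketch. The function name `two_stage_att` and the toy panel are hypothetical and are not the library's API; the sketch covers the point estimate only and ignores covariates, always-treated exclusion, and variance.

```python
import numpy as np

def two_stage_att(y, unit, time, treated):
    """Illustrative two-stage DiD point estimate (hypothetical helper, not the library API)."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    # Dummy design: all unit dummies plus time dummies except the first period.
    units, u_idx = np.unique(np.asarray(unit), return_inverse=True)
    times, t_idx = np.unique(np.asarray(time), return_inverse=True)
    n = y.size
    X = np.zeros((n, units.size + times.size - 1))
    X[np.arange(n), u_idx] = 1.0
    later = t_idx > 0
    X[np.flatnonzero(later), units.size + t_idx[later] - 1] = 1.0

    # Stage 1: fit unit + time fixed effects on untreated observations only.
    coef, *_ = np.linalg.lstsq(X[~treated], y[~treated], rcond=None)

    # Residualize every observation using the Stage 1 fixed effects.
    y_tilde = y - X @ coef

    # Stage 2: OLS of the residualized outcome on the treatment dummy (no intercept).
    d = treated.astype(float)
    return float(d @ y_tilde / (d @ d))

# Toy panel (assumed for illustration): 6 units, 5 periods, units 0-2 treated from period 3.
rng = np.random.default_rng(0)
unit = np.repeat(np.arange(6), 5)
time = np.tile(np.arange(5), 6)
treated = (unit < 3) & (time >= 3)
y = rng.normal(size=6)[unit] + rng.normal(size=5)[time] + 2.0 * treated
att = two_stage_att(y, unit, time, treated)
```

Because the toy outcome is noiseless apart from the fixed effects, Stage 1 recovers them exactly on the untreated rows and the Stage 2 coefficient equals the constant effect built into the data.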

**Outstanding for promotion:**
- Dedicated `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components walk-through
- R parity benchmark fixture against `did2s` (none on file)
- Documented deviation: Newey-McFadden Theorem 6.1 sandwich vs paper's Eq. 6 (already noted in REGISTRY but not formalized in this tracker)
- "Corrections Made" listing
- "Corrections Made" listing + flip Status → Complete (PR-B)

---

@@ -1444,10 +1444,10 @@ more graceful handling of edge cases while still signaling invalid inference to

Promotion priority for the **In Progress** entries, ordered by what's blocked on substantive review work (top of list = needs review next) vs. consolidation pass (bottom of list = mostly tracker walk-through):

**Substantive-review-blocked (still missing a methodology test file / R parity and a paper review):**
**Substantive-review-blocked (each still missing one or more of: a methodology test file, R parity, or a paper review):**

1. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
2. **TwoStageDiD** — the remaining half of the imputation pair (ImputationDiD is now Complete, validated against `didimputation`). Needs a Gardner (2022) paper review, `tests/test_methodology_two_stage.py`, and an R parity fixture against `did2s`.
2. **TwoStageDiD** — the remaining half of the imputation pair (ImputationDiD is now Complete, validated against `didimputation`). Gardner (2022) paper review **landed** (`docs/methodology/papers/gardner-2022-review.md`, PR-A); still needs `tests/test_methodology_two_stage.py` and an R parity fixture against `did2s` to flip to Complete (PR-B).

**Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**

1 change: 1 addition & 0 deletions TODO.md
@@ -173,6 +173,7 @@ Deferred items from PR reviews that were not addressed before merge.
|-------|----------|----|----------|
| Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift | `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` | staggered-analysis-2x2 | Low |
| ImputationDiD covariate-path variance lacks dedicated R `didimputation` parity / hand-calc. The PR-B FE-design correction (keep all unit dummies) affects the covariate projection too, but only the no-covariate staggered panel is R-parity'd (the covariate path shares the same validated projection code and passes the full suite). Add a covariate (time-varying X) R golden asserting overall/event-study SE parity, or a small dense-design hand-calc for the covariate projection. | `tests/test_methodology_imputation.py`, `benchmarks/R/generate_didimputation_golden.R` | imputation-validation follow-up | Low |
| TwoStageDiD methodology validation PR-B: add `tests/test_methodology_two_stage.py` (eq./section-numbered Verified Components — Stage-1 FE recovery on untreated obs; Stage-2 overall ATT eq. 4 + event-study eq. 6; GMM first-stage-correction behavior; always-treated drop) + `did2s` R parity fixture (`benchmarks/R/generate_did2s_golden.R` + `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`); then flip `METHODOLOGY_REVIEW.md` TwoStageDiD row In Progress → Complete. PR-A (paper review `gardner-2022-review.md`) merged separately. | `tests/test_methodology_two_stage.py`, `benchmarks/`, `METHODOLOGY_REVIEW.md` | two-stage-validation PR-B | Medium |
| Port the CI `<notebook-prose>` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Low |
| **Premise corrected — no CI impact (verified 2026-06-07).** The "slow CI" motivation does not hold: no CI workflow installs R (no `setup-r` / `r-lib/actions` / `fixest` / `r-base` install anywhere in `.github/workflows/`), so every R-parity test skips in CI behind a per-file availability gate (`fixest_available` in twfe, `_check_r_contdid()` in continuous_did, `require_r` / `r_available` in `conftest.py`, etc.) — consolidating `Rscript` spawns yields zero CI speedup. The originally-cited file already session-caches its R fits: `test_methodology_twfe.py` exposes `r_twfe_results` / `r_twfe_results_with_covariate` as `scope="session"` fixtures, so each R model runs once per session, not once per test. The only residual is a LOCAL-dev micro-optimization for developers who have R installed: `test_methodology_continuous_did.py` (the `_run_r_contdid` helper plus three standalone inline `Rscript` calls) and `test_methodology_callaway.py` (`_run_r_estimation` called inline in three test methods, plus `_get_r_mpdta_and_results` re-run by the MPDTA R-parity tests) re-spawn `library(...)` per call with no session-level result cache. Applying the twfe session-fixture pattern there would speed local R-parity runs only. Low value; retained as a local-dev note. | `tests/test_methodology_continuous_did.py`, `tests/test_methodology_callaway.py` | #139 | Low |
| CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
7 changes: 4 additions & 3 deletions diff_diff/two_stage_results.py
@@ -25,9 +25,10 @@ class TwoStageBootstrapResults:
Results from TwoStageDiD bootstrap inference.

Bootstrap uses multiplier bootstrap on the GMM influence function,
consistent with other library estimators. The R `did2s` package uses
block bootstrap by default; multiplier bootstrap is asymptotically
equivalent.
consistent with other library estimators. The R `did2s` package defaults
to analytical corrected clustered SEs (``bootstrap = FALSE``); its optional
block bootstrap (``bootstrap = TRUE``) and this multiplier bootstrap are
asymptotically equivalent.

Attributes
----------
2 changes: 2 additions & 0 deletions docs/doc-deps.yaml
@@ -252,6 +252,8 @@ sources:
- path: docs/methodology/REGISTRY.md
section: "TwoStageDiD"
type: methodology
- path: docs/methodology/papers/gardner-2022-review.md
type: methodology
- path: docs/api/two_stage.rst
type: api_reference
- path: docs/tutorials/12_two_stage_did.ipynb
12 changes: 6 additions & 6 deletions docs/methodology/REGISTRY.md
@@ -1395,28 +1395,28 @@ Point estimates are identical to ImputationDiD (Borusyak et al. 2024). The two-s
The variance accounts for first-stage estimation error propagating into Stage 2, following the GMM framework:

```
V(tau_hat) = (D'D)^{-1} * Bread * (D'D)^{-1}
V(tau_hat) = (D'D)^{-1} * Meat * (D'D)^{-1} [(D'D)^{-1} = GLOBAL GMM bread (Jacobian inverse)]

Bread = sum_c ( sum_{i in c} psi_i )( sum_{i in c} psi_i )'
Meat = sum_c ( sum_{i in c} psi_i )( sum_{i in c} psi_i )' [score outer-product, clustered at unit]
```

where `psi_i` is the stacked influence function for unit i across all its observations, combining the Stage 2 score and the Stage 1 correction term.

**Note on Equation 6 discrepancy:** The paper's Equation 6 uses a per-cluster inverse `(D_c'D_c)^{-1}` when forming the influence function contribution. The R `did2s` implementation and our code use the GLOBAL inverse `(D'D)^{-1}` following standard GMM theory (Newey & McFadden 1994). We follow the R implementation, which is consistent with standard GMM sandwich variance estimation.
**Variance is faithful to the paper (global Jacobian inverse).** Gardner (2022) §3.3 derives the variance by reading the two stages as a joint GMM estimator (Hansen 1982) and applying Newey & McFadden (1994) Theorem 6.1: `v` is the last element of `E[∂f/∂(λ,γ,β)]^{-1} E[ff'] E[∂f/∂(λ,γ,β)]^{-1'}` — the **global** Jacobian inverse (the `(D'D)^{-1}` bread above), with the score outer-product `E[ff']` clustered at the unit per the reference Stata GMM `vce(cluster id)` (Appendix B). Our global `(D'D)^{-1}` bread + unit-clustered meat **matches** this and the R `did2s` implementation; there is **no** per-cluster inverse. (Equation (6) in the paper is the *event-study regression specification*, not a variance formula — an earlier "Equation 6 per-cluster inverse `(D_c'D_c)^{-1}`" note was a misattribution, corrected per `docs/methodology/papers/gardner-2022-review.md`.)

**No finite-sample adjustments:** The variance estimator uses the raw asymptotic sandwich without degrees-of-freedom corrections (no HC1-style `n/(n-k)` adjustment). This matches the R `did2s` implementation.
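As a rough illustration of the sandwich described here (one global `(D'D)^{-1}` bread, a unit-clustered score outer-product as meat, and no degrees-of-freedom correction), the following NumPy sketch may help. It assumes the observation-level scores `psi` are already available as an `(n_obs, k)` array; building the Stage 1 correction term is not shown. The function name `clustered_sandwich` is hypothetical and not part of the library.

```python
import numpy as np

def clustered_sandwich(D, psi, cluster):
    """V = (D'D)^{-1} Meat (D'D)^{-1} with a global bread and cluster-summed scores."""
    D = np.asarray(D, dtype=float)
    psi = np.asarray(psi, dtype=float)
    _, c_idx = np.unique(np.asarray(cluster), return_inverse=True)

    # Sum observation-level scores within each cluster.
    psi_c = np.zeros((c_idx.max() + 1, psi.shape[1]))
    np.add.at(psi_c, c_idx, psi)

    bread = np.linalg.inv(D.T @ D)   # global inverse, computed once for all clusters
    meat = psi_c.T @ psi_c           # sum_c (sum_i psi_i)(sum_i psi_i)'
    return bread @ meat @ bread      # raw asymptotic sandwich, no n/(n-k) scaling

# Tiny hand-checkable case (assumed data): one regressor, two clusters.
D = np.ones((4, 1))
resid = np.array([1.0, -1.0, 2.0, -2.0])
psi = D * resid[:, None]
V = clustered_sandwich(D, psi, cluster=[0, 1, 0, 1])
```

In this toy case the cluster sums are 3 and -3, so the meat is 18, the bread is 1/4, and the variance is 18/16 = 1.125.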

*Bootstrap:*

Our implementation uses multiplier bootstrap on the GMM influence function: cluster-level `psi` sums are pre-computed, then perturbed with multiplier weights (Rademacher by default; configurable via `bootstrap_weights` parameter to use Mammen or Webb weights, matching CallawaySantAnna). The R `did2s` package defaults to block bootstrap (resampling clusters with replacement). Both approaches are asymptotically valid; the multiplier bootstrap is computationally cheaper and consistent with the CallawaySantAnna/ImputationDiD bootstrap patterns in this library.
Our implementation uses multiplier bootstrap on the GMM influence function: cluster-level `psi` sums are pre-computed, then perturbed with multiplier weights (Rademacher by default; configurable via `bootstrap_weights` parameter to use Mammen or Webb weights, matching CallawaySantAnna). The R `did2s` package **defaults to analytical corrected clustered SEs** (`bootstrap = FALSE`, the same GMM sandwich); its block bootstrap is *optional* (`bootstrap = TRUE`, resampling clusters with replacement). All approaches are asymptotically valid; the multiplier bootstrap is computationally cheaper and consistent with the CallawaySantAnna/ImputationDiD bootstrap patterns in this library.
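A minimal sketch of the multiplier bootstrap described in this paragraph, assuming the cluster-level `psi` sums and the bread matrix have already been computed. The function name `multiplier_bootstrap_se`, its parameters, and the choice to summarize the draws by their standard deviation are assumptions for illustration, not the library's interface; only Rademacher weights are shown.

```python
import numpy as np

def multiplier_bootstrap_se(psi_c, bread, n_boot=999, seed=0):
    """Bootstrap SEs by perturbing pre-computed cluster-level psi sums (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    psi_c = np.asarray(psi_c, dtype=float)            # shape (n_clusters, k)

    # One Rademacher weight (+1 or -1) per cluster per bootstrap draw.
    w = rng.choice([-1.0, 1.0], size=(n_boot, psi_c.shape[0]))

    # Each draw is bread @ sum_c w_c * psi_c, a perturbation of the estimate.
    draws = (w @ psi_c) @ bread.T
    return draws.std(axis=0, ddof=1)

# Assumed toy inputs: 50 clusters, 2 coefficients.
rng = np.random.default_rng(1)
psi_c = rng.normal(size=(50, 2))
bread = np.eye(2) / 50.0
se_boot = multiplier_bootstrap_se(psi_c, bread, n_boot=5000)
se_analytic = np.sqrt(np.diag(bread @ psi_c.T @ psi_c @ bread))
```

Since Rademacher weights have mean zero and unit variance, the variance of the draws matches the analytical sandwich in expectation, so `se_boot` should be close to `se_analytic` for a large number of draws.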

*Edge cases:*
- **Always-treated units:** Units treated in all observed periods have no untreated observations for Stage 1 FE estimation. These are excluded with a warning listing the affected unit IDs. Their treated observations do NOT contribute to Stage 2.
- **Rank condition violations:** If the Stage 1 design matrix (unit+time dummies on untreated obs) is rank-deficient, or if certain unit/time FE are unidentified (e.g., a unit with no untreated periods after excluding always-treated), the affected FE produce NaN. Behavior controlled by `rank_deficient_action`: "warn" (default), "error", or "silent".
- **NaN y_tilde handling:** When Stage 1 FE are unidentified for some observations, the residualized outcome `y_tilde` is NaN. These observations are zeroed out (excluded) from the Stage 2 regression and variance computation, matching the treatment of unimputable observations in ImputationDiD.
- **NaN inference for undefined statistics:** t_stat uses NaN when SE is non-finite or zero; p_value and CI also NaN. Matches CallawaySantAnna/ImputationDiD NaN convention.
- **Event study aggregation:** Horizon-specific effects use the same two-stage procedure with horizon indicator dummies in Stage 2. Unidentified horizons (e.g., long-run effects without never-treated units, per Proposition 5 of Borusyak et al. 2024) produce NaN.
- **Pre-period event study coefficients (`pretrends=True`):** When enabled, the Stage 2 design matrix `X_2` includes pre-period relative-time dummies. Pre-period observations have `y_tilde = Step 1 residual` by construction. The GMM sandwich variance accounts for Stage 1 estimation error (Gardner 2022, Theorem 1). Only affects event study aggregation; overall ATT unchanged.
- **Pre-period event study coefficients (`pretrends=True`):** When enabled, the Stage 2 design matrix `X_2` includes pre-period relative-time dummies. Pre-period observations have `y_tilde = Step 1 residual` by construction. The GMM sandwich variance accounts for Stage 1 estimation error (Gardner 2022 §3.3; Newey-McFadden 1994, Theorem 6.1 — the paper has no numbered theorems). Only affects event study aggregation; overall ATT unchanged.
- **balance_e with no qualifying cohorts:** If no cohorts have sufficient pre/post coverage for the requested `balance_e`, a warning is emitted and event study results contain only the reference period.
- **No never-treated units (Proposition 5):** When there are no never-treated units and multiple treatment cohorts, horizons h >= h_bar (where h_bar = max(groups) - min(groups)) are unidentified per Proposition 5 of Borusyak et al. (2024). These produce NaN inference with n_obs > 0 (treated observations exist but counterfactual is unidentified) and a warning listing affected horizons. Matches ImputationDiD behavior. Proposition 5 applies to event study horizons only, not cohort aggregation — a cohort whose treated obs all fall at Prop 5 horizons naturally gets n_obs=0 in group effects because all its y_tilde values are NaN.
- **Zero-observation horizons after filtering:** When `balance_e` or NaN `y_tilde` filtering results in zero observations for some non-Prop-5 event study horizons, those horizons produce NaN for all inference fields (effect, SE, t-stat, p-value, CI) with n_obs=0.
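Two of the conventions in the list above are easy to state in code: the NaN rule for undefined inference and the `h_bar` cutoff from Proposition 5. The sketch below uses hypothetical helper names and assumes a normal reference distribution for the p-value and CI, which may differ from what the library actually uses.

```python
import math

def safe_inference(effect, se, z_crit=1.959963984540054):
    """Return (t_stat, p_value, ci); all NaN when the SE is non-finite or zero."""
    nan = float("nan")
    if not math.isfinite(se) or se == 0.0:
        return nan, nan, (nan, nan)
    t = effect / se
    p = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided p-value under a normal reference (assumed)
    return t, p, (effect - z_crit * se, effect + z_crit * se)

def unidentified_horizons(groups, horizons):
    """Horizons h >= h_bar = max(groups) - min(groups), for panels with no never-treated units."""
    h_bar = max(groups) - min(groups)
    return [h for h in horizons if h >= h_bar]

t_bad, p_bad, ci_bad = safe_inference(1.0, 0.0)
t_ok, p_ok, ci_ok = safe_inference(1.96, 1.0)
dropped = unidentified_horizons(groups=[3, 5, 6], horizons=range(0, 5))
```

With cohorts first treated in periods 3, 5, and 6, `h_bar` is 3, so horizons 3 and 4 are flagged as unidentified while horizons 0 through 2 remain estimable.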
@@ -1436,7 +1436,7 @@ Our implementation uses multiplier bootstrap on the GMM influence function: clus
- [x] Stage 2: Regress residualized outcomes on treatment indicators
- [x] Point estimates match ImputationDiD
- [x] GMM sandwich variance (Newey & McFadden 1994 Theorem 6.1)
- [x] Global `(D'D)^{-1}` in variance (matches R `did2s`, not paper Eq. 6)
- [x] Global `(D'D)^{-1}` in variance (faithful to Gardner §3.3 / Newey-McFadden GMM sandwich; matches R `did2s`)
- [x] No finite-sample adjustment (raw asymptotic sandwich)
- [x] Always-treated units excluded with warning
- [x] Multiplier bootstrap on GMM influence function