Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -9,8 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added
- **`generate_synthetic_control_data()` data generator + a capstone `SyntheticControl` tutorial.** New public generator (`diff_diff/prep_dgp.py`, exported from `diff_diff`) builds a **single-treated-unit** factor-model panel for synthetic-control demos and tests: one treated unit whose latent factor loadings and baseline are an exact convex combination of a few donors (so the noiseless trajectory lies in the donor convex hull and a synthetic control reproduces it closely — the observed fit is approximate under added noise), persistent AR(1) factors, predictor covariates that each proxy a distinct factor, a common calendar time effect, and a known `"ramp"` or `"constant"` treatment effect emitted as `true_effect`. Tutorial **`docs/tutorials/25_synthetic_control_policy.ipynb`** walks the whole `SyntheticControl` surface end-to-end on a policy-evaluation story (one state adopts a clean-energy standard), structured around **two inference philosophies**: cross-unit permutation (`in_space_placebo` + Firpo–Possebom `confidence_set`, with `leave_one_out` / `in_time_placebo` robustness) versus over-time conformal (CWZ `conformal_test` / `conformal_confidence_intervals` / `conformal_average_effect`), with the per-period conformal band as the climax. A `tests/test_t25_synthetic_control_policy_drift.py` drift guard re-derives every quoted number from the generator.
- **`TwoStageDiD` methodology validation (Gardner 2022 / `did2s`).** New `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components (§3 two-stage procedure / eqs. 4 & 6, §3.3 GMM variance, fn. 19 always-treated exclusion, Proposition 5, covariate path, `balance_e`, `vcov_type` narrowing) plus a `did2s::did2s()` cross-language parity fixture (`benchmarks/R/generate_did2s_golden.R` → `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`), pinning overall + event-study ATT (abs 1e-6) and SE (abs 1e-7). `METHODOLOGY_REVIEW.md` `TwoStageDiD` row flipped to **Complete**.

### Fixed
- **`TwoStageDiD` analytical GMM standard errors are now exact (match R `did2s` to ~1e-7).** The Gardner two-stage GMM sandwich `_compute_gmm_variance` derived its residuals from the *iterative* alternating-projection first-stage fixed effects (`_iterative_fe`, which converge only to ~1e-7 on unbalanced untreated panels) while computing `gamma_hat` exactly — leaving the variance ~1% off the analytical sandwich. The variance now re-solves the Stage-1 FE **exactly** (sparse OLS, reusing the `gamma_hat` factorization), and `_build_fe_design` gained an intercept column so its column space spans the grand mean (the prior intercept-free design omitted it, and the exact residual is first-order sensitive to it). Unidentified-FE obs (rank-deficient / Proposition-5) fall back to the iterative residual, so those edge cases are unchanged; the reported `overall_att` still uses the iterative FE (point-estimate equivalence with `ImputationDiD` preserved). Mirrors the same-class fix already applied to `ImputationDiD`'s exact-sparse variance.
- **`LinearRegression.get_se()` / `get_inference()` no longer return a `NaN` standard error from a tiny-negative variance artifact.** A high-leverage / degenerate coefficient (e.g. an absorbed-FE dummy near-collinear with the treatment, whose Bell-McCaffrey Satterthwaite DOF already hits the noise-floor guard) can have a CR2/HC variance of ~0 (≈1e-32) whose vcov diagonal lands just-below-zero under BLAS-dependent float rounding; `np.sqrt` of the negative then produced a `NaN` SE **nondeterministically** — passing single-threaded but failing under the parallel pure-Python full-suite run (`tests/test_methodology_wls_cr2.py::TestLinearRegressionFENanGuardEndToEnd::test_did_absorbed_fe_lr_inference_nan_for_guarded_coefs`). Both SE sites now clamp the vcov diagonal at 0, so the SE is finite (0 for a genuinely-zero variance), deterministic, and BLAS-independent. **No change for any positive variance** (the clamp is a no-op there); only the previously-`NaN` degenerate case is affected.
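
The clamp described in the last entry can be sketched as follows. This is a minimal illustration only: `se_from_vcov` is a hypothetical helper name, not the library's function, and the 2x2 matrix is made-up data chosen to reproduce a just-below-zero diagonal entry.

```python
import numpy as np


def se_from_vcov(vcov: np.ndarray) -> np.ndarray:
    """Return standard errors, clamping tiny-negative diagonal entries to 0.

    Hypothetical helper for illustration; not the library's own function.
    """
    variances = np.diag(vcov)
    # np.maximum leaves positive variances untouched and maps negatives to 0.
    # A genuinely NaN variance still propagates as NaN.
    return np.sqrt(np.maximum(variances, 0.0))


# A 2x2 vcov whose second diagonal entry is a rounding artifact just below zero.
vcov = np.array([[4.0, 0.0], [0.0, -1e-32]])

with np.errstate(invalid="ignore"):
    naive_se = np.sqrt(np.diag(vcov))  # second entry is NaN

clamped_se = se_from_vcov(vcov)  # second entry is 0.0

print("naive:  ", naive_se)
print("clamped:", clamped_se)
```

Because the clamp is `max(v, 0)`, any positive variance passes through unchanged, which is consistent with the "no change for any positive variance" claim above.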

## [3.5.2] - 2026-06-08
42 changes: 29 additions & 13 deletions METHODOLOGY_REVIEW.md
@@ -48,7 +48,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
| SunAbraham | `sun_abraham.py` | `fixest::sunab()` | **Complete** | 2026-02-15 |
| StackedDiD | `stacked_did.py` | `stacked-did-weights` (Wing-Freedman-Hollingsworth code) | **Complete** | 2026-02-19 |
| ImputationDiD | `imputation.py` | `didimputation` | **Complete** | 2026-06-06 |
| TwoStageDiD | `two_stage.py` | `did2s` | **In Progress** | |
| TwoStageDiD | `two_stage.py` | `did2s` | **Complete** | 2026-06-23 |
| WooldridgeDiD (ETWFE) | `wooldridge.py` | `etwfe` (R) / `jwdid` (Stata) | **Complete** | 2026-05-22 |
| EfficientDiD | `efficient_did.py` | (no canonical R package) | **Complete** | 2026-06-01 |

@@ -583,19 +583,36 @@ and covariate-adjusted specifications.)
| Module | `two_stage.py`, `two_stage_bootstrap.py` |
| Primary Reference | Gardner (2022), *Two-stage differences in differences*, arXiv:2207.05943 |
| R Reference | `did2s` |
| Status | **In Progress** |
| Last Review | |
| Status | **Complete** |
| Last Review | 2026-06-23 |

**Documentation in place:**
- REGISTRY.md section: `## TwoStageDiD` (Stage 1 unit+time FE on untreated, Stage 2 OLS on residualized outcomes, GMM sandwich variance per Newey-McFadden Theorem 6.1)
- Paper review: `docs/methodology/papers/gardner-2022-review.md` (PR-A — eq./section-numbered review of arXiv:2207.05943; corrected a fabricated Eq. 6 variance deviation, see "Documented alignment" below)
- Implementation: 76 unit tests in `tests/test_two_stage.py` (matches ImputationDiD point estimates, R `did2s` global `(D'D)^{-1}` variance, always-treated unit exclusion, multiplier bootstrap)
- Documented alignment: variance = global `(D'D)^{-1}` GMM sandwich (Newey-McFadden Theorem 6.1, Gardner §3.3) — **faithful to both the paper and `did2s`**. Gardner eq. (6) is the *event-study regression spec*, not a variance formula; the earlier "matches `did2s`, not paper Eq. 6" / "Newey-McFadden sandwich vs paper's Eq. 6 deviation" framing was a misattribution, corrected in PR-A across `REGISTRY.md` + the paper review.
**Verified Components** (`tests/test_methodology_two_stage.py`, paper-section/equation-numbered):
- **§3, eqs. (4)/(6) — the two-stage procedure:** Stage 1 unit+time FE on the untreated set Ω₀, Stage 2 on the residualized outcome recovers the constant overall ATT (eq. 4) and heterogeneous-by-horizon effects (Step 2′ / eq. 6); perturbing a treated outcome shifts the overall ATT by exactly δ/N_treated (treated obs never feed Stage 1); coincides with `ImputationDiD` to 1e-10; the live `covariates=` (fn. 9 in-both-stages) path recovers the ATT under a treatment-confounded covariate — `TestGardner2022Section3TwoStageProcedure`
- **§3.3 + Appendix B — joint-GMM Newey-McFadden Thm 6.1 variance:** finite/positive SEs; the first-stage correction `γ̂'c_g` makes the GMM SE strictly exceed the no-first-stage floor; a fixed-seed SE regression-pin locks the global-inverse + no-FSA convention; `cluster=None` clusters at the unit; a rank-deficient Ω₀ warns and falls back to dense lstsq; `pretrends=True` leads have finite SEs — `TestGardner2022Section33GMMVariance`
- **fn. 19 + Proposition 5 (Borusyak et al. 2024):** always-treated units dropped with a warning (treated-unit count falls exactly); no never-treated ⇒ horizons h ≥ h̄ NaN with n_obs>0 + warning; `balance_e` with no qualifying cohort warns + reference-period-only; zero-obs cohorts NaN — `TestGardner2022Identification`
- **Library extensions:** multiplier bootstrap on the GMM influence function; `vcov_type ∈ {classical, hc2, hc2_bm}` rejected (no cross-stage hat matrix) — `TestGardner2022LibraryDeviations`
- **R parity vs `did2s::did2s` (v1.2.1):** overall + event-study ATT and SEs match on a fixed-seed staggered panel (analytical corrected clustered SE, `bootstrap=FALSE`); tests assert ATT `abs=1e-6`, SE `abs=1e-7` — `TestTwoStageDiDParityR` (goldens: `benchmarks/data/did2s_golden.json`, generator: `benchmarks/R/generate_did2s_golden.R`)
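
The "perturbing a treated outcome shifts the overall ATT by exactly δ/N_treated" property in the §3 component can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration (the toy panel, the helper names `fe_design` and `two_stage_att`, a dense `lstsq` for Stage 1); it is not the `TwoStageDiD` implementation. With a single treatment dummy and no intercept in Stage 2, the OLS coefficient reduces to the mean residualized outcome over treated observations, which is what the sketch computes.

```python
import numpy as np


def fe_design(unit, time, n_units, n_periods):
    """Intercept + unit dummies + time dummies (first unit and first period dropped)."""
    n = unit.size
    X = np.zeros((n, n_units + n_periods - 1))
    X[:, 0] = 1.0
    rows = np.arange(n)
    has_unit = unit > 0
    X[rows[has_unit], unit[has_unit]] = 1.0  # columns 1 .. n_units-1
    has_time = time > 0
    X[rows[has_time], n_units - 1 + time[has_time]] = 1.0  # remaining columns
    return X


def two_stage_att(y, unit, time, treated, n_units, n_periods):
    """Toy two-stage estimate of a constant overall ATT (illustrative only)."""
    X = fe_design(unit, time, n_units, n_periods)
    # Stage 1: unit + time FE fitted on untreated observations only.
    beta, *_ = np.linalg.lstsq(X[~treated], y[~treated], rcond=None)
    # Residualize every observation with the Stage-1 fit.
    resid = y - X @ beta
    # Stage 2: regressing resid on the treatment dummy alone gives the treated mean.
    return resid[treated].mean()


rng = np.random.default_rng(0)
n_units, n_periods = 6, 8
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
# Two cohorts (first treated at t=4 and t=6) plus two never-treated units.
first_treat = np.array([4, 4, 6, 6, np.inf, np.inf])
treated = time >= first_treat[unit]
y = (rng.normal(size=n_units)[unit] + rng.normal(size=n_periods)[time]
     + 2.0 * treated + rng.normal(scale=0.01, size=unit.size))

att = two_stage_att(y, unit, time, treated, n_units, n_periods)

# Perturb one treated outcome by delta; Stage 1 never sees it.
delta = 0.6
y_shift = y.copy()
y_shift[np.flatnonzero(treated)[0]] += delta
att_shift = two_stage_att(y_shift, unit, time, treated, n_units, n_periods)

shift = att_shift - att
print(shift, delta / treated.sum())
```

The shift equals δ/N_treated because the Stage-1 coefficients are computed from untreated rows only, so they are identical in both runs and only one treated residual changes.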

**Outstanding for promotion:**
- Dedicated `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components walk-through
- R parity benchmark fixture against `did2s` (none on file)
- "Corrections Made" listing + flip Status → Complete (PR-B)
**Corrections Made** (PR-B, surfaced by the `did2s` R parity):
- **Exact Stage-1 residuals in the GMM variance.** `_compute_gmm_variance` derived its residuals from the *iterative* alternating-projection FE (`_iterative_fe`, ~1e-7 convergence on unbalanced Ω₀) while computing `gamma_hat` exactly — leaving the analytical SE ~1% off the exact sandwich. The variance now re-solves the Stage-1 FE **exactly** (sparse OLS, reusing the `gamma_hat` factorization); the reported `overall_att` still uses the iterative FE (twin-equivalence preserved at 1e-10). Unidentified-FE obs (rank-deficient / Prop-5) fall back to the iterative residual.
- **Intercept added to `_build_fe_design`.** The prior `[unit_1.., time_1..]` layout (drop first unit + first time, **no intercept**) did not span the constant / grand mean; the exact residual is first-order sensitive to it. Added an intercept column → standard full-rank two-way FE (matches fixest / `did2s`). Same-class fix as ImputationDiD's PR-B FE-design correction. SE now matches `did2s` to ~1e-9.
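
Why the intercept-free, drop-first layout cannot reproduce the grand mean can be checked on a tiny balanced panel. The sketch below is a hedged illustration: `fe_dummies` and `distance_to_constant` are hypothetical names, and the layout is only a stand-in for what `_build_fe_design` is described as doing. The row for the first unit in the first period is all zeros without an intercept, so no linear combination of the columns can equal 1 there.

```python
import numpy as np


def fe_dummies(unit, time, n_units, n_periods, intercept):
    """Unit and time dummies with the first level of each dropped."""
    rows = np.arange(unit.size)
    U = np.zeros((unit.size, n_units - 1))
    U[rows[unit > 0], unit[unit > 0] - 1] = 1.0
    T = np.zeros((unit.size, n_periods - 1))
    T[rows[time > 0], time[time > 0] - 1] = 1.0
    blocks = [U, T]
    if intercept:
        blocks.insert(0, np.ones((unit.size, 1)))
    return np.hstack(blocks)


def distance_to_constant(X):
    """Norm of the part of the all-ones vector outside the column space of X."""
    ones = np.ones(X.shape[0])
    coef, *_ = np.linalg.lstsq(X, ones, rcond=None)
    return np.linalg.norm(ones - X @ coef)


n_units, n_periods = 4, 5
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

X_no = fe_dummies(unit, time, n_units, n_periods, intercept=False)
X_with = fe_dummies(unit, time, n_units, n_periods, intercept=True)

gap_no = distance_to_constant(X_no)      # at least 1: the constant is not spanned
gap_with = distance_to_constant(X_with)  # numerically zero
rank_no = np.linalg.matrix_rank(X_no)
rank_with = np.linalg.matrix_rank(X_with)

print(rank_no, rank_with, gap_no, gap_with)
```

On this toy panel the intercept raises the rank from `n_units + n_periods - 2` to `n_units + n_periods - 1`, the usual full-rank two-way FE parameterization.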

**Deviations from the reference / library extensions** (see `REGISTRY.md` `## TwoStageDiD`):
- **Deviation from R:** the `did2s` analytical GMM sandwich uses **no finite-sample multiplier** (meat `= S'S`); the rendered `CR1` label carries no Stata `(n-1)/(n-p)` or `G/(G-1)` factor (matches `did2s`; same FSA-free convention as ImputationDiD's Theorem-3 variance).
- Multiplier bootstrap on the GMM influence function (library extension; Gardner prescribes analytical GMM SEs only; `did2s` defaults `bootstrap=FALSE`).
- ⚠️ Paper-permitted but **not exposed**: the Eq. (5) P̄-period-average estimand and the fn. 8 full-sample first-stage variant (tracked in `TODO.md`).
- `vcov_type` narrowed to the GMM sandwich (`{classical, hc2, hc2_bm}` rejected); `vcov_type="conley"` deferred (TODO).
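
To make the "no finite-sample multiplier" convention concrete, the sketch below compares a cluster sandwich with meat `S'S` against the same matrix scaled by the Stata-style `G/(G-1) * (n-1)/(n-p)` factor. It uses a plain one-stage OLS on simulated data as a stand-in; the two-stage GMM bread and the first-stage correction are deliberately not reproduced, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, per_cluster = 20, 5
cluster = np.repeat(np.arange(n_clusters), per_cluster)
n = cluster.size

# Simulated regression with a cluster-level error component.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n_clusters)[cluster] + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# Cluster-summed scores S (one row per cluster); the meat is S'S.
scores = X * resid[:, None]
S = np.zeros((n_clusters, X.shape[1]))
np.add.at(S, cluster, scores)

bread = np.linalg.inv(X.T @ X)
V_plain = bread @ (S.T @ S) @ bread  # no finite-sample adjustment

# Stata-style small-sample factor, shown only for comparison.
p = X.shape[1]
factor = n_clusters / (n_clusters - 1) * (n - 1) / (n - p)
V_cr1 = factor * V_plain

se_plain = np.sqrt(np.diag(V_plain))
se_cr1 = np.sqrt(np.diag(V_cr1))
print(factor, se_cr1 / se_plain)
```

The two sets of standard errors differ by exactly `sqrt(factor)`, so a parity test against a reference that omits the factor has to omit it too.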

**R Comparison Results** (`did2s` v1.2.1, fixed-seed panel, `benchmarks/data/did2s_golden.json`):

| Quantity | Python | R | \|Δ\| |
|----------|--------|---|-----|
| Overall ATT | 2.04566790 | 2.04566803 | 1.3e-7 |
| Overall SE | 0.02813225 | 0.02813225 | 1.0e-9 |
| Event-study ATT (max) | — | — | 1.6e-7 |
| Event-study SE (max) | — | — | 3.7e-10 |

(Point-estimate Δ is iterative-FE convergence level; SE Δ is machine precision after the exact Stage-1 re-solve. The parity tests assert ATT `abs=1e-6`, SE `abs=1e-7` for cross-platform robustness.)
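
As a quick arithmetic check of the table against the stated tolerance, the overall-ATT gap can be re-derived from the two printed values. This is only a sketch using the standard library; the actual parity tests may be written differently.

```python
import math

# Values copied from the "Overall ATT" row of the table above.
python_att, r_att = 2.04566790, 2.04566803

att_gap = abs(python_att - r_att)
# Absolute-tolerance comparison at the level the parity tests are said to assert.
att_ok = math.isclose(python_att, r_att, rel_tol=0.0, abs_tol=1e-6)

print(f"{att_gap:.1e}", att_ok)
```

The overall-SE gap cannot be re-derived the same way: both values print identically at eight decimals, and a difference of 1.0e-9 is below that display resolution.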

---

@@ -1447,7 +1464,6 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on
**Substantive-review-blocked (each still missing one or more of: a methodology test file, R parity, or a paper review):**

1. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
2. **TwoStageDiD** — the remaining half of the imputation pair (ImputationDiD is now Complete, validated against `didimputation`). Gardner (2022) paper review **landed** (`docs/methodology/papers/gardner-2022-review.md`, PR-A); still needs `tests/test_methodology_two_stage.py` and an R parity fixture against `did2s` to flip to Complete (PR-B).

**Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**

2 changes: 1 addition & 1 deletion TODO.md
@@ -153,6 +153,7 @@ Deferred items from PR reviews that were not addressed before merge.
| `SpilloverDiD` data-driven `d_bar` selection (Butts 2021b / Butts 2023 JUE Insight cross-validation). | `spillover.py::SpilloverDiD` | follow-up | Low |
| `SpilloverDiD` sparse cKDTree path for the staggered nearest-treated-distance helper (mirrors the static helper's sparse branch). Currently `_compute_nearest_treated_distance_staggered` always builds dense `(n_units, n_treated_by_onset)` pairwise distance matrices per cohort; on large staggered panels with many cohorts this is avoidable memory/runtime. Add a sparse k-d-tree branch analogous to `_compute_nearest_treated_distance_sparse`, gated on `n > _CONLEY_SPARSE_N_THRESHOLD`. | `spillover.py::_compute_nearest_treated_distance_staggered` | follow-up (Wave B) | Low |
| `SpilloverDiDResults` in `DiagnosticReport` dispatch tables. Wave C event-study emits a TwoStageDiD-compatible `event_study_effects: Dict[int, Dict]` alias that `plot_event_study` consumes via the new `reference_period` attribute fallback in `_extract_plot_data`, but `SpilloverDiDResults` is NOT registered in `DiagnosticReport`'s `_APPLICABILITY` / `_PT_METHOD` tables — so `DiagnosticReport(spillover_result)` doesn't currently route to event-study diagnostics. Registering requires (a) deciding which diagnostics apply (parallel trends, pre-trends power, heterogeneity, design-effect) AND (b) adding an end-to-end test. | `diff_diff/diagnostic_report.py::_APPLICABILITY`, `_PT_METHOD` | follow-up (Wave C) | Low |
| `TwoStageDiD` paper-permitted estimand variants not exposed (Gardner 2022): the **Eq. (5) P̄-period-average** estimand (duration-restricted Stage-2 sample) and the **fn. 8 full-sample first-stage** variant (treatment-status×period interactions in Stage 1). Both are valid modifications described in the paper but have no public parameter (`get_params()` exposes neither; Stage 1 is always untreated-only). Documented as ⚠️ in `docs/methodology/papers/gardner-2022-review.md`; surface as `estimand=`/`first_stage=` options if a use case arises. | `diff_diff/two_stage.py` | two-stage-validation follow-up | Low |

#### Performance

@@ -173,7 +174,6 @@ Deferred items from PR reviews that were not addressed before merge.
|-------|----------|----|----------|
| Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift | `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` | staggered-analysis-2x2 | Low |
| ImputationDiD covariate-path variance lacks dedicated R `didimputation` parity / hand-calc. The PR-B FE-design correction (keep all unit dummies) affects the covariate projection too, but only the no-covariate staggered panel is R-parity'd (the covariate path shares the same validated projection code and passes the full suite). Add a covariate (time-varying X) R golden asserting overall/event-study SE parity, or a small dense-design hand-calc for the covariate projection. | `tests/test_methodology_imputation.py`, `benchmarks/R/generate_didimputation_golden.R` | imputation-validation follow-up | Low |
| TwoStageDiD methodology validation PR-B: add `tests/test_methodology_two_stage.py` (eq./section-numbered Verified Components — Stage-1 FE recovery on untreated obs; Stage-2 overall ATT eq. 4 + event-study eq. 6; GMM first-stage-correction behavior; always-treated drop) + `did2s` R parity fixture (`benchmarks/R/generate_did2s_golden.R` + `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`); then flip `METHODOLOGY_REVIEW.md` TwoStageDiD row In Progress → Complete. PR-A (paper review `gardner-2022-review.md`) merged separately. | `tests/test_methodology_two_stage.py`, `benchmarks/`, `METHODOLOGY_REVIEW.md` | two-stage-validation PR-B | Medium |
| Port the CI `<notebook-prose>` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Low |
| **Premise corrected — no CI impact (verified 2026-06-07).** The "slow CI" motivation does not hold: no CI workflow installs R (no `setup-r` / `r-lib/actions` / `fixest` / `r-base` install anywhere in `.github/workflows/`), so every R-parity test skips in CI behind a per-file availability gate (`fixest_available` in twfe, `_check_r_contdid()` in continuous_did, `require_r` / `r_available` in `conftest.py`, etc.) — consolidating `Rscript` spawns yields zero CI speedup. The originally-cited file already session-caches its R fits: `test_methodology_twfe.py` exposes `r_twfe_results` / `r_twfe_results_with_covariate` as `scope="session"` fixtures, so each R model runs once per session, not once per test. The only residual is a LOCAL-dev micro-optimization for developers who have R installed: `test_methodology_continuous_did.py` (the `_run_r_contdid` helper plus three standalone inline `Rscript` calls) and `test_methodology_callaway.py` (`_run_r_estimation` called inline in three test methods, plus `_get_r_mpdta_and_results` re-run by the MPDTA R-parity tests) re-spawn `library(...)` per call with no session-level result cache. Applying the twfe session-fixture pattern there would speed local R-parity runs only. Low value; retained as a local-dev note. | `tests/test_methodology_continuous_did.py`, `tests/test_methodology_callaway.py` | #139 | Low |
| CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |