diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
index 8ef36f8a..9f9611b8 100644
--- a/METHODOLOGY_REVIEW.md
+++ b/METHODOLOGY_REVIEW.md
@@ -588,14 +588,14 @@ and covariate-adjusted specifications.)
 
 **Documentation in place:**
 - REGISTRY.md section: `## TwoStageDiD` (Stage 1 unit+time FE on untreated, Stage 2 OLS on residualized outcomes, GMM sandwich variance per Newey-McFadden Theorem 6.1)
+- Paper review: `docs/methodology/papers/gardner-2022-review.md` (PR-A — eq./section-numbered review of arXiv:2207.05943; corrected a fabricated Eq. 6 variance deviation, see "Documented alignment" below)
 - Implementation: 76 unit tests in `tests/test_two_stage.py` (matches ImputationDiD point estimates, R `did2s` global `(D'D)^{-1}` variance, always-treated unit exclusion, multiplier bootstrap)
-- Documented R alignment: uses global `(D'D)^{-1}` matching `did2s` (not paper Eq. 6)
+- Documented alignment: variance = global `(D'D)^{-1}` GMM sandwich (Newey-McFadden Theorem 6.1, Gardner §3.3) — **faithful to both the paper and `did2s`**. Gardner eq. (6) is the *event-study regression spec*, not a variance formula; the earlier "matches `did2s`, not paper Eq. 6" / "Newey-McFadden sandwich vs paper's Eq. 6 deviation" framing was a misattribution, corrected in PR-A across `REGISTRY.md` + the paper review.
 
 **Outstanding for promotion:**
 - Dedicated `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components walk-through
 - R parity benchmark fixture against `did2s` (none on file)
-- Documented deviation: Newey-McFadden Theorem 6.1 sandwich vs paper's Eq. 6 (already noted in REGISTRY but not formalized in this tracker)
-- "Corrections Made" listing
+- "Corrections Made" listing + flip Status → Complete (PR-B)
 
 ---
 
@@ -1444,10 +1444,10 @@ more graceful handling of edge cases while still signaling invalid inference to
 
 Promotion priority for the **In Progress** entries, ordered by what's blocked on substantive review work (top of list = needs review next) vs. consolidation pass (bottom of list = mostly tracker walk-through):
 
-**Substantive-review-blocked (still missing a methodology test file / R parity and a paper review):**
+**Substantive-review-blocked (each still missing one or more of: a methodology test file, R parity, or a paper review):**
 
 1. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
-2. **TwoStageDiD** — the remaining half of the imputation pair (ImputationDiD is now Complete, validated against `didimputation`). Needs a Gardner (2022) paper review, `tests/test_methodology_two_stage.py`, and an R parity fixture against `did2s`.
+2. **TwoStageDiD** — the remaining half of the imputation pair (ImputationDiD is now Complete, validated against `didimputation`). Gardner (2022) paper review **landed** (`docs/methodology/papers/gardner-2022-review.md`, PR-A); still needs `tests/test_methodology_two_stage.py` and an R parity fixture against `did2s` to flip to Complete (PR-B).
 
 **Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**
 
diff --git a/TODO.md b/TODO.md
index 6beaacd3..045f6fe1 100644
--- a/TODO.md
+++ b/TODO.md
@@ -173,6 +173,7 @@ Deferred items from PR reviews that were not addressed before merge.
 |-------|----------|----|----------|
 | Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift | `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` | staggered-analysis-2x2 | Low |
 | ImputationDiD covariate-path variance lacks dedicated R `didimputation` parity / hand-calc. The PR-B FE-design correction (keep all unit dummies) affects the covariate projection too, but only the no-covariate staggered panel is R-parity'd (the covariate path shares the same validated projection code and passes the full suite). Add a covariate (time-varying X) R golden asserting overall/event-study SE parity, or a small dense-design hand-calc for the covariate projection. | `tests/test_methodology_imputation.py`, `benchmarks/R/generate_didimputation_golden.R` | imputation-validation follow-up | Low |
+| TwoStageDiD methodology validation PR-B: add `tests/test_methodology_two_stage.py` (eq./section-numbered Verified Components — Stage-1 FE recovery on untreated obs; Stage-2 overall ATT eq. 4 + event-study eq. 6; GMM first-stage-correction behavior; always-treated drop) + `did2s` R parity fixture (`benchmarks/R/generate_did2s_golden.R` + `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`); then flip `METHODOLOGY_REVIEW.md` TwoStageDiD row In Progress → Complete. PR-A (paper review `gardner-2022-review.md`) merged separately. | `tests/test_methodology_two_stage.py`, `benchmarks/`, `METHODOLOGY_REVIEW.md` | two-stage-validation PR-B | Medium |
 | Port the CI `<notebook-prose>` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Low |
 | **Premise corrected — no CI impact (verified 2026-06-07).** The "slow CI" motivation does not hold: no CI workflow installs R (no `setup-r` / `r-lib/actions` / `fixest` / `r-base` install anywhere in `.github/workflows/`), so every R-parity test skips in CI behind a per-file availability gate (`fixest_available` in twfe, `_check_r_contdid()` in continuous_did, `require_r` / `r_available` in `conftest.py`, etc.) — consolidating `Rscript` spawns yields zero CI speedup. The originally-cited file already session-caches its R fits: `test_methodology_twfe.py` exposes `r_twfe_results` / `r_twfe_results_with_covariate` as `scope="session"` fixtures, so each R model runs once per session, not once per test. The only residual is a LOCAL-dev micro-optimization for developers who have R installed: `test_methodology_continuous_did.py` (the `_run_r_contdid` helper plus three standalone inline `Rscript` calls) and `test_methodology_callaway.py` (`_run_r_estimation` called inline in three test methods, plus `_get_r_mpdta_and_results` re-run by the MPDTA R-parity tests) re-spawn `library(...)` per call with no session-level result cache. Applying the twfe session-fixture pattern there would speed local R-parity runs only. Low value; retained as a local-dev note. | `tests/test_methodology_continuous_did.py`, `tests/test_methodology_callaway.py` | #139 | Low |
 | CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
diff --git a/diff_diff/two_stage_results.py b/diff_diff/two_stage_results.py
index 3bc550e8..af240787 100644
--- a/diff_diff/two_stage_results.py
+++ b/diff_diff/two_stage_results.py
@@ -25,9 +25,10 @@ class TwoStageBootstrapResults:
     Results from TwoStageDiD bootstrap inference.
 
     Bootstrap uses multiplier bootstrap on the GMM influence function,
-    consistent with other library estimators. The R `did2s` package uses
-    block bootstrap by default; multiplier bootstrap is asymptotically
-    equivalent.
+    consistent with other library estimators. The R `did2s` package defaults
+    to analytical corrected clustered SEs (``bootstrap = FALSE``); its optional
+    block bootstrap (``bootstrap = TRUE``) and this multiplier bootstrap are
+    asymptotically equivalent.
 
     Attributes
     ----------
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
index ae3616a1..46ee8ecf 100644
--- a/docs/doc-deps.yaml
+++ b/docs/doc-deps.yaml
@@ -252,6 +252,8 @@ sources:
       - path: docs/methodology/REGISTRY.md
         section: "TwoStageDiD"
         type: methodology
+      - path: docs/methodology/papers/gardner-2022-review.md
+        type: methodology
       - path: docs/api/two_stage.rst
         type: api_reference
       - path: docs/tutorials/12_two_stage_did.ipynb
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 7c7c2740..ce725563 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1395,20 +1395,20 @@ Point estimates are identical to ImputationDiD (Borusyak et al. 2024). The two-s
 The variance accounts for first-stage estimation error propagating into Stage 2, following the GMM framework:
 
 ```
-V(tau_hat) = (D'D)^{-1} * Bread * (D'D)^{-1}
+V(tau_hat) = (D'D)^{-1} * Meat * (D'D)^{-1}     [(D'D)^{-1} = GLOBAL GMM bread (Jacobian inverse)]
 
-Bread = sum_c ( sum_{i in c} psi_i )( sum_{i in c} psi_i )'
+Meat = sum_c ( sum_{i in c} psi_i )( sum_{i in c} psi_i )'   [score outer-product, clustered at unit]
 ```
 
 where `psi_i` is the stacked influence function for unit i across all its observations, combining the Stage 2 score and the Stage 1 correction term.
 
-**Note on Equation 6 discrepancy:** The paper's Equation 6 uses a per-cluster inverse `(D_c'D_c)^{-1}` when forming the influence function contribution. The R `did2s` implementation and our code use the GLOBAL inverse `(D'D)^{-1}` following standard GMM theory (Newey & McFadden 1994). We follow the R implementation, which is consistent with standard GMM sandwich variance estimation.
+**Variance is faithful to the paper (global Jacobian inverse).** Gardner (2022) §3.3 derives the variance by reading the two stages as a joint GMM estimator (Hansen 1982) and applying Newey & McFadden (1994) Theorem 6.1: `v` is the last element of `E[∂f/∂(λ,γ,β)]^{-1} E[ff'] E[∂f/∂(λ,γ,β)]^{-1'}` — the **global** Jacobian inverse (the `(D'D)^{-1}` bread above), with the score outer-product `E[ff']` clustered at the unit per the reference Stata GMM `vce(cluster id)` (Appendix B). Our global `(D'D)^{-1}` bread + unit-clustered meat **matches** this and the R `did2s` implementation; there is **no** per-cluster inverse. (Equation (6) in the paper is the *event-study regression specification*, not a variance formula — an earlier "Equation 6 per-cluster inverse `(D_c'D_c)^{-1}`" note was a misattribution, corrected per `docs/methodology/papers/gardner-2022-review.md`.)
 
 **No finite-sample adjustments:** The variance estimator uses the raw asymptotic sandwich without degrees-of-freedom corrections (no HC1-style `n/(n-k)` adjustment). This matches the R `did2s` implementation.
 
 *Bootstrap:*
 
-Our implementation uses multiplier bootstrap on the GMM influence function: cluster-level `psi` sums are pre-computed, then perturbed with multiplier weights (Rademacher by default; configurable via `bootstrap_weights` parameter to use Mammen or Webb weights, matching CallawaySantAnna). The R `did2s` package defaults to block bootstrap (resampling clusters with replacement). Both approaches are asymptotically valid; the multiplier bootstrap is computationally cheaper and consistent with the CallawaySantAnna/ImputationDiD bootstrap patterns in this library.
+Our implementation uses multiplier bootstrap on the GMM influence function: cluster-level `psi` sums are pre-computed, then perturbed with multiplier weights (Rademacher by default; configurable via `bootstrap_weights` parameter to use Mammen or Webb weights, matching CallawaySantAnna). The R `did2s` package **defaults to analytical corrected clustered SEs** (`bootstrap = FALSE`, the same GMM sandwich); its block bootstrap is *optional* (`bootstrap = TRUE`, resampling clusters with replacement). All approaches are asymptotically valid; the multiplier bootstrap is computationally cheaper and consistent with the CallawaySantAnna/ImputationDiD bootstrap patterns in this library.
 
 *Edge cases:*
 - **Always-treated units:** Units treated in all observed periods have no untreated observations for Stage 1 FE estimation. These are excluded with a warning listing the affected unit IDs. Their treated observations do NOT contribute to Stage 2.
@@ -1416,7 +1416,7 @@ Our implementation uses multiplier bootstrap on the GMM influence function: clus
 - **NaN y_tilde handling:** When Stage 1 FE are unidentified for some observations, the residualized outcome `y_tilde` is NaN. These observations are zeroed out (excluded) from the Stage 2 regression and variance computation, matching the treatment of unimputable observations in ImputationDiD.
 - **NaN inference for undefined statistics:** t_stat uses NaN when SE is non-finite or zero; p_value and CI also NaN. Matches CallawaySantAnna/ImputationDiD NaN convention.
 - **Event study aggregation:** Horizon-specific effects use the same two-stage procedure with horizon indicator dummies in Stage 2. Unidentified horizons (e.g., long-run effects without never-treated units, per Proposition 5 of Borusyak et al. 2024) produce NaN.
-- **Pre-period event study coefficients (`pretrends=True`):** When enabled, the Stage 2 design matrix `X_2` includes pre-period relative-time dummies. Pre-period observations have `y_tilde = Step 1 residual` by construction. The GMM sandwich variance accounts for Stage 1 estimation error (Gardner 2022, Theorem 1). Only affects event study aggregation; overall ATT unchanged.
+- **Pre-period event study coefficients (`pretrends=True`):** When enabled, the Stage 2 design matrix `X_2` includes pre-period relative-time dummies. Pre-period observations have `y_tilde = Step 1 residual` by construction. The GMM sandwich variance accounts for Stage 1 estimation error (Gardner 2022 §3.3; Newey-McFadden 1994, Theorem 6.1 — the paper has no numbered theorems). Only affects event study aggregation; overall ATT unchanged.
 - **balance_e with no qualifying cohorts:** If no cohorts have sufficient pre/post coverage for the requested `balance_e`, a warning is emitted and event study results contain only the reference period.
 - **No never-treated units (Proposition 5):** When there are no never-treated units and multiple treatment cohorts, horizons h >= h_bar (where h_bar = max(groups) - min(groups)) are unidentified per Proposition 5 of Borusyak et al. (2024). These produce NaN inference with n_obs > 0 (treated observations exist but counterfactual is unidentified) and a warning listing affected horizons. Matches ImputationDiD behavior. Proposition 5 applies to event study horizons only, not cohort aggregation — a cohort whose treated obs all fall at Prop 5 horizons naturally gets n_obs=0 in group effects because all its y_tilde values are NaN.
 - **Zero-observation horizons after filtering:** When `balance_e` or NaN `y_tilde` filtering results in zero observations for some non-Prop-5 event study horizons, those horizons produce NaN for all inference fields (effect, SE, t-stat, p-value, CI) with n_obs=0.
@@ -1436,7 +1436,7 @@ Our implementation uses multiplier bootstrap on the GMM influence function: clus
 - [x] Stage 2: Regress residualized outcomes on treatment indicators
 - [x] Point estimates match ImputationDiD
 - [x] GMM sandwich variance (Newey & McFadden 1994 Theorem 6.1)
-- [x] Global `(D'D)^{-1}` in variance (matches R `did2s`, not paper Eq. 6)
+- [x] Global `(D'D)^{-1}` in variance (faithful to Gardner §3.3 / Newey-McFadden GMM sandwich; matches R `did2s`)
 - [x] No finite-sample adjustment (raw asymptotic sandwich)
 - [x] Always-treated units excluded with warning
 - [x] Multiplier bootstrap on GMM influence function
diff --git a/docs/methodology/papers/gardner-2022-review.md b/docs/methodology/papers/gardner-2022-review.md
new file mode 100644
index 00000000..1a626147
--- /dev/null
+++ b/docs/methodology/papers/gardner-2022-review.md
@@ -0,0 +1,155 @@
+# Paper Review: Two-stage differences in differences
+
+**Authors:** John Gardner (Department of Economics, University of Mississippi)
+**Citation:** Gardner, J. (2022). *Two-stage differences in differences.* arXiv:2207.05943 [econ.EM]. Working paper. URL: [https://arxiv.org/abs/2207.05943](https://arxiv.org/abs/2207.05943).
+**PDF reviewed:** arXiv:2207.05943**v1** (`papers/gardner-2207.05943.pdf`, 34 PDF pages = printed pp. 1–33: content pp. 1–23, References pp. 24–25, Figures/Tables/Appendix C pp. 26–33, incl. Table 4 "Application event study estimates" on printed p. 33). Per the project's PDFs-never-committed convention the local PDF is kept outside the repo (gitignored `papers/`); arXiv is the authoritative source. The title page reads "This version: April 2021"; the arXiv stamp is "arXiv:2207.05943v1 [econ.EM] 13 Jul 2022", so the canonical citation year is **2022** (matching `docs/references.rst` and `diff_diff/two_stage.py`). All equation/section numbers below are this version's.
+**Review date:** 2026-06-07
+
+---
+
+> **Scope note — full paper reviewed (no separate supplement).** Unlike the BJS (2024) REStud article, this is a self-contained working paper: all proofs (Appendix A), the reference Stata GMM implementation (Appendix B), and the full simulation/event-study tables (Appendix C) are **in this PDF**. There is therefore no supplement-only **GAP** class. Equations are numbered **(1)–(8)**; figures **1–4**; tables **1–4** (Table 4 = the Autor-application event-study estimates behind Figure 4, printed p. 33). **The paper contains no Gardner-numbered theorems or propositions** — Appendix A's consistency/unbiasedness results are prose proofs. The variance relies on **Newey & McFadden (1994), Theorem 6.1** (§3.3); §2.2 separately cites **de Chaisemartin & D'Haultfœuille (2020), Theorem 1** for the TWFE-weight representation. So there is no "Gardner Theorem 1" (the REGISTRY attribution of that name was a misattribution; see below). This matters for the REGISTRY reconciliation below.
+
+> **Personal verification stamp (load-bearing items spot-checked against the source).** The reviewer read the full PDF (pp. 1–23 content) directly and verified verbatim: the parallel-trends decomposition eq. (1) and 2×2 regression eq. (2) (pp. 4–5); the DiD-estimand weights eq. (3) (p. 7); the two-stage procedure Steps 1–2 (pp. 8–9); the estimands eqs. (4)–(5) (p. 10); the event-study spec **eq. (6)** and Step 2′ (pp. 11–12); the inference §3.3 GMM moment condition and the **unnumbered** Newey-McFadden Theorem 6.1 sandwich (pp. 12–13); the Appendix A FWL derivation eqs. (7)–(8) (pp. 20–21); and the **Appendix B reference Stata GMM** (`vce(cluster id)`, `onestep`, `winitial(identity)`) (p. 23). **Two REGISTRY misattributions were found and corrected in this PR (see "Gaps and Uncertainties"):** (i) the `## TwoStageDiD` "Note on Equation 6 discrepancy" claimed the *paper's Equation 6* uses a per-cluster inverse `(D_c'D_c)^{-1}` — but **eq. (6) is the event-study regression specification, and the variance (unnumbered, §3.3) uses the GLOBAL Jacobian inverse** `E[∂f/∂(λ,γ,β)]^{-1}`, exactly what the library computes; (ii) an edge-case note attributed the first-stage variance correction to "Gardner 2022, Theorem 1", which does not exist (the paper has no Theorem 1).
+
+---
+
+## Methodology Registry Entry
+
+*This file is the canonical **scholarly review** of the **primary source** (arXiv:2207.05943) for the library estimator `TwoStageDiD` (`two_stage.py`, `two_stage_bootstrap.py`, `two_stage_results.py`). It **supplements — it does not supersede** — the existing `docs/methodology/REGISTRY.md` `## TwoStageDiD` section, which remains the implementation contract. Heading/format aligned with REGISTRY conventions for ease of cross-reference.*
+
+## TwoStageDiD
+
+**Primary source:** Gardner (2022), *Two-stage differences in differences*, arXiv:2207.05943.
+
+**Central thesis (§1; Abstract):** When treatment adoption is staggered and average effects are heterogeneous across groups and periods, the usual difference-in-differences (and event-study) regression is **misspecified** for conditional mean outcomes — outcomes are not linear in group, period, and treatment status — so its coefficient is a hard-to-interpret (possibly negatively) weighted average of group×period effects (§2.2, eq. (3); following de Chaisemartin & D'Haultfœuille 2020, Goodman-Bacon 2018, Sun & Abraham 2020). Gardner's fix exploits a simple observation: **under parallel trends, untreated potential outcomes are linear in group and period effects.** So estimate those effects from the *untreated* subsample (Stage 1), subtract them from observed outcomes, and regress the adjusted outcomes on treatment status (Stage 2). This recovers the overall ATT `E(β_gp | D_gp = 1)` even under arbitrary heterogeneity, is trivial to implement, and yields valid asymptotic SEs via a joint-GMM reading of the two stages.
+
+**Key implementation requirements:**
+
+*Setup & notation (§2.1):*
+- Units `i`, calendar time `t`. Treatment **groups** `g ∈ {0,1,…,G}` and **periods** `p ∈ {0,1,…,P}` defined by staggered adoption: group `0` is untreated in all periods, group `1` is treated from period `1`, groups `1,2` from period `2`, etc. (group = cohort; period = treatment-onset epoch).
+- `Y_gpit`, `Y_1gpit`, `Y_0gpit` = observed, treated, and untreated potential outcomes for the `i`th member of group `g` at time `t` of period `p`.
+- `D_gp` = indicator that members of group `g` are treated in period `p`.
+- `β_gp = E(Y_1gpit − Y_0gpit | g, p)` = the **group×period average treatment effect**.
+
+*Identifying assumption — Parallel trends, eq. (1), p. 4:*
+```
+E(Y_gpit | g, p, D_gp) = λ_g + γ_p + β_gp D_gp          ... (1)
+```
+- Absent treatment, mean untreated outcomes decompose additively into group effects `λ_g` and period effects `γ_p`. This is the *only* identifying assumption; everything else is estimation.
+
+*The DiD-regression problem (§2.2):* the 2×2 regression `Y_gpit = λ_g + γ_p + β D_gp + ε` (eq. (2)) has probability limit
+```
+β* = Σ_{g=1}^{G} Σ_{p=g}^{P} ω_gp β_gp ,   ω_gp per eq. (3)          ... (3)
+```
+with weights `ω_gp` that can be **negative** and that systematically under-weight long-treated groups/early periods (the heterogeneity-robustness failure this paper fixes). `ω_11 = 1` with a single treatment group, so the 2×2 case is unaffected (fn 6).
+
+*Main estimator — two-stage procedure (§3):*
+```
+Step 1 (untreated subsample, D_gp = 0):   Y_gpit = λ_g + γ_p + ε_gpit   →  λ̂_g, γ̂_p
+Step 2 (all observations):                regress  (Y_gpit − λ̂_g − γ̂_p)  on  D_gp   →  β̂
+```
+- **Identification (§3, Appendix A):** as long as there are untreated *and* treated observations for each group and period, `λ_g, γ_p` are identified from the untreated subpopulation, and Step 2 consistently estimates `E(β_gp | D_gp = 1)` even under arbitrary group×period heterogeneity. Consistency/unbiasedness proved in Appendix A (prose proofs, not numbered).
+- **First-stage flexibility (fn 8):** Stage 1 need not use *only* untreated observations — any correctly specified conditional-mean model works (e.g. a full-sample model with treatment-status×period interactions, or group×period treatment indicators). Using the full sample can be **more efficient** (this is the "Two-stage (incl. treat×time in first stage)" row of Table 2). Restricting to untreated obs introduces **no** sample-selection bias because selection is on treatment status, a deterministic function of `(g,p)` (fn 10).
+
+*Estimands (§3.1):*
+```
+Overall ATT (default):     E(β_gp | D_gp = 1) = Σ_g Σ_p β_gp · P(g, p | D_gp = 1)        ... (4)
+P̄-period average:          Σ_g Σ_{p=g}^{g+P̄-1} β_gp · P(g | D_g = 1) / P̄                  ... (5)
+```
+- Eq. (4) is the default target (expectation over all observed treated units and periods). Eq. (5) averages group-specific effects over a **common completed duration `P̄`** — recover it by restricting the Stage-2 sample to treatment durations `≤ P̄`. The `P̄`-average is more comparable across groups when durations differ (it ignores effects beyond `P̄`).
+
+*Event studies (§3.2):*
+```
+Standard (misspecified under heterogeneity), eq. (6):
+   Y_gpit = λ_g + γ_p + Σ_{r=-R}^{P} β_r D_rgp + ε_gpit                  ... (6)
+   where, for r ≤ 0, D_rgp ∈ {D_{-R,gp},…,D_{0gp}} are (r+1)-period treatment-adoption LEADS,
+   and for r > 0, D_rgp ∈ {D_{1gp},…,D_{Pgp}} are r-period treatment-DURATION indicators.
+
+Two-stage event-study, Step 2′:
+   regress (Y_gpit − λ̂_g − γ̂_p) on D_{-R,gp},…,D_{0gt},…,D_{Pgp}   →  Ê(β_rgp | D_rgp = 1)
+```
+- **`D_rgp` is the event-study (relative-time) design** — leads for `r ≤ 0` (placebo/pre-trend tests), duration indicators for `r > 0`. (fn 12: calendar time `t` may replace coarse periods `p`. fn 13: duration↔period correspondence `β_rgp = β_{g,p-g+1}`.)
+- Like the static case, the **standard** event-study spec (eq. 6) is contaminated under heterogeneity — its leads can be nonzero even under true parallel trends (Sun & Abraham 2020). The two-stage Step 2′ fixes this (Figure 2 confirms: standard spec shows a spurious pre-trend dip; two-stage does not). fn 14: the Step-2′ estimand averages over groups with duration `≥ r` (the Sun–Abraham interaction-weighted target).
+
+*Standard errors — joint GMM, Newey-McFadden (1994) Theorem 6.1, §3.3, pp. 12–13:*
+SEs must account for the **generated dependent variable** `Y_gpit − λ̂_g − γ̂_p` (Dumont et al. 2005). Read the two stages as a single GMM estimator (Hansen 1982) with parameter `(λ, γ, β)` solving the moment condition
+```
+E[ f(λ, γ, β; W_gpit) ] = E [ [Y − (1g)'λ − (1p)'γ] · [(1g), (1p)]' · (1 − D_gp)     ]  = 0
+                              [ [Y − (1g)'λ − (1p)'γ − β D_gp] · D_gp                  ]
+```
+where `W_gpit = [Y_gpit, (1g)_gpit, (1p)_gpit, D_gp]`. The first moment block is restricted to **untreated** observations (the `(1 − D_gp)` factor) and identifies `λ, γ`; the second block identifies `β`. By Newey-McFadden Theorem 6.1, `√N(β̂ − β) →_d N(0, v)`, where `v` is the **last element** of
+```
+   E[ ∂f/∂(λ,γ,β) ]^{-1} · E[ f f' ] · E[ ∂f/∂(λ,γ,β) ]^{-1′}
+```
+- **This is the standard GMM sandwich with the GLOBAL Jacobian inverse `E[∂f/∂(λ,γ,β)]^{-1}` (the "bread") and the score outer-product `E[ff']` (the "meat").** There is **no per-cluster inverse** anywhere in the paper.
+- "Asymptotics for variations on the two-stage estimator (such as the event-study version) are similar" (p. 13).
+- **Reference implementation = Appendix B Stata GMM (p. 23):** `gmm (eq1: (y - {xb: i.year} - {xg: ibn.id})*(1-d)) (eq2: y - {xb:} - {xg:} - {delta}*d), instruments(eq1: i.year ibn.id) instruments(eq2: d) winitial(identity) onestep quickderivatives vce(cluster id)`. Key choices: **`vce(cluster id)`** clusters the meat at the unit (`id`) level with the **global** GMM bread; **`onestep` + `winitial(identity)`** (no efficient-GMM reweighting); **no finite-sample multiplier**. The empirical application clusters on **state** (Table 2 / Figure 4 notes; fn 19).
+
+*Edge cases / boundary conditions (from the paper):*
+- **Already-treated-in-first-period units → exclude.** The empirical application "exclude[s] units that were already treated in 1979, the first year of the sample, since no treatment effects are identified for this group" (fn 19). Maps to the library's always-treated exclusion.
+- **Rank/identification:** `λ_g, γ_p` require untreated *and* treated observations for each group and period. A post-adoption period with no untreated units leaves its `γ_p` unidentified (the Gardner-native analogue of the no-never-treated problem). The library's rank-deficiency guard (`rank_deficient_action`) handles this.
+- **Pre-trend testing:** event-study leads `D_rgp, r ≤ 0` serve as placebo tests, but only the **two-stage** leads (Step 2′) are heterogeneity-robust; the standard-spec leads (eq. 6) are not.
+- **Covariates (fn 9):** include time-varying covariates in *both* regressions (analogous to regression DiD — assumes treatment doesn't influence the covariates and covariates don't influence the effect); or adjust by the last pre-treatment realization `X*_gpit`; or a doubly-robust combination. All are "more parametric than" the CS (2020) inverse-probability-weighting approach. (Library covariate support is the same `covariates=` story as ImputationDiD.)
+
+**Reference implementation(s):**
+- Stata: Appendix B GMM syntax (above); the `did2s` Stata command (Butts & Gardner).
+- R: **`did2s`** (Kyle Butts & John Gardner) — **the library's R parity reference for PR-B.**
+- The two-stage estimator coincides in point estimates with the imputation estimator (Borusyak, Jaravel & Spiess 2024) and with `ImputationDiD` in this library (the natural validation pair).
+
+**Requirements checklist** (paper → library; ✓ = implemented + verified in source review, ⚠️ = paper-permitted but **not exposed** by the library, ☐ = PR-B deliverable):
+- [x] Stage 1 group+period FE fit on **untreated obs only** (`D_gp = 0`), iterative for unbalanced panels (`_iterative_fe`).
+- [x] Stage 2 regress residualized `Y − λ̂ − γ̂` on `D_gp` → overall ATT eq. (4) (`_stage2_static`); event-study Step 2′ → eq. (6) horizons (`_stage2_event_study`; `horizon_max`, `pretrends`); plus a library cohort/group aggregation (`_stage2_group`, a CS-style extension — not a distinct Gardner estimand).
+- [x] **Variance = joint-GMM Newey-McFadden Thm 6.1 sandwich with the GLOBAL Jacobian inverse, meat clustered at the unit** (Appendix B `vce(cluster id)`); **no finite-sample multiplier**. Library matches this (the global-bread sandwich `(X'₂WX₂)^{-1}(S'S)(X'₂WX₂)^{-1}` with unit-clustered GMM-corrected score `S_g = γ̂'c_g − X'_{2g}ε_{2g}`).
+- [x] Always-treated exclusion (fn 19); covariates in both stages (fn 9, `covariates=` — the in-both-regressions approach); `anticipation` handling.
+- ⚠️ **Paper-permitted but NOT exposed by `TwoStageDiD`** (recorded as scope, not a deliverable): the **`P̄`-average estimand eq. (5)** (duration-restricted Stage-2 sample) and the **full-sample first-stage variant (fn 8)** — both described in the paper as valid modifications, but the library has no public parameter for either (`get_params()` = `anticipation, alpha, cluster, n_bootstrap, bootstrap_weights, seed, rank_deficient_action, horizon_max, pretrends, vcov_type`; Stage 1 is always untreated-only). The fn-9 last-pre-treatment / doubly-robust covariate variants are likewise not exposed (only the in-both-stages approach is).
+- [ ] **Dedicated `tests/test_methodology_two_stage.py`** with eq./section-numbered Verified Components (Stage-1 FE recovery; Stage-2 overall ATT eq. 4 + event-study eq. 6; GMM first-stage-correction behavior; always-treated drop). — *PR-B.*
+- [ ] **R parity fixture vs `did2s`** (`benchmarks/R/generate_did2s_golden.R` + `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`); overall ATT + event-study ATT/SE. — *PR-B.*
+- [x] **`two_stage.py` docstring checked — already accurate.** `_compute_gmm_variance`'s docstring (`two_stage.py:2842-2857`) states the code uses the GLOBAL Hessian inverse matching `did2s` and gives the correct sandwich `V = (X'₂X₂)^{-1}(Σ_g S_g S_g')(X'₂X₂)^{-1}`; it does **not** fabricate a paper-Eq.6 deviation, so **no source correction is needed**. The misattribution was confined to the REGISTRY note (fixed in this PR).
+
+---
+
+## Implementation Notes
+
+### Data Structure Requirements
+- Long panel: unit id, time, outcome, and a cohort/event-date or absorbing-treatment column from which `D_gp` and relative time are derived. Panel need not be balanced; treatment must be staggered/absorbing.
+- Optional: time-varying covariates `X_gpit` (treatment-unaffected & effect-neutral per fn 9), cluster/PSU columns, survey weights.
+
+### Computational Considerations
+- Dominant cost = the Stage-1 FE solve on the untreated subsample (alternating-projection/within transform; `_iterative_fe`). Stage 2 is a small OLS on the residualized outcome.
+- The GMM variance needs the Stage-1→Stage-2 correction term (`γ̂' c_g`); the library precomputes `γ̂ = (X'₁₀WX₁₀)^{-1}(X'₁WX₂)` via a sparse factorization with a dense-`lstsq` fallback (both warn on near-singularity).
+
+### Tuning Parameters
+
+**Library-exposed knobs** (`TwoStageDiD.get_params()`):
+
+| Parameter | Type | Default | Notes |
+|-----------|------|---------|-------|
+| Estimand | (output) | overall ATT (eq. 4) | Results carry overall ATT eq. (4) + event-study eq. (6) (`horizon_max`, `pretrends`) + a cohort/group aggregation; selected at read-time, not a fit param |
+| `cluster` | column | unit (`cluster=None` → `unit`) | Appendix B clusters on `id`; the application clusters on state |
+| `covariates=` (fit arg, fn 9) | list | none | Time-varying covariates residualized in **both** stages (the fn-9 in-both-regressions approach) |
+| `anticipation` | int | 0 | Periods of anticipation (shifts the event date earlier) |
+| `n_bootstrap` / `bootstrap_weights` | int / str | 0 / rademacher | Optional multiplier bootstrap (library extension; analytical GMM SE is the default) |
+
+No regularization, bandwidth, factor-count, or cross-validation tuning (the estimator is unregularized OLS in each stage). **Paper-permitted variants NOT exposed:** the eq. (5) `P̄`-average / duration-restricted estimand and the fn-8 full-sample first-stage variant (see the ⚠️ row in the Requirements checklist).
+
+### Relation to Existing diff-diff Estimators
+- **`TwoStageDiD` (this estimator):** direct implementation of the §3 two-stage procedure + §3.3 GMM variance. The REGISTRY `## TwoStageDiD` section is the implementation contract; this review is its primary-source backing.
+- **`ImputationDiD` (Borusyak, Jaravel & Spiess 2024):** **point-estimate-identical** twin — both estimate unit/group+time FE on untreated observations and recover effects from observed-minus-counterfactual. They differ in **inference**: TwoStageDiD uses the GMM sandwich that folds in Stage-1 estimation error; ImputationDiD uses the conservative Theorem-3 variance. (BJS 2024 p. 3258 lists Gardner's estimator as coinciding with theirs for Gardner's estimand class.)
+- **`StackedDiD`:** the paper's §5/Appendix §6.1 "stacked" estimator is a *different* heterogeneity-robust device (group-specific datasets, dataset-specific FE); Gardner's Table 1–2 compare two-stage vs stacked vs aggregated.
+- **Reusable library machinery:** `_iterative_fe` / within-transform (Stage 1); `solve_ols` (Stage 2); `safe_inference` for the GMM SE → t/p/CI; the SpilloverDiD `_compute_gmm_corrected_meat` is the sibling GMM-correction machinery (same first-stage-uncertainty structure); CallawaySantAnna/ImputationDiD NaN convention for unidentified cells; multiplier bootstrap is a **library extension** — Gardner (2022) proposes **no** bootstrap (analytical GMM SEs only, §3.3 + Appendix B Stata GMM). R `did2s` likewise **defaults to analytical corrected clustered SEs** (`bootstrap = FALSE`, the same GMM sandwich); its block bootstrap is *optional* (`bootstrap = TRUE`), not a Gardner-paper prescription. The library's multiplier bootstrap mirrors the CallawaySantAnna/ImputationDiD pattern.
+
+---
+
+## Gaps and Uncertainties
+
+**No supplement-only GAP class** — the paper is self-contained (proofs in Appendix A, reference code in Appendix B, full tables in Appendix C). The reviewed PDF rendered cleanly; no `[UNREADABLE EQUATION]`. The only uncertainties are reconciliations against the existing REGISTRY:
+
+| Item | Finding | Disposition |
+|------|---------|-------------|
+| **REGISTRY "Note on Equation 6 discrepancy"** (`## TwoStageDiD`) claimed *the paper's Equation 6* uses a per-cluster inverse `(D_c'D_c)^{-1}` that our global-inverse code "deviates" from. | **Misattribution.** Paper **eq. (6) is the event-study regression specification**; the variance is **unnumbered** (§3.3) and uses the **global Jacobian inverse** `E[∂f/∂(λ,γ,β)]^{-1}` (Newey-McFadden Thm 6.1). There is **no per-cluster inverse** in the paper. Our implementation **matches** the paper (and `did2s`) — it is **not** a deviation. | **Corrected in this PR** (REGISTRY note rewritten to state faithfulness; the cluster-summed filling, previously mislabeled "Bread", relabeled "Meat"). The `two_stage.py` docstring (`:2842-2857`) was checked and is **already accurate** (it describes the code's global inverse + `did2s` match and gives the correct sandwich; no fabricated paper deviation) — **no source change needed**. |
+| **REGISTRY edge-case note** "the GMM sandwich variance accounts for Stage 1 estimation error (Gardner 2022, Theorem 1)". | **Misattribution.** Gardner (2022) has **no Theorem 1** (no numbered theorems at all). The Stage-1 correction comes from the joint-GMM reading + **Newey-McFadden (1994) Theorem 6.1**. | **Corrected in this PR** ("Gardner 2022, Theorem 1" → "Gardner 2022 §3.3; Newey-McFadden 1994, Theorem 6.1"). |
+| **Proposition-5 / "no never-treated → horizons ≥ H̄ unidentified"** guard in the library + REGISTRY. | Gardner (2022) does **not** prove this — it is a **Borusyak, Jaravel & Spiess (2024)** result (their Proposition 5). Gardner's native identification requirement is the weaker "untreated obs for each period." | **No change** — the REGISTRY already correctly attributes this guard to "Proposition 5 of Borusyak et al. (2024)" (lines 1411, 1414). Noted here for clarity: it is a twin-derived extension, not a Gardner result. |
+| **"No finite-sample multiplier" deviation note** (REGISTRY `## TwoStageDiD`). | **Accurate and correctly labeled.** Appendix B's `onestep`/`vce(cluster id)` GMM carries no `(n-1)/(n-p)` or `G/(G-1)` factor; the library matches (meat `= S'S`). | **No change** — verified faithful. |
+| **Version.** Reviewed arXiv:2207.05943**v1** (title page "April 2021"; arXiv-stamped 13 Jul 2022). | Equation numbers are this version's. The misattributions above are **independent of versioning**: even if a later version renumbered the variance to "(6)", the per-cluster-inverse claim is false regardless (the variance uses a global inverse). | Citation pinned to **Gardner (2022)** per `docs/references.rst`. |
+
+**PR-B deliverables** (tracked in `TODO.md` Testing/Docs): `tests/test_methodology_two_stage.py` (Stage-1/Stage-2/GMM-correction/always-treated Verified Components); `did2s` R parity fixture; then flip the `METHODOLOGY_REVIEW.md` `TwoStageDiD` row **In Progress → Complete**. (No `two_stage.py` source change is required — the docstring is already accurate; the only documentation defect, the REGISTRY misattributions, is fixed in this PR.)