Description
Type of Change
Bug fix (non-breaking change which fixes an issue)
Why
PR validation flakes intermittently on the E2E Fabric job
(PR (Tests E2E Test App Fabric X64Hermes)), with no usable debugging signal —
when the step fails the artifacts contain no crash or hang dump. Re-runs
eventually go green without a code change, which masks any real underlying
regression. Additionally, X86 has historically been validated only post-merge
on Continuous, which means X86-specific regressions cause a revert cycle
rather than a clean PR rejection.
This PR addresses the problem on three tracks:
Make E2E failures debuggable — register Windows Error Reporting
LocalDumps for the test app and node, capture full-memory dumps from
surviving processes after a failed step, install an in-process unhandled-
exception filter as the primary crash mechanism (hosted CI agents route WER
through a corporate-server policy that silently ignores per-exe LocalDumps),
and bundle matching PDBs and a debugging README into the Crash dumps - <job> artifact.
Fix the two known flake families that produced no dump (because there was
no crash to dump):
±1 px height drift on text-rendered SpriteVisuals — Composition
snaps a near-half-integer DWrite text measurement to either side of the
integer boundary on different commits. Hardened the dumpVisualTree
path to take multiple stable readings.
searchBox helper timeout ("Unable to enter correct search text into
test searchbox") in 8 component-test files plus RNTesterNavigation.ts. WinAppDriver's setValue falls back to
synthesized keystrokes for custom RN TextInputs, which append
rather than replace — so waitUntil retries make the state worse, not
better. Clearing the field before each setValue plus a faster retry
cadence resolves it.
Make transient install-step failures self-heal — yarn install / npx ... fetches occasionally die mid-flight on the hosted agent (on 0.84-stable this surfaces as midgard-yarn-strict exiting 57005 /
0xDEAD right after [2/2] Fetching packages…; PR build 630484 is the
reference case). The fetch helper is killed by the OS before yarn's own
limited retry path can engage, so a manual re-run is needed today. Wire
ADO's built-in retryCountOnTaskFailure: 2 on every install / init /
lage step so a single transient retries automatically before failing
the build.
What
Crash-dump collection mechanism
.ado/scripts/SetupLocalDumps.cmd — made idempotent (reg add /f),
parameterized on dump folder, registers the exe in AeDebug\AutoExclusionList
so WER wins over the JIT path.
.ado/templates/prepare-build-env.yml — new opt-in localDumpsExeNames
array parameter (default [] → no-op for existing callers); iterates and
registers each name with SetupLocalDumps.cmd. Also grants SYSTEM:(OI)(CI)F and Users:(OI)(CI)F ACLs on CrashDumpRootPath so the
WER service (LocalSystem) and packaged apps can write dumps there.
.ado/jobs/e2e-test.yml:
Passes [RNTesterApp-Fabric, node] to prepare-build-env.yml.
On test failure: a Capture dumps of surviving test processes step runs procdump64 -ma against any still-alive RNTesterApp-Fabric / node,
writing into $(CrashDumpRootPath)\hang\ (subfolder is required — files
written at the root were observed to disappear during the post-failure Update snapshots step on hosted agents).
A Collect in-process and fallback crash dumps step copies any
in-process minidumps from %ProgramData%\RNW-E2E-Dumps\ into $(CrashDumpRootPath)\in-process\, and scans common WER fallback
locations into $(CrashDumpRootPath)\recovered\.
A Bundle symbols and README with crash dumps step copies all *.pdb
from the test app's Release output into $(CrashDumpRootPath)\symbols\
(mirroring the deploy tree) and writes a README.md at the artifact
root with WinDbg instructions and _NT_SYMBOL_PATH wiring. Gated on
actual .dmp / .mdmp files existing — $(CrashDumpRootPath) doubles
as MSBUILDDEBUGPATH, so build-time MSBuild failure logs land there
too without needing symbols bundled.
Two opt-in pipeline parameters, both defaulted false, for re-validating
the crash and hang capture paths when an agent image change forces a
re-check: simulateCrashForTesting and simulateHangForTesting. The
crash path uses a sentinel file at %ProgramData%\rnw-e2e-simulate-crash.flag;
the hang path uses an env var RNW_SIMULATE_HANG=1 that gates a new HangSimulationTest.test.ts.
packages/e2e-test-app-fabric/windows/RNTesterApp-Fabric/RNTesterApp-Fabric.cpp:
InstallInProcessCrashDumpWriter() — top-level SetUnhandledExceptionFilter
that writes MiniDumpWithFullMemory | WithHandleData | WithThreadInfo | WithUnloadedModules | WithProcessThreadData to %ProgramData%\RNW-E2E-Dumps\RNTesterApp-Fabric-<timestamp>-<pid>.dmp,
then returns EXCEPTION_CONTINUE_SEARCH so the process still terminates
and any downstream handlers run.
MaybeSimulateCrashForTesting() — flag-file-gated null-pointer write for
crash-path validation.
HangForTesting automation command: posts Sleep(INFINITE) onto the UI
dispatcher, jamming the UI thread on the next work item (realistic
deadlock shape).
packages/e2e-test-app-fabric/test/HangSimulationTest.test.ts — opt-in
test (auto-skips unless RNW_SIMULATE_HANG=1) that drives HangForTesting
and lets the test step time out so the post-failure ProcDump capture path
has a hung packaged-app process to dump.
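The opt-in gate can be sketched as follows. This is an illustration only: `hangSimulationEnabled` and `pickRunner` are hypothetical names, and the actual test file may gate on `RNW_SIMULATE_HANG` differently. The runner arguments stand in for something like jest's `test` and `test.skip`, so the sketch does not depend on jest.

```typescript
// Hypothetical helper: the environment is passed in explicitly so the
// decision can be checked without touching the real process environment.
type Env = Record<string, string | undefined>;

// True only when the variable is set to exactly "1".
function hangSimulationEnabled(env: Env): boolean {
  return env["RNW_SIMULATE_HANG"] === "1";
}

// Picks the real runner when the simulation is enabled, the skipping
// runner otherwise (for example jest's `test` versus `test.skip`).
function pickRunner<T>(env: Env, run: T, skip: T): T {
  return hangSimulationEnabled(env) ? run : skip;
}

const chosenByDefault = pickRunner({}, "run", "skip");
const chosenWhenEnabled = pickRunner({ RNW_SIMULATE_HANG: "1" }, "run", "skip");
```

With an empty environment the skipping runner is chosen, so ordinary PR runs never reach the hang command.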
Snapshot dump stabilization
RNTesterApp-Fabric.cpp — DumpVisualTree now takes up to 3 dumps with
50 ms gaps and returns the first dump that matches its successor (i.e. two
consecutive dumps stringify identically). Targets composition's per-commit
rounding non-determinism on text-derived Visual::Size values (~24.5 → 24
vs 25 across commits). No client / test / snapshot changes; ~100 ms added
per dumpVisualTree call.
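The "first dump that matches its successor" rule can be sketched in a few lines. The real change lives in C++ inside DumpVisualTree; the TypeScript below is only a model of the selection logic. `takeStableDump`, its parameters, and the fallback to the last reading when no two consecutive dumps agree are assumptions, not taken from the PR.

```typescript
// Returns the first dump that is identical to the dump taken right after it.
function takeStableDump(
  takeDump: () => string,
  maxDumps: number = 3,
  pause: () => void = () => {},
): string {
  let previous = takeDump();
  for (let taken = 1; taken < maxDumps; taken++) {
    pause(); // stands in for the 50 ms gap between dumps
    const current = takeDump();
    if (current === previous) {
      return previous;
    }
    previous = current;
  }
  // Assumption: if no two consecutive dumps agree, fall back to the last one.
  return previous;
}

// Demo: a height that flips from 24 to 25 and then settles.
const readings = ['{"height":24}', '{"height":25}', '{"height":25}'];
let readIndex = 0;
const stable = takeStableDump(() => {
  const value = readings[Math.min(readIndex, readings.length - 1)] ?? "";
  readIndex += 1;
  return value;
});
```

In the demo all three dumps are taken and the settled value is returned. Three dumps mean at most two pauses, which lines up with the roughly 100 ms worst-case cost quoted above.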
searchBox helper flake
Same flake-prone pattern duplicated across 9 sites. Updated all of them with
a single fix:
Added await searchBox.clearValue(); before setValue() inside the poll
callback. Without the clear, retries append to existing text and the getText() === input comparison never converges.
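Bumped timeout: 5000 → 10000 and reduced interval: 1500 → 500 for more
retries within a longer window.
Files: TextInputComponentTest.test.ts, AccessibilityTest.test.ts, ButtonComponentTest.test.ts, FlatListComponentTest.test.ts (×2 helpers), PointerButtonComponentTest.test.ts, SwitchComponentTest.test.ts, TouchableComponentTest.test.ts, ViewComponentTest.test.ts, RNTesterNavigation.ts (inline poll in goToExample).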
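The convergence problem can be illustrated with a small simulation. This is not the repository's helper: `FakeSearchBox` and `enterSearchText` are hypothetical names, the real helper is asynchronous and polls through waitUntil, and the dropped keystroke on the first attempt is an assumed trigger chosen only to make the failure deterministic.

```typescript
// Hypothetical stand-in for a search box whose setValue arrives as
// synthesized keystrokes: new text is appended to whatever is already there.
class FakeSearchBox {
  private text = "";
  private calls = 0;

  setValue(input: string): void {
    this.calls += 1;
    // Assumption for the demo: the first attempt loses its last keystroke.
    const typed = this.calls === 1 ? input.slice(0, -1) : input;
    this.text += typed;
  }

  clearValue(): void {
    this.text = "";
  }

  getText(): string {
    return this.text;
  }
}

interface EntryResult {
  ok: boolean;
  attempts: number;
  finalText: string;
}

// Synchronous simplification of the poll callback: retry until the box
// holds exactly `input`, or give up after `maxAttempts` tries.
function enterSearchText(
  box: FakeSearchBox,
  input: string,
  clearFirst: boolean,
  maxAttempts: number,
): EntryResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (clearFirst) {
      box.clearValue();
    }
    box.setValue(input);
    if (box.getText() === input) {
      return { ok: true, attempts: attempt, finalText: box.getText() };
    }
  }
  return { ok: false, attempts: maxAttempts, finalText: box.getText() };
}

// 10000 ms at a 500 ms interval allows about 20 polls; 5000 ms at 1500 ms allowed about 4.
const withoutClear = enterSearchText(new FakeSearchBox(), "View", false, 20);
const withClear = enterSearchText(new FakeSearchBox(), "View", true, 20);
```

Without the clear, one bad attempt poisons every later one: the text only grows, so the equality check can never pass. With the clear, the second attempt starts from an empty field and succeeds.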
Install / init / lage step retry
Yarn's own retry path (Yarn classic / Berry, and the midgard-yarn-strict
fork still in use on backport branches) only auto-retries on a small set
of network errors (ECONNRESET / ESOCKETTIMEDOUT / ETIMEDOUT / ENOTFOUND). Other transient failures — including a fetch helper killed
mid-flight (the observed mode on 0.84-stable, exit code 57005 / 0xDEAD,
PR build 630484) — bypass that retry path entirely and propagate straight
up. ADO supports retryCountOnTaskFailure: 2 at the step level for
exactly this case.
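The gap between the two retry layers can be sketched as a simple classification. The error-code list is the one named above; `coveredByYarnRetry` and `formatExitCode` are hypothetical helpers written for illustration and are not part of yarn or of this PR.

```typescript
// Network errors that, per the description above, yarn retries on its own.
const YARN_RETRIED_NETWORK_ERRORS: string[] = [
  "ECONNRESET",
  "ESOCKETTIMEDOUT",
  "ETIMEDOUT",
  "ENOTFOUND",
];

// True when a failure carries one of the error codes yarn retries itself.
// A helper killed mid-flight carries no such code, so it is not covered.
function coveredByYarnRetry(errorCode: string | undefined): boolean {
  return errorCode !== undefined && YARN_RETRIED_NETWORK_ERRORS.indexOf(errorCode) !== -1;
}

// Renders an exit code in decimal and hexadecimal.
function formatExitCode(code: number): string {
  return `${code} / 0x${code.toString(16).toUpperCase()}`;
}

const observedExit = formatExitCode(57005);
const killedHelperCovered = coveredByYarnRetry(undefined);
const resetCovered = coveredByYarnRetry("ECONNRESET");
```

`observedExit` renders as `57005 / 0xDEAD`, the same exit code written both ways, and the killed-helper case falls outside yarn's own retry set, which is the gap the step-level retry closes.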
Added to every step on main that fetches from the npm registry:
.ado/build-template.yml: Strict yarn install + Build prepare-release and beachball-config.
.ado/prepare-release-bot.yml: yarn install + Build prepare-release and dependencies.
.ado/templates/strict-yarn-install.yml and .ado/templates/yarn-install.yml (the canonical per-build install).
.ado/templates/react-native-init-windows.yml: creaternwapp.cmd and creaternwlib.cmd init steps (each runs ~6 npm/yarn fetches internally).
Cost when the install passes first try: zero. When it flakes, ADO retries
the step up to twice before failing the build. Each retry is visible in the
pipeline UI as an explicit attempt, so genuine deterministic failures still
surface clearly within ~1 minute instead of being masked by a manual re-run cycle.
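The attempt accounting can be modeled with a short sketch. It is not pipeline code: `runWithRetries` and `StepOutcome` are hypothetical names, and the model only assumes that a retry count of 2 means one initial attempt plus at most two retries, stopping at the first success.

```typescript
interface StepOutcome {
  succeeded: boolean;
  attempts: number;
}

// Models a step-level retry count: one initial attempt plus up to
// `retryCount` retries, stopping at the first success.
function runWithRetries(
  step: (attempt: number) => boolean,
  retryCount: number,
): StepOutcome {
  const maxAttempts = 1 + retryCount;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (step(attempt)) {
      return { succeeded: true, attempts: attempt };
    }
  }
  return { succeeded: false, attempts: maxAttempts };
}

// A healthy install, a single transient failure, and a deterministic failure.
const healthy = runWithRetries(() => true, 2);
const transient = runWithRetries((attempt) => attempt > 1, 2);
const deterministic = runWithRetries(() => false, 2);
```

The healthy case uses one attempt, the transient case recovers on the second, and the deterministic case still fails after three attempts, so a real breakage is delayed but not hidden.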
Spell-check enforcement
.cspell.json — "language": "en" → "language": "en-US". The broader en dictionary accepts both British and American spellings, which let
inconsistencies (behaviour, synthesised) drift in via reviewer / AI
contributions without surfacing as warnings. en-US flags those forms in
the IDE so they get caught at edit time.
PR validation matrix — add X86Hermes
.ado/jobs/e2e-test.yml — added X86Hermes to the PullRequest
buildMatrix so it now matches the Continuous matrix. Previously
X86Hermes was deferred to post-merge (since PR #8957, "Revert expansion of PR flavors", in Oct 2021).
Aligns the gating with what 0.84-stable and earlier release branches
already enforce on PR, and means an X86-specific regression now blocks
merge instead of triggering a revert cycle. Trade-off: roughly doubles
E2E job wall time on PR runs and the agent capacity used per PR. We
judge that worth it given how often E2E flakes have hidden real
regressions; happy to discuss if reviewers disagree.
Screenshots
N/A — pipeline / native / test changes only.
Testing
The crash-dump mechanism, multi-dump, and searchBox fixes were validated
end-to-end on the equivalent 0.84-stable PR (#16045), which contains the same
code. Reproducing here for reviewer convenience:
Crash simulation (build 630442): simulateCrashForTesting=true → MaybeSimulateCrashForTesting reads the sentinel flag and dereferences a
null pointer at startup → InstallInProcessCrashDumpWriter's UEF writes
full-memory .dmp files (~32 MB each) to %ProgramData%\RNW-E2E-Dumps\ → diagnostic step copies them into the
artifact under in-process/.
Hang simulation (build 630470): simulateHangForTesting=true → HangForTesting posts Sleep(INFINITE) onto the UI dispatcher → jest test
step times out → post-failure ProcDump captures full-memory dumps of the
still-alive packaged app (~250 MB each) under hang/. Confirmed dumps
ride to the artifact intact for both X64Hermes and X86Hermes. With the
matrix change in this PR, both architectures now run on this PR's own
validation as well.
The snapshot multi-dump fix is observed in build 630476: all 828 snapshots
passed across the suite. The searchBox fix targets the failure mode of
that same build (TextInput triggers onPressIn and updates state text →
"Unable to enter correct search text into test searchbox" at 5095 ms);
PR validation on this branch is the first run with the fix in place.
The crash-dump artifact format is documented inline in $(CrashDumpRootPath)\README.md, written by the bundle step.
Changelog
Should this change be included in the release notes: no
This is internal CI / test infrastructure. No runtime impact for consumers
of react-native-windows. The only product-code change is the in-process
crash-dump writer in the E2E test app (RNTesterApp-Fabric), which is not
shipped.