From 873cde3d5904250e283d24eabc578b272c0956ec Mon Sep 17 00:00:00 2001
From: Shahzaib
Date: Sun, 28 Jun 2026 21:56:46 -0700
Subject: [PATCH] Add release-monitoring-report skill

First check-in of the release-monitoring-report skill: a
version-over-version release-health report generator for the Android
Broker and Authenticator. Includes KQL query catalog, PowerShell/Node
helpers (run-kql, bootstrap, validate, compare-versions,
find-suspect-prs, fetch-appcenter-crashes), diagnostic-pattern +
crash-source docs, and the HTML report template.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../skills/release-monitoring-report/SKILL.md | 371 +++++++
 .../assets/docs/crash-sources.md | 371 +++++++
 .../assets/docs/investigation-patterns.md | 228 +++++
 .../assets/docs/kusto-cheatsheet.md | 132 +++
 .../assets/queries/README.md | 147 +++
 .../auth-pn-checkforauth-completion.kql | 41 +
 .../auth-reacted-notification-split.kql | 34 +
 .../queries/auth-scenario-initiates.kql | 21 +
 .../queries/auth-scenario-success-rate.kql | 38 +
 .../assets/queries/auth-stats.kql | 43 +
 .../assets/queries/auth-version-resolve.kql | 22 +
 .../authenticator-crash-denominator.kql | 22 +
 .../assets/queries/broker-adoption.kql | 15 +
 .../assets/queries/broker-by-host-app.kql | 56 ++
 .../queries/broker-error-rate-by-version.kql | 27 +
 .../broker-errors-by-host-app-span.kql | 48 +
 .../queries/broker-latency-by-version.kql | 22 +
 .../queries/broker-reliability-by-version.kql | 44 +
 .../queries/broker-top-errors-by-host-app.kql | 42 +
 .../queries/broker-top-errors-by-version.kql | 36 +
 .../assets/scripts/bootstrap-report.ps1 | 140 +++
 .../assets/scripts/compare-versions.js | 134 +++
 .../assets/scripts/fetch-appcenter-crashes.js | 482 +++++++++
 .../assets/scripts/find-suspect-prs.ps1 | 260 +++++
 .../assets/scripts/run-kql.ps1 | 101 ++
 .../assets/scripts/validate-report.ps1 | 116 +++
 .../assets/templates/report-template.html | 924 ++++++++++++++++++
 27 files changed, 3917 insertions(+)
 create mode 100644 .github/skills/release-monitoring-report/SKILL.md
 create mode 100644 .github/skills/release-monitoring-report/assets/docs/crash-sources.md
 create mode 100644 .github/skills/release-monitoring-report/assets/docs/investigation-patterns.md
 create mode 100644 .github/skills/release-monitoring-report/assets/docs/kusto-cheatsheet.md
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/README.md
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/auth-pn-checkforauth-completion.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/auth-reacted-notification-split.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/auth-scenario-initiates.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/auth-scenario-success-rate.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/auth-stats.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/auth-version-resolve.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/authenticator-crash-denominator.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-adoption.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-by-host-app.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-error-rate-by-version.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-errors-by-host-app-span.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-latency-by-version.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-reliability-by-version.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-host-app.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-version.kql
 create mode 100644 .github/skills/release-monitoring-report/assets/scripts/bootstrap-report.ps1
 create mode 100644 .github/skills/release-monitoring-report/assets/scripts/compare-versions.js
 create mode 100644 .github/skills/release-monitoring-report/assets/scripts/fetch-appcenter-crashes.js
 create mode 100644 .github/skills/release-monitoring-report/assets/scripts/find-suspect-prs.ps1
 create mode 100644 .github/skills/release-monitoring-report/assets/scripts/run-kql.ps1
 create mode 100644 .github/skills/release-monitoring-report/assets/scripts/validate-report.ps1
 create mode 100644 .github/skills/release-monitoring-report/assets/templates/report-template.html

diff --git a/.github/skills/release-monitoring-report/SKILL.md b/.github/skills/release-monitoring-report/SKILL.md
new file mode 100644
index 00000000..2dc71cb7
--- /dev/null
+++ b/.github/skills/release-monitoring-report/SKILL.md
@@ -0,0 +1,371 @@
+---
+name: release-monitoring-report
+description: Generate a version-over-version release-health report for the Android Broker and/or the Authenticator app as one polished self-contained HTML file. Use this skill when monitoring a rollout — triggers include "monitor this release", "release health report", "broker rollout health", "authenticator rollout", "is this release safe to widen", "what's changing this release and why", "crash regression", "stability report", or "compare a new version vs a previous version". Accepts a Broker version and/or an Authenticator version being rolled out, plus a previous/baseline version per app (or an all-versions baseline), runs Kusto queries for each app to quantify what changed and why, adds an Authenticator crash/stability layer from App Center (crashes per 1k active devices), and writes the HTML to the android-release-reports folder under the user's home directory (outside the workspace so reports are never committed).
+--- + +# Release Monitoring Report + +Produce a **version-over-version** release-health report covering the Android **Broker** and/or +the **Authenticator** app in one self-contained HTML file at +`$env:USERPROFILE\android-release-reports\release-report-broker--auth--.html` +(home folder, **outside the workspace** — reports never get committed). Omitted apps drop out +of the filename, so broker-only or authenticator-only runs are first-class. + +The report answers two questions per app: **what changed** this release (KPIs + tables vs the +baseline) and **why** (error-code movers / per-scenario deltas). It ends in a clear **verdict** +per app: SAFE to proceed / WATCH / HOLD. + +The output mirrors the canonical template at +[`assets/templates/report-template.html`](assets/templates/report-template.html) — a real filled +example kept as the structural + visual reference. The Step 1 bootstrap script copies it into +`~/android-release-reports/…` and **you edit it in place** (versions, dates, KPI values, table +rows, verdict prose). Do **not** redesign the layout. + +**Before writing any KQL, read the three reference files:** +- [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md) — how to run queries + (auth, `run-kql.ps1` invocation, output JSON shape, token substitution, the end-to-end loop, + version-resolution recipes, hard gotchas). +- [`assets/queries/README.md`](assets/queries/README.md) — the query catalog: what each `.kql` + computes, the token convention, and the Authenticator scenario→MV→column map. +- [`assets/docs/investigation-patterns.md`](assets/docs/investigation-patterns.md) — **the diagnostic + methodology**: the count-vs-rate / version-attribution / code-frozen-control / rollout-cohort / + benign-vs-real / dimensional-decomposition / drill-to-sub-code / release-tag-diff / MV-introspection / + crash-version-attribution (P10) patterns, plus the decision flow. 
A KPI delta — or a crash that looks + new — is a question, not a verdict; run these before any table row becomes "regression" prose or a HOLD. + +**For the Authenticator crash/stability layer, also read:** +- [`assets/docs/crash-sources.md`](assets/docs/crash-sources.md) — App Center crash pull (auth/token, + `errorGroups` endpoint, the `nextLink` quirk), the **share-vs-rate trap**, the Kusto rate + denominator, secret handling, and the deferred Play Console (Phase 2). Authenticator only — the + Broker is not a store app and has no crash section. + +## Inputs to confirm + +Ask only for what's missing; infer the rest. + +1. **Which app(s)** — Broker version rolling out, Authenticator version rolling out, or both. + At least one is required. +2. **Baseline per app** — the previous release version to compare against, OR "auto-resolve the + previous version" (pick the next-highest-volume recent version), OR "all-versions baseline". +3. **Window** — default last **30 days** (Authenticator MVs) / **14 days** (Broker). The new + version is usually young, so a longer window mostly grows the baseline cohort. +4. **Report date** — defaults to today (used in the filename + the "Generated" banner). +5. **Authenticator crash layer (optional)** — if an Authenticator version is given and an App Center + read-only token is available (`~/.android-release-reports/appcenter.token`, `$APPCENTER_API_TOKEN`, + or `--token-file`), add the crash/stability section. Skip silently if no token. The token is a + **secret** — never echo it or write it into the report. + +If the user gives versions but not the baseline, run the resolution queries first and propose +`` (rolling out) / `` (previous) from volume, then continue. 
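
The baseline auto-resolve rule above (propose the previous version from volume when the user did not name one) can be sketched as follows. This is a minimal sketch under stated assumptions, not the skill's actual resolver: it assumes the resolution query's payload has already been loaded as an array of row objects, the column names `version` and `devices` are hypothetical stand-ins for whatever the resolution queries return, and "next-highest-volume recent version" is read here as "the highest-volume version older than the one rolling out". The 1,000 volume floor is likewise an illustrative default.

```javascript
// Hypothetical helper: propose a baseline version from per-version volume rows.
// Assumption: each row looks like { version: "16.1.0", devices: 123456 }.

// Split "16.1.0" or "v6.2605.3042" into numeric parts (a leading "v" is dropped).
function parseVersion(v) {
  return String(v).replace(/^v/, "").split(".").map(Number);
}

// Numeric, part-by-part comparison: negative if a < b, positive if a > b, 0 if equal.
function compareVersions(a, b) {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] || 0) - (pb[i] || 0);
    if (d !== 0) return d;
  }
  return 0;
}

// Pick the highest-volume version strictly older than newVersion,
// ignoring versions below the volume floor. Returns null if none qualifies.
function proposeBaseline(rows, newVersion, volumeFloor = 1000) {
  const candidates = rows
    .filter((r) => compareVersions(r.version, newVersion) < 0)
    .filter((r) => r.devices >= volumeFloor)
    .sort((a, b) => b.devices - a.devices);
  return candidates.length ? candidates[0].version : null;
}
```

The proposal is only a starting point: the text above still has the user confirm the rolling-out/previous pair before any further queries run.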
+ +## Data sources (summary — full detail in the cheatsheet) + +| App | Cluster | Database | Version dim | Time col | +|-----|---------|----------|-------------|----------| +| Broker | `https://idsharedeus2.kusto.windows.net` | `ad-accounts-android-otel` | `broker_version` (`16.1.0`) | `EventInfo_Time` | +| Authenticator | `https://idsharedeus2.eastus2.kusto.windows.net` | `d496be22d62a46b0a3cf67ea2e736fd8` | `AppVersion` (`6.2606.3817`) | `EventDate` (MVs) | + +`run-kql.ps1` defaults to Broker; pass `-Cluster`/`-Database` for Authenticator. Requires +`az login` (Android Auth Client SDK security group). + +## Bundled assets + +| File | Purpose | +|---|---| +| [`templates/report-template.html`](assets/templates/report-template.html) | Canonical layout — a real filled example. **Edit in place**; do not restyle. The CSS in `` is canonical. | +| [`docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md) | Operational runbook: run-kql usage, JSON shape, token substitution, end-to-end loop, gotchas. | +| [`docs/investigation-patterns.md`](assets/docs/investigation-patterns.md) | **Diagnostic methodology** — count-vs-rate, version-attribution-vs-substitution, code-frozen control, rollout-cohort, benign-vs-real classification, dimensional decomposition, drill-to-sub-code, release-tag diff, MV introspection, **crash version-attribution (P10 — is a crash new/release-caused?)**, plus the Authenticator drill ladder (outcomes MV → `*_Errors_MV_V1` → raw `passkeyoperations`) and a decision flow. Apply before calling any delta a regression. | +| [`docs/crash-sources.md`](assets/docs/crash-sources.md) | Authenticator crash layer: App Center auth/endpoint, the three gotchas (`errorGroupId`-is-version-scoped, share-vs-rate, **`firstOccurrence`-is-rollout-date**), the **`newcrashes`/`signature` new-crash detection** flow, Kusto rate denominator, secret handling, Play Console Phase 2. 
| +| [`queries/README.md`](assets/queries/README.md) | Query catalog + Authenticator scenario→MV→column map. | +| [`queries/*.kql`](assets/queries/) | 8 Broker + 6 Authenticator scenario templates + `authenticator-crash-denominator.kql` (crash-rate denominator), all validated live. Substitute `` before running. Includes `broker-errors-by-host-app-span.kql` — the per-`span_name` request-rate drill-down that complements the host-app device-share movers. | +| [`scripts/run-kql.ps1`](assets/scripts/run-kql.ps1) | Direct-REST Kusto helper. `-Query`/`-Out` mandatory; `-Cluster`/`-Database` for Authenticator. | +| [`scripts/find-suspect-prs.ps1`](assets/scripts/find-suspect-prs.ps1) | Release PR correlation. **Auth-code** (eSTS) attribution: defaults to `broker/`+`common/` over a broker tag range (`-Range v16.1.0..v16.2.0`; broker uses its own tags, common maps via the broker submodule pointer). **Crash** attribution: `-Repos authenticator` over the app tag range (`-Range 6.2606.3817..6.2606.4029`) — resolves the app's own tags and parses ADO `Merged PR NNNNNNNN:` → pullrequest URLs. Three search streams: `-S` pickaxe + `-DiffGrep` (`git log -G` over diff text) + `--grep` (subject). **For a crash, set `-Symbol` to the exception/API token from the stack (e.g. `EntryPoints.get`), not the crashing class, and always pass `-DiffGrep`** — the crashing class is the victim, the culprit is a caller whose subject rarely names the subsystem. Prints PR ids + URLs for attribution cards. | +| [`scripts/fetch-appcenter-crashes.js`](assets/scripts/fetch-appcenter-crashes.js) | Pull Authenticator crash clusters from App Center → run-kql array-form JSON. 
`groups` (one version) + `diff` (two versions, signature-joined, per-1k rate when given Kusto denominators) + `enrich` (top signatures' daily **trend** + instance-sampled **OS-major/device-model** concentration) + **`newcrashes`** (genuinely-new java-frame signatures via anti-join against a **union of priors**, native/hex frames split out as `new-native?`) + **`signature`** (cross-version presence of one signature + trend — "is crash X version-specific?"). Captures `exceptionMessage`/`appCodeFrame`/`firstOccurrence`, drops `hidden`/`Ignored` groups, and `--page-cap 0` exhausts paging for an accurate total. | +| [`scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1) | Copy the template to a version-named file, create `_data/-/`, stamp the Generated date, prune old `_data`, detect unfilled-stub vs real-report collisions. | +| [`scripts/compare-versions.js`](assets/scripts/compare-versions.js) | Delta + classification engine over run-kql JSON. `rows` mode (version-per-row metrics) and `movers` mode (paired error-share rows). Thresholds + volume guard. | +| [`scripts/validate-report.ps1`](assets/scripts/validate-report.ps1) | Pre-publish validator: stale tokens, mojibake, raw-count leaks, version-string presence, both app sections, verdict callouts. | + +## Workflow + +### Step 1 — Bootstrap +```powershell +$S = ".github/skills/release-monitoring-report/assets/scripts" +$out = & "$S\bootstrap-report.ps1" -BrokerVersion 16.1.0 -AuthVersion 6.2606.3817 +``` +Omit either `-…Version` for a single-app run. The script prints the report path and creates the +`_data/-/` folder (`$DATA`) for raw query payloads. Re-running on an unfilled stub +is silent; a populated report halts unless you pass `-Force`. + +### Step 2 — Resolve versions +If the baseline wasn't given, run `broker-adoption.kql` and the cheap Authenticator resolver +(cheatsheet § "Version resolution recipes") to pick ``/`` by volume. 
For an +"all-versions baseline", drop the version filter / list every version in ``. + +### Step 3 — Pull queries (both apps) +For each needed `.kql`: read it, substitute tokens (cheatsheet § "Filling tokens"), run via +`run-kql.ps1 -Out $DATA\.json`. Run independent queries in parallel (PowerShell jobs). +Minimum useful set: +- **Broker:** `broker-adoption`, `broker-reliability-by-version`, `broker-error-rate-by-version`, + `broker-top-errors-by-version` (the "why"), `broker-latency-by-version`. +- **Broker via Authenticator (when an Authenticator version is given):** also run + `broker-by-host-app` and `broker-top-errors-by-host-app` with `=com.azure.authenticator` + and ``/`` = the **Authenticator app versions** (these MVs key the host's broker on + `AppInfo_Version`, which for that package is the app version). This isolates whether the broker + regresses *because of this app rollout*, separate from the fleet-wide `broker_version` view. +- **Authenticator:** `auth-version-resolve` (or cheap fallback), then per top scenario: + `auth-scenario-success-rate` (Registration/Auth) or `auth-pn-checkforauth-completion` (PN+CFA), + always alongside `auth-scenario-initiates` for the volume guard. Use `auth-stats` for adoption. + +### Step 3b — Authenticator crash layer (optional) +If an Authenticator version is given and an App Center token is available, **read +[`assets/docs/crash-sources.md`](assets/docs/crash-sources.md) first**, then: +1. Run `authenticator-crash-denominator.kql` (auth cluster) to get active devices for ``/``. +2. Pull + pair crashes (signature-joined; pass the two device counts so it computes the per-1k rate). 
+ Use **`--page-cap 0`** for a verdict so the crash total isn't undercounted (there's no aggregate-total + endpoint — the total is only the pages fetched, and a busy version has 1,300+ groups): +```powershell +node "$S\fetch-appcenter-crashes.js" diff --owner authapp-t7qc ` + --app Microsoft-Authenticator-Android-Prod-App-Center ` + --version 6.2606.3817 --base 6.2605.3042 --days 14 --page-cap 0 ` + --devices-new --devices-base --out "$DATA\crash-diff.json" +``` + The diff now also carries `exceptionMessage`, `appCodeFrame` (`class.method:line`), `firstOccurrence`, + `appBuild`, `state`, and **drops team-muted `hidden`/`Ignored` groups** by default — fill these into + the crash cards. **`diff status=new` is only a CANDIDATE, not a verdict** — it means "absent from the + single `--base`," and `firstOccurrence` is the version's **rollout date**, not the signature's + app-history first-seen (a years-old crash shows a first-seen inside your window). Confirm any "new" + with step 4 below. +3. **Enrich the top movers** — the list view can't show a crash's **trend** or **which OS/OEM** it hits, + so pull both (these answer the P4/P6 patterns for crashes): +```powershell +node "$S\fetch-appcenter-crashes.js" enrich --owner authapp-t7qc ` + --app Microsoft-Authenticator-Android-Prod-App-Center ` + --version 6.2606.3817 --days 14 --top 8 --out "$DATA\crash-enrich.json" +``` + It returns, per top signature, a **daily-trend tag** (`rising`/`decaying`/`spike-then-decay`/`steady` + + peak/last day) and an instance-sampled **top OS-major + device-model concentration** (the aggregate + OS/model endpoints 404, so this samples `errorGroups/{id}/errors`). Read it as: OS-concentrated + + rising + broad-across-models ⇒ platform/release suspect; model-concentrated on rugged/obfuscated + frames ⇒ tamper/sideload, not app quality; spike-then-decay ⇒ early-rollout churn, not a HOLD. +4. 
**Find genuinely-new crashes (P10)** — to answer "what crashes are new in this release," anti-join + the new build's signatures against a **union of recent priors** (not just the baseline — `diff`'s + single-baseline `new` flag false-positives a crash that skipped one version). Then cross-version- + confirm any suspect with `signature`: +```powershell +# genuinely-new = java-frame signature absent from ALL priors in the 27-day window (native/hex frames +# are build-unique → tagged new-native?, judged by enrich not the anti-join). Triage on newDevices too. +node "$S\fetch-appcenter-crashes.js" newcrashes --owner authapp-t7qc ` + --app Microsoft-Authenticator-Android-Prod-App-Center ` + --version 6.2606.3817 --priors 6.2605.3042,6.2605.2973,6.2604.2550,6.2603.1485 ` + --days 14 --min-count 5 --devices-new --out "$DATA\new-crashes.json" +# is THIS crash version-specific or pre-existing? cross-version presence + trend on the new build: +node "$S\fetch-appcenter-crashes.js" signature --owner authapp-t7qc ` + --app Microsoft-Authenticator-Android-Prod-App-Center ` + --version 6.2606.3817 --priors 6.2605.3042,6.2605.2973,6.2604.2550,6.2603.1485 ` + --match "" --days 27 --trend --out "$DATA\sig.json" +``` + Read `newcrashes` as: a `genuinely-new` **java** frame that is OS-broad and holds a steady gap is a + real release suspect (prove it against the `authenticator/` tag diff, P8); a `new-native?` row, a + `newDevices=1` crash-loop, or any signature `signature` finds still firing on prior versions is + **not** release-caused. A young build inflates per-1k via a small device denominator — read the raw + **count** beside the rate. + +**Lead with the per-1k rate, not crash-share** — a signature can take a bigger share of a smaller +crash pool while its per-device rate falls (share alone invents phantom regressions). Skip this +step silently if no token exists. + +5. 
**Attribute a confirmed new/rising crash to its code (P10 step 6).** Once a signature is confirmed + genuinely-new (or a cross-version-confirmed rising per-1k gap) AND first-party + fleet-broad — not a + `new-native?`, tampered, single-OEM, or pre-existing frame — find the change that introduced it. + **Search the EXCEPTION TOKEN, not the crashing class:** a crash frame names the object the runtime was + inspecting (the *victim*), which is usually not the file that broke — a *caller* handed it to a failing + API. Set `-Symbol` to the exception/API token from the stack (`EntryPoints.get`, `GeneratedComponent`), + and **always** add `-DiffGrep` (a `git log -G` diff-text search) because `-GrepRegex` matches only the + commit subject and a culprit PR rarely names the subsystem it broke. Map the frame's package to its repo + (`com.microsoft.authenticator.*`/`bastion.*`/`onlineid.*` → `authenticator/`; `identity.common…` → + `common/`; broker → `broker/`; a `dagger.*`/`androidx.*` framework frame → the first-party caller in the + app repo) and correlate over the **app's own tag range** (`find-suspect-prs.ps1` resolves app tags + directly and parses ADO `Merged PR NNNNNNNN:` → pullrequest URLs): +```powershell +& "$S\find-suspect-prs.ps1" -Repos authenticator -Range 6.2606.3817..6.2606.4029 ` + -Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints|GeneratedComponent' +# secondary: path-log the DI graph (a module changing what it provides breaks consumers silently): +git -C authenticator log --oneline 6.2606.3817..6.2606.4029 -- **/di/ **/dagger/ **/*Module.kt +``` + Emit a crash `code-attr` card in the **`#auth-stability`** section (Originator / Mechanism / Release + range / Likely PRs with honest confidence / Next step), `origin-app` for a confirmed first-party + regression. 
**Environmental is the LAST resort:** only after the exception-token pickaxe, the + `-DiffGrep` scan, AND the DI path log all come back empty consider an **OS-major × build-config** + interaction (shrinker/`targetSdk`/Play Services) — and an OS-concentration signal (e.g. "66.7% Android + 16") is **not** evidence for it (a real caller bug concentrates on the newest OS too). Verified the hard + way: a `-Symbol MfaAuthDialogActivity` search missed the `dagger.hilt.EntryPoints.get` culprit and + tempted a wrong "Android-16 × shrinker" verdict; `-Symbol 'EntryPoints.get' -DiffGrep` found PR 15896454 + ("TOTP Secret Fix") instantly. See `crash-sources.md` § "Crash → PR / code attribution" for the full + frame→repo map and the verified `dagger.hilt`/Android-16 example. + +### Step 3c — Broker-via-Authenticator span drill-down + release PR correlation (conditional) +The host-app movers table (`broker-top-errors-by-host-app`) is a **device-share** (devices hitting a +code anywhere ÷ devices on that version), which dedups a device across all spans and **masks a +per-span request-rate rise**. So a code can look flat-to-down there while it is climbing inside one +span (e.g. `invalid_grant` / `interaction_required` on the **silent** path). Whenever a server-returned +auth code is suspected — or proactively for `invalid_grant`/`interaction_required` — drill down: +1. Run `broker-errors-by-host-app-span.kql` with `=com.azure.authenticator`, + ``/`` = the app versions, and `` = the suspect codes (lower-cased, + comma-separated). It returns per-`span_name` **request-level** rates (errored ÷ total in that span). +2. If a span rate is up, separate **early-rollout churn** from a real regression with a daily trend + (rate by version by day): an upgrade spike *decays* toward baseline as the cohort re-auths; a true + regression holds a steady gap. Report the residual, not just the headline window delta. +3. 
**Release PR correlation** (for codes whose Originator is **eSTS** — `invalid_grant`, + `interaction_required`, `unauthorized_client`, etc. — the broker/common change is the *trigger*, + not the source of the string). The window is the **bundled broker version range**, not a date + window. Read `git log v..v` in `broker/` and `common/` **in full** (the + range between two releases is small — tens of commits), then narrow with the pickaxe: +```powershell +& "$S\find-suspect-prs.ps1" -Symbol generateAsymmetricKey -GrepRegex Asymmetric ` + -Range v16.1.0..v16.2.0 -RepoRoot "" +``` + Weight **device-PoP / PRT / token-cache** changes for silent-auth credential rejections (re-keying + or PRT churn makes eSTS reject device-bound RTs with `invalid_grant`, decaying as devices re-bind). + Exclude PRs that address a *different* code, are a *fix that reduces* errors, or are gated to an + SdkType the host app does not use (e.g. `MSAL_CPP` = OneAuth, not the Authenticator app's own MSAL). + Emit a `code-attr` attribution card (Originator / Mechanism / Release range / Likely PRs with honest + confidence / Next step) in the `#auth-broker` section. + +### Step 3d — Diagnose an Authenticator scenario move before you verdict +A KPI delta (a success-rate drop, an error-count or error-share rise) is a **question**, not a verdict. +Before any scenario row becomes "regression" prose or a HOLD, run the diagnostic ladder from +[`assets/docs/investigation-patterns.md`](assets/docs/investigation-patterns.md). Climb only as far as +the question needs: +1. **Normalize, then attribute (P1–P3).** Convert counts to a **rate per initiation** (Errors-MV + `ErrorCount` ÷ outcomes-MV `Initiated`, matched by version/day). Compare the **new-build rate vs the + old-build rate** — equal rates with a rising new-build count is rollout **substitution**, not a + regression. 
Then apply the **code-frozen control**: recompute the same rate on the **previous** + version; if *its* rate also rose, the cause is **environmental** (a new Android major, Play + Services / Credential Manager, server/eSTS), not this release. +2. **Decompose (P4–P6).** Re-check on **higher-volume days** (early-cohort skew + thin-day variance); + break the move by `AppVersion × OsLevel × DeviceInfoMake` via the **`*_Errors_MV_V1`** companion. + One OS across all OEMs ⇒ platform driver (often amplified by that OS's adoption growth); one OEM ⇒ + device bug; tracks the new version's ramp ⇒ substitution. Classify each reason as **benign** + (duplicate / "already registered"), **abandonment** (user/system cancel), or **true defect**, and + report a **defect-only** rate — a "drop" that is entirely benign/abandonment is **not** a quality + regression. +3. **Drill + prove (P7–P9).** For a coarse reason, drill to the structured sub-code in raw + `passkeyoperations.AllProperties` (e.g. `DeviceUnauthenticatedErrorCode` — soft codes 5/10/13/14 = + abandonment vs hard 1/7/9 = device/defect). Use `.show materialized-view | project Query` + to confirm what the metric counts before drilling. If the app is still the suspect, **prove it + against the diff**: the Authenticator is its own repo (`authenticator/`, base `working`, tags like + `6.2606.3817` / `v6.2605.3042`) — `git --no-pager diff .. -- `. + Gate logic unchanged ⇒ funnel/population/environment, not code; only a steady, code-correlated, + defect-rate gap earns a WATCH/HOLD. Name in the verdict **which** rung explained the headline delta. 
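
The first rung of the ladder above (normalize to a rate per initiation, compare new-build vs old-build rates, then apply the code-frozen control) can be sketched as a small classifier. This is an illustrative sketch only, not part of `compare-versions.js`: the function names, the input shape `{ errors, initiated }`, the returned labels, and the 0.5 percentage-point tolerance are all assumptions made for the example; the 1,000-initiation floor follows the volume guard described in this document.

```javascript
// Hypothetical sketch of rungs P1-P3. Inputs are { errors, initiated } pairs:
//   newBuild  = new version, current window
//   oldNow    = previous version, same window
//   oldBefore = previous version, an earlier window (the code-frozen control)

// P1: convert a raw error count into a rate per initiation.
function ratePerInitiation(errors, initiated) {
  return initiated > 0 ? errors / initiated : null;
}

function classifyMove(newBuild, oldNow, oldBefore, opts = {}) {
  const { tolerancePp = 0.5, volumeFloor = 1000 } = opts;

  // Volume guard: too few initiations is noise, not a regression.
  if (newBuild.initiated < volumeFloor) return "low-volume";

  const rNew = ratePerInitiation(newBuild.errors, newBuild.initiated);
  const rOldNow = ratePerInitiation(oldNow.errors, oldNow.initiated);
  const rOldBefore = ratePerInitiation(oldBefore.errors, oldBefore.initiated);
  if (rNew === null || rOldNow === null || rOldBefore === null) {
    return "insufficient-data";
  }

  const gapPp = (rNew - rOldNow) * 100; // P2: new-build vs old-build rate
  const oldDriftPp = (rOldNow - rOldBefore) * 100; // P3: did the frozen build move too?

  if (gapPp <= tolerancePp) {
    // Rates match: a rising count on the new build is either the environment
    // moving under both builds, or plain rollout substitution.
    return oldDriftPp > tolerancePp ? "environmental" : "substitution";
  }
  // A real rate gap is still only a suspect; decompose and drill (P4-P9) next.
  return "suspect";
}
```

A `"suspect"` result is deliberately not a verdict: per the ladder, it only earns a WATCH/HOLD after decomposition, drill-down, and the diff check confirm a steady, code-correlated, defect-rate gap.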
+ +### Step 4 — Compare + classify +Feed payloads to `compare-versions.js`: +```powershell +node "$S\compare-versions.js" rows --file "$DATA\broker-reliability-by-version.json" ` + --version-col broker_version --first 16.1.0 --second 16.0.1 ` + --metrics SilentDevReliability,InteractiveDevReliability --threshold 1.0 +node "$S\compare-versions.js" movers --file "$DATA\broker-top-errors-by-version.json" ` + --key-col error_code --delta-col shareDeltaPp --lower-is-better true --top 10 +node "$S\compare-versions.js" movers --file "$DATA\crash-diff.json" ` + --key-col label --first-col basePer1k --second-col newPer1k ` + --delta-col rateDeltaPer1k --lower-is-better true --top 10 +``` +Use `--lower-is-better` for latency, error-rate, and error-share metrics (down = good). Pass +`--volume-col`/`--volume-floor` so low-volume scenarios classify as `low-volume`, not regression. + +### Step 5 — Fill the report in place +Edit the bootstrapped HTML: version strings, window dates, KPI tiles (humanize counts — `585.3M`, +not `585300000`), table rows, and a **verdict callout per app** whose prose follows what the +deltas say — and, per Step 3d, **names which diagnostic rung** explains each headline move +(volume/substitution, code-frozen ⇒ environmental, early-cohort skew, benign/abandonment, or a real +defect-rate gap); prefer a **defect-only** rate over a raw success-rate "drop" that is entirely benign +or user-cancellation. Keep both app `
`s even if one is "clean/flat". For Authenticator, fill the +`#auth-stability` section from `crash-diff.json` (crashes/1k KPIs + the per-1k crash table — surface +`exceptionMessage` and `appCodeFrame` in each row, and flag rows whose `firstOccurrence` is inside the +window as **new**) and from `crash-enrich.json` (annotate the top crash cards with the trend tag + top +OS-major/device-model so the reader sees *why* it moved, per Step 3b); fold the stability verdict into +the Authenticator callout — a spike-then-decay or single-OEM/obfuscated mover is **not** a HOLD, an +OS-concentrated rising mover broad across models is. **For any confirmed new/rising first-party crash, +add a crash `code-attr` card** to `#auth-stability` (per Step 3b.5 / P10 step 6) — Originator (`origin-app` +for a confirmed first-party regression; reserve `origin-android`+`origin-env` for a genuine OS×build-config +interaction only after the exception-token + diff-grep + DI path-log searches all come up empty) / +Mechanism (name the victim frame and the culprit caller separately) / Release range / Likely PRs with +honest confidence / Next step; model it on the canonical card in the template's `#auth-stability` section. +If no crash token was available, leave +the section's appendix note that App Center crashes were not pulled. When an Authenticator version +is given, also fill the **`#auth-broker`** section ("Broker via Authenticator") from +`broker-by-host-app.json` + `broker-top-errors-by-host-app.json`: KPI tiles (device error rate + +silent/interactive reliability for the app-hosted broker) and the per-`error_code` movers table, +with a verdict on whether the broker regresses *because of this app release*. 
**Always span-drill +the silent path** (Step 3c): fill the span-breakdown sub-table (`broker-errors-by-host-app-span.json`) +and, when an eSTS code is elevated, the `code-attr` attribution card — do **not** write "no broker +regression" off the device-share table alone, since it masks per-span request spikes. If the +fleet-wide Broker section flags a code that this host-scoped view does **not**, call that out +explicitly (the fleet delta is driven by another host, e.g. Link to Windows). Mark a HOLD when a +regression dominates the headline; note early-rollout caveats (the new version's cohort skews toward +upgrade/network churn — an early spike that decays is "watch," not a clear HOLD). Leave the `` CSS untouched. + +### Step 6 — Validate +```powershell +& "$S\validate-report.ps1" -Path $out -BrokerVersion 16.1.0 -AuthVersion 6.2606.3817 +``` +Fix all ERRORS (exit 1). WARNINGS are advisory. Then open the file in a browser to eyeball it. + +## Gotchas (full list in the cheatsheet) +- **Distinct devices (Broker):** `dcount_hll(hll_merge(countDevicesHll))` — never `sum(countDevices)`. +- **Percentiles (Broker):** `percentiles_array_tdigest(tdigest_merge(...))` — never average percentiles. +- **A delta is a question, not a verdict (see investigation-patterns.md):** before any Authenticator + scenario row becomes "regression," normalize counts to a **rate per initiation** (P1); compare the + **new-build rate vs the old-build rate** — a rising new-build *count* with equal rates is rollout + **substitution**, not a regression (P2); and run the **code-frozen control** — if the same rate also + rose on the **previous** version, it's environmental (OS / Play Services / Credential Manager / + eSTS), not this release (P3). Prove a real suspect against `git diff ..` in + `authenticator/` (P8) — unchanged gate logic ⇒ funnel/population, not code. 
+- **Benign failures inflate the "Failed" bucket:** the outcome MVs lump expected outcomes (duplicate / + "already registered") and user/system **cancellation** into `Failed`, depressing the headline + success rate without anything breaking. Decompose via the **`*_Errors_MV_V1`** companion (every + scenario has one — `Error × OsLevel × AppVersion × DeviceInfoMake`, with `ErrorCount`/`ErrorDCount`) + and report a **defect-only** rate; drill to the raw `passkeyoperations.AllProperties` sub-code (e.g. + `DeviceUnauthenticatedErrorCode`: 5/10/13/14 = abandonment, 1/7/9 = device/hard) to separate cancel + from defect. A "drop" that is entirely benign/abandonment is **not** a quality regression. +- **Authenticator outcomes:** Registration/Auth MVs have only `Initiated/Succeeded/Failed` (+DCount); + PN completion needs the init-MV ⋈ `_Results_MV_V1` join. MSA NGC vs SA splits on `IsNGC` on **both** join sides. +- **Volume guard:** < ~1K initiates = noise, not a regression. +- **Broker per host app:** the Broker MVs carry `active_broker_package_name` (host) + + `AppInfo_Version` (= the host app's version for that package, e.g. Authenticator `6.2606.3817`, + not `broker_version`). Filter `active_broker_package_name == "com.azure.authenticator"` and + compare by `AppInfo_Version` to attribute broker movement to the app rollout; never attribute a + fleet-wide `broker_version` delta to an app (Link to Windows ≈122 M devices can dominate it). +- **Crashes — share ≠ rate:** lead with crashes per 1k active devices (App Center count ÷ Kusto + devices), not crash-share; App Center `errorGroupId` is version-scoped so `diff` joins on the + crash signature, and App Center's native crash-free metric is retired. See `crash-sources.md`. 
+- **Crashes — page to the total, then read the message and dimensions:** there is **no aggregate-total + endpoint**, so the crash total (rate numerator + share denominator) is *only* the pages you fetch — a + busy version has 1,300+ groups, so use `--page-cap 0` for a verdict or the rate is undercounted. + `exceptionMessage` is usually the whole story (a `RemoteException` reading `validateForegroundServiceType` + = Android FGS enforcement; a `NotSerializableException` naming the class). Don't verdict a crash mover on + count alone — `enrich` its **trend** (spike-then-decay = rollout churn, not a HOLD) and **OS/model** + (broad-across-models + OS-concentrated = platform/release; one rugged OEM on obfuscated frames = tamper/sideload). +- **Crashes — "new" is the #1 false alarm (`firstOccurrence` = rollout date):** App Center's + per-version `firstOccurrence` is when that *version* shipped, NOT when the signature first appeared, + so a years-old crash shows a first-seen inside your window and a per-1k rate that "rose" (young + build = small device denominator). `diff status=new` only means "absent from the single baseline." + To actually find new crashes use **`newcrashes --priors v1,v2,v3,v4`** (anti-join vs a *union* of + priors; java-frame `genuinely-new` is the signal, native/hex `new-native?` is build-unique noise, + triage on `newDevices`), and confirm any suspect with **`signature --match … --trend`** (cross-version + presence). A signature still firing on prior versions is pre-existing/environmental, not this release. + Verified: the okhttp `FileSystem$1.rename:87` IOException looked new+regressing on 6.2606 but exists + on every version back to 6.2601. See `crash-sources.md` gotcha #3 and investigation-patterns P10. +- **Device-share masks per-span spikes:** the host-app movers table dedups a device across all spans, + so a code can look flat/down there while its per-request rate climbs inside one span. 
Re-slice with + `broker-errors-by-host-app-span.kql` (request-level rate by `span_name`) before declaring "no broker + regression." Separate early-rollout decay from a steady gap with a daily trend. +- **Release PR correlation:** for eSTS-returned codes (`invalid_grant`, `interaction_required`) the + trigger is in the bundled broker version range — read `git log v..v` in `broker/`+`common/` + in full, then `find-suspect-prs.ps1 -Range`. Weight device-PoP/PRT/cache changes; exclude different-code + fixes and SdkType-gated PRs (`MSAL_CPP` = OneAuth, not the Authenticator app). +- **Secrets:** the App Center token and anything under `~/.android-release-reports/` stay out of + the repo and out of the report — never echo or paste them. +- **UTF-8 trap:** never write the HTML through a PowerShell `@'…'@` heredoc (it strips emoji/arrows). + Use the `edit`/`create` tools or `[IO.File]::WriteAllText($p,$t,[System.Text.UTF8Encoding]::new($false))`. +- **Filename collision:** a populated report is never silently overwritten — bootstrap halts unless `-Force`. diff --git a/.github/skills/release-monitoring-report/assets/docs/crash-sources.md b/.github/skills/release-monitoring-report/assets/docs/crash-sources.md new file mode 100644 index 00000000..9cfdee79 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/docs/crash-sources.md @@ -0,0 +1,371 @@ +# Crash & stability sources — Authenticator release monitoring + +How to add an **Authenticator crash/stability** section to the release report. Scope is +**Authenticator only** — the Broker is not a store app and has no App Center / Play presence. +Crash data is **not** mirrored into Kusto, so it is pulled live from **App Center**; Kusto +supplies only the rate **denominator** (active devices per version). 
+ +## TL;DR pull (two steps) + +```bash +# 1) NUMERATOR — App Center crash clusters, paired across versions (signature-joined diff) +# --page-cap 0 exhausts paging so the crash total (rate numerator) isn't undercounted. +node assets/scripts/fetch-appcenter-crashes.js diff \ + --owner authapp-t7qc --app Microsoft-Authenticator-Android-Prod-App-Center \ + --version --base --days 14 --page-cap 0 \ + --devices-new --devices-base --out crash-diff.json + +# 2) DENOMINATOR — Kusto active devices per version (run FIRST to get above) +./assets/scripts/run-kql.ps1 -Query (Get-Content assets/queries/authenticator-crash-denominator.kql -Raw) ` + -Cluster https://idsharedeus2.eastus2.kusto.windows.net ` + -Database d496be22d62a46b0a3cf67ea2e736fd8 -Out denom.json +# (substitute /// in the .kql first) + +# Rank by the honest per-1k RATE delta (lower is better): +node assets/scripts/compare-versions.js movers --file crash-diff.json \ + --key-col label --first-col basePer1k --second-col newPer1k \ + --delta-col rateDeltaPer1k --lower-is-better true --top 10 + +# 3) ENRICH the top movers — per-group daily TREND (rising/decaying/spike-then-decay) + an +# instance-sampled OS-major & device-model CONCENTRATION (the list view can't show either): +node assets/scripts/fetch-appcenter-crashes.js enrich \ + --owner authapp-t7qc --app Microsoft-Authenticator-Android-Prod-App-Center \ + --version --days 14 --top 8 --out crash-enrich.json + +# 4) NEW-CRASH SCAN — "what crashes are genuinely new in this release?" Anti-join the new build's +# signatures against a UNION of recent priors (NOT just the baseline — see gotcha #3). Defaults to +# --page-cap 0 so a "new" verdict can't miss a low-count prior occurrence. +node assets/scripts/fetch-appcenter-crashes.js newcrashes \ + --owner authapp-t7qc --app Microsoft-Authenticator-Android-Prod-App-Center \ + --version --priors ,,, \ + --days 14 --min-count 5 --devices-new --out new-crashes.json + +# 5) IS THIS CRASH VERSION-SPECIFIC? 
— cross-version presence of ONE signature (+ trend on the new +# build). Use to confirm a "watch"/"new" row before you verdict it. +node assets/scripts/fetch-appcenter-crashes.js signature \ + --owner authapp-t7qc --app Microsoft-Authenticator-Android-Prod-App-Center \ + --version --priors ,, \ + --match "" --days 27 --trend --out sig.json +``` + +Order matters: run the denominator query first, read the two device counts out of `denom.json`, +then pass them as `--devices-new` / `--devices-base` so the diff can compute the rate. Then run +`enrich` on the rolling-out version to get the trend + dimensions for the signatures the diff flagged. + +## Why App Center (and not Play Console) + +| Source | Per-crash detail? | Filter by version? | Verdict | +|--------|-------------------|--------------------|---------| +| **App Center Diagnostics** (`errors/errorGroups`) | **Yes** — exception type, crashing class/method/line (`codeRaw`), count, deviceCount | Yes (`?version=`) | **Use this** | +| Play Console UI | No — aggregate numbers only | Partial | Numbers only, no clusters | +| Play Console export (Reporting API / BigQuery / GCS) | No detail; needs a GCP-created service account (gated, centrally owned at Microsoft) | n/a | **Deferred — Phase 2** | + +Play Console service-account export is a future enhancement (see "Phase 2" below). For now, +App Center is the only source that returns actionable crash clusters. + +## Authentication (secret handling) + +`fetch-appcenter-crashes.js` needs an App Center **read-only User API token**. Resolution order: + +1. `--token-file ` +2. `$APPCENTER_API_TOKEN` +3. `~/.android-release-reports/appcenter.token` (default; 40-char value) + +The token is a **SECRET**. Keep it out of the repo, never echo or paste it into the report, +and never commit any file under `~/.android-release-reports/`. Create one at App Center → +**Account settings → API tokens** (read-only is sufficient). 
+ +## App slug + +- Owner (org): `authapp-t7qc` +- App: `Microsoft-Authenticator-Android-Prod-App-Center` + +(A Dev variant and two iOS apps also exist under the same owner; use the Prod Android app.) + +## Endpoint & paging + +`GET https://api.appcenter.ms/v0.1/apps/{owner}/{app}/errors/errorGroups?version=&start=&$top=100&$orderby=count desc` +with header `X-API-Token: `. Returns `{ errorGroups: [...], nextLink }`. + +**nextLink quirk (critical):** `nextLink` comes back as a relative path **with an extra `/api` +prefix**, e.g. `/api/v0.1/apps/...&$token=`. Prefixing the host verbatim → +`https://api.appcenter.ms/api/v0.1/...` → **404**. Correct handling (already in the script): +strip the leading `/api/` → `/v0.1/...`, then prefix `https://api.appcenter.ms`. + +**Paging completeness drives total accuracy.** Page size is 100 and a busy version has **1,300+** +groups (verified: 12 pages = 1,293 groups / 46,947 crashes, *still* more pages). There is **no working +aggregate-total endpoint** for this app (see below), so the crash **total** — the denominator for +crash-share and the numerator for the per-1k rate — is *only* the sum of the pages you fetch. The +default `--page-cap` is now **12**; for a release verdict pass **`--page-cap 0`** (exhaust, hard stop +100 pages) so the total isn't silently undercounted. The script prints a `WARNING … total is +UNDERCOUNTED` to stderr when it stops with pages remaining — heed it. + +## What App Center exposes (capture map) + +The `errorGroups` list response carries more than the headline count — capture it, it's free: + +| Field | What it is | Used for | +|-------|-----------|----------| +| `count`, `deviceCount` | crashes + (sub-group) devices | rate (P1/P2) — but `deviceCount` summed across sub-groups **over-counts** (a device can hit several); it's an upper bound, so lead with crashes-per-1k, not "devices crashed". 
| +| `codeRaw` (`label`) | crashing class.method signature | the cross-version join key | +| `exceptionType` | e.g. `NotSerializableException`, `RemoteException` | triage | +| `exceptionMessage` | the actual message — often the whole story (e.g. a `RemoteException` whose message is `validateForegroundServiceType` = an Android FGS-type enforcement crash; a `NotSerializableException` naming the unserializable class) | root-cause | +| `exceptionClassName` + `exceptionMethod` + `exceptionLine` | precise crash site → `class.method:line` (`appCodeFrame`) | attribution. **NB** `exceptionClassMethod` / `exceptionAppCode` are **boolean flags**, *not* frame strings — don't use them as the frame. | +| `firstOccurrence` | the **version's rollout date** (NOT the signature's app-history first-seen — see gotcha #3) | a *candidate* "new", never a verdict — confirm with `newcrashes`/`signature` | +| `appBuild`, `state`, `hidden` | build no.; Open/Closed/Ignored; muted flag | filtering | + +The script now emits `appCodeFrame`, `exceptionMessage`, `firstOccurrence`, `appBuild`, `state` +(groups + diff) and **drops `hidden==true` / `state=="Ignored"` groups by default** so team-muted +noise doesn't inflate the rate (`--include-hidden` to keep them). + +### What the list view CAN'T show — the `enrich` mode + +Two diagnostics the patterns need (`investigation-patterns.md` P4/P6) are not in the list response: + +- **Per-group daily trend** — `GET …/errors/errorGroups/{id}/errorCountsPerDay?version=&start=` + → `{ errors:[{datetime,count}…] }`. Separates an **early-rollout spike that decays** from a + **sustained regression** (P4). `enrich` classifies it `rising` / `decaying` / `spike-then-decay` / + `steady` and reports the peak/last day. 
+- **OS-major & device-model concentration** — the aggregate `operatingSystemCounts` / `modelCounts` + endpoints **404** for this app, so dimensions come from a **capped sample** of + `GET …/errors/errorGroups/{id}/errors` instances (each carries `osVersion`, `deviceName`, `country`). + `enrich` reports the dominant OS major + model with their sampled % share (P6) — e.g. it instantly + separates *"100% Android 14, ~98% Zebra rugged devices"* (tamper/repackaging) from *"Android 16, + broad across models"* (a real platform-driven crash). + +Run it on the rolling-out version after the diff: `fetch-appcenter-crashes.js enrich --version +--top 8 --out crash-enrich.json`. Fold the trend + OS/model into each crash card; an OS-concentrated, +rising, broad-across-models signature is a platform/release suspect, while a model-concentrated one +on rugged/obfuscated frames is a tamper/sideload signal, not app quality. + +## Endpoints that DON'T work for this app (don't waste a call) + +| Endpoint | Result | Use instead | +|----------|--------|-------------| +| `errors/errorCounts`, `errors/affectedDeviceCounts`, `errors/errorCountsPerDevice` | **404** | sum paged `errorGroups` (exhaust with `--page-cap 0`) | +| `errors/errorGroups/{id}/operatingSystemCounts`, `…/modelCounts` | **404** | instance sample via `…/{id}/errors` (the `enrich` mode) | +| `errors/availableVersions` | **404** | resolve versions from a high-volume Kusto MV (queries/README) | +| **version-level** `errors/errorCountsPerDay` | **200 but drains to 0** (like retired Analytics) | the **per-group** `…/{id}/errorCountsPerDay` works — sum those if a version-level trend is needed | +| `errorGroups?errorType=handlederror` | **200 but empty** — Authenticator tracks no handled (non-fatal) errors | n/a; crashes = unhandled only, which is what we want | + +## Three gotchas the script already handles + +### 1. 
`errorGroupId` is version-scoped — join on the SIGNATURE +Verified empirically: **0** `errorGroupId` overlap between two versions, but **116** +`codeRaw`/`label` (crash-signature) overlap. So the cross-version `diff` joins on the +**signature** (`labelOf` = `codeRaw` / crashing frame), aggregating sub-groups that share a +frame. Never diff on `errorGroupId` — it would mark every cluster as brand-new. + +### 2. The share-vs-rate trap — ALWAYS lead with the per-1k rate +Crash **share** (a signature's % of a version's total crashes) is misleading when the two +versions' total crash pools differ in size. Real example (6.2606.3817 vs 6.2605.3042): + +- `ValidationCheckType$5.resetCache`: share **5.98% → 12.36%** (looks like a +6.4 pp regression) +- …but absolute crashes **29,052 → 1,336**, and rate **0.388 → 0.106 /1k (−73%)** — an *improvement*. + +It only took a bigger slice of a **much smaller pie** (overall rate fell 6.49 → 0.86 /1k). With +denominators supplied, `diff` derives `status` and ranking from the **per-1k rate**, not share. +Report the rate as the headline; keep share only as a *composition* signal ("what dominates the +remaining crashes"). Without denominators, status falls back to share — flag that as provisional. + +### 3. `firstOccurrence` is the version's ROLLOUT date — NOT the signature's app-history first-seen +The single most dangerous crash trap, and why `diff`'s `status=new` (absent from the **single** +baseline) is only a *candidate*, never a verdict. App Center's `firstOccurrence` is scoped to the +version-scoped group, so it equals roughly **when that version started rolling out** — a crash that +has existed for many releases shows a `firstOccurrence` *inside* your window on the new build. 
+ +Real example (the okhttp HTTP-cache journal-rename `IOException`, +`com.android.okhttp.internal.io.FileSystem$1.rename:87`): on 6.2606.3817 its group reports +`firstOccurrence = 2026-06-11` (rollout day) and the per-1k rate rose 0.016 → 0.043 — it *looks* +brand-new and regressing. Cross-version `signature` scan proves it is present on **every** recent +version and still actively firing on each through today: + +| version | crashes (27d) | devices | last seen | +|---|--:|--:|---| +| 6.2606.3817 (new) | 1,536 | 1,208 | today | +| 6.2605.3042 (base) | 1,344 | 1,098 | today | +| 6.2604.2550 | 285 | 234 | today | +| 6.2603.1485 | 916 | 743 | today | +| 6.2602.0889 | **8,300** | 6,984 | today | +| 6.2601.0189 | 2,183 | 1,947 | today | + +So it is **pre-existing and environmental** (Android system okhttp `DiskLruCache` journal rename +failing under disk pressure), not a 6.2606 regression. The apparent "rise" is the **young-cohort +ramp + App Center upload lag** (the daily series climbs 6→418 over the window because adoption + +uploads are still filling in), and the per-1k bump is a **denominator artifact** — a comparable raw +count over the young build's much smaller active-device base. **Rule:** never call a crash new or +regressed off `firstOccurrence`-in-window or a single-baseline `diff` alone. Confirm with the +`newcrashes` anti-join (union of priors) and/or the `signature` cross-version scan, and read the raw +**count** alongside the per-1k rate (a young build inflates per-1k even when the count is in-family). + +## Is a crash NEW, or just new to this version's group? (`newcrashes` + `signature`) + +Two modes operationalize gotcha #3 so "find the new crashes in this release" is one command, not a +manual probe: + +- **`newcrashes`** — anti-joins the new build's signatures against the **union of several recent + prior versions** (pass `--priors v1,v2,v3,v4`), not just the immediate baseline. 
A signature is + `genuinely-new` only when it is absent from **all** listed priors within the 27-day API window AND + present on the new build. Still-active priors keep throwing structural/environmental crashes, so a + 27-day anti-join reliably catches them; a defect introduced *this* release is the residue. + - **Native/hex-frame caveat (built in):** a native crash whose only frame is a raw address + (`0x1d0c37a8 + 481192`) or a bare signal (`SIGABRT`/`SIGSEGV`/`minidump`) has a signature that + **differs per build** (the address relocates), so it *always* anti-joins as "absent from priors." + The mode tags those `frameKind=native` with verdict `new-native?` and floats the actionable + **java-frame** `genuinely-new` rows to the top. On 6.2606.3817 this separated **12** java-frame + new signatures (e.g. a cluster of `com.wolfssl…`/`org.bouncycastle…` `NoSuchFieldError` / + `NoSuchMethodError` / `OutOfMemoryError` — smells like a crypto-lib bump this release) from **133** + native-unsymbolized suspects. Judge a `new-native?` row by `enrich` (OS/model + count + trend), + **not** the signature anti-join. + - **Device count matters as much as crash count:** many "new" rows are a single device crash-looping + (high `newCount`, `newDevices=1`) — low fleet impact. Sort/triage on `newDevices`, not just `newCount`. + - It still can't see beyond the 27-day API window — a signature dormant on priors >27d ago can't be + distinguished from truly new. For java frames that's rare; corroborate a high-impact one against + the `authenticator/` `git diff ..` (P8) before you call it a release defect. +- **`signature`** — cross-version presence of **one** signature (`--match `), plus the + daily trend on the new build (`--trend`). This is the "is crash X version-specific or pre-existing?" + confirmation — run it on any `watch`/`new`/`regressed` crash row before it earns prose in the verdict. 
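
The anti-join described above can be sketched in a few lines of Node. This is only an illustration under stated assumptions, not the code in `fetch-appcenter-crashes.js`: the function name `findNewCrashes`, the input shape (`{ label, count, deviceCount }` per group), the native-frame regex, and the `com.example.*` labels in the usage note are all hypothetical. It shows the three ideas from the list: union the priors, tag build-unique native frames separately, and rank by device reach rather than crash count.

```javascript
// Hypothetical sketch of the "newcrashes" anti-join; names and shapes are assumptions.
// Assumed heuristic for build-unique native frames: a raw address or a bare signal name.
const NATIVE_FRAME = /^0x[0-9a-f]+|SIGABRT|SIGSEGV|minidump/i;

function findNewCrashes(newGroups, priorGroupsByVersion, minCount = 5) {
  // Union of every signature seen on ANY prior version (not just the baseline).
  const seenOnPriors = new Set();
  for (const groups of Object.values(priorGroupsByVersion)) {
    for (const g of groups) seenOnPriors.add(g.label);
  }

  // Aggregate the new build's sub-groups that share one signature.
  const bySignature = new Map();
  for (const g of newGroups) {
    const row = bySignature.get(g.label) || { label: g.label, newCount: 0, newDevices: 0 };
    row.newCount += g.count;
    row.newDevices += g.deviceCount; // upper bound: one device can hit several sub-groups
    bySignature.set(g.label, row);
  }

  return [...bySignature.values()]
    // Keep signatures above the count floor that no prior version reported.
    .filter((r) => r.newCount >= minCount && !seenOnPriors.has(r.label))
    .map((r) => {
      const native = NATIVE_FRAME.test(r.label);
      return {
        ...r,
        frameKind: native ? "native" : "java",
        verdict: native ? "new-native?" : "genuinely-new",
      };
    })
    // Java frames float to the top; within a kind, sort by devices, not crashes.
    .sort((a, b) => {
      if (a.frameKind !== b.frameKind) return a.frameKind === "java" ? -1 : 1;
      return b.newDevices - a.newDevices;
    });
}
```

Called with the new build's groups and an object keyed by prior version, e.g. `findNewCrashes(newGroups, { "6.2605.3042": baseGroups, "6.2604.2550": olderGroups })`, it returns only rows absent from every listed prior. A `new-native?` row is still a candidate to check with `enrich`, not a verdict.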
+ +## Crash → PR / code attribution (for a NEW or RISING crash) + +Once P10 confirms a crash is genuinely **new** (java-frame `genuinely-new` vs a union of priors) or +**rising** (a cross-version-confirmed worsening per-1k gap, not a young-cohort/upload-lag artifact), +attribute it to the code that owns it — same idea as the Step 3c eSTS `code-attr` card, but the +mechanics differ in one decisive way: + +> **A crash names its own code.** The crashing **frame is first-party code directly** (`class.method:line`), +> so you map the frame → its repo and look for the change *in that repo's* release range. Contrast the +> eSTS path, where `invalid_grant` is a **server-returned string** and the broker/common change is only +> the upstream *trigger*. For a crash there is no server in the loop — the suspect PR lives in the repo +> that owns the crashing class. + +**Don't attribute** a `new-native?` (build-unique address) frame, an obfuscated/tampered frame +(`com.c.b.b.…`), a `newDevices=1` crash-loop, or any signature `signature` still finds firing on prior +versions — those are not release-introduced (gotcha #3 / P10). Attribution is for a confirmed, +fleet-broad, first-party new/rising signature only. + +### 1. Map the crashing frame → repo — but the frame names the VICTIM, not always the culprit + +| Frame package prefix | Owning repo | `-Repos` | +|---|---|---| +| `com.microsoft.authenticator.*`, `com.azure.authenticator.*`, `bastion.*`, `onlineid.*` | Authenticator app (`authenticator/`) | `authenticator` | +| `com.microsoft.identity.common.*` (`identity.common…`) | common (`common/`) | `common` | +| `com.microsoft.identity.broker*`, broker-service classes | broker (`broker/`) | `broker` | +| `dagger.*`, `androidx.*`, `kotlin.*`, `okhttp.*`, raw-address/native | a **framework/dep API** — the throwing frame is library code; the culprit is the **first-party CALLER** that invoked it with bad input. Search the app repo for the **API token**, not the framework. 
| + +> **Victim vs culprit — the rule that matters.** A crash frame names the object/API the runtime was +> *inspecting when it threw* (e.g. `…MfaAuthDialogActivity does not implement GeneratedComponent`). That +> object is often **not** the file that broke — it was *handed* to a failing API by some other code. The +> culprit is the **caller** that introduced the bad call, which can live in a completely different file +> with an unrelated PR title. Searching for the crashing class name will miss it (see §2). + +### 2. Correlate over the APP's own release range — search the EXCEPTION TOKEN, with `-DiffGrep` + +The window is the **rolling-out app version range**, expressed in the app's own tags +(`..`, e.g. `6.2606.3817..6.2606.4029`) — **not** a date window and **not** the +broker tag range. `find-suspect-prs.ps1` resolves an app/broker tag range against the repo's own tags +first, so `-Repos authenticator` with an app-tag range scans exactly that release's commits: + +```powershell +# Crash: dagger.hilt.EntryPoints.get:62 IllegalStateException on MfaAuthDialogActivity +& "$S\find-suspect-prs.ps1" -Repos authenticator -Range 6.2606.3817..6.2606.4029 ` + -Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints|GeneratedComponent' +``` + +- **`-Symbol` = the exception / API token from the stack** (`EntryPoints.get`, `GeneratedComponent`), + **not** the crashing class. `git log -S` is a content pickaxe — pointed at the API token it finds the + **caller** that introduced the failing call. Pointed at the crashing class (the victim) it finds + nothing when, as is usual, the Activity itself was never edited. +- **`-DiffGrep` = the same token(s)** — runs `git log -G` over the **diff text**. This is mandatory for + crashes: `-GrepRegex` matches only the commit **subject**, and a culprit PR almost never advertises the + subsystem it broke (the real example below was titled *"TOTP Secret Fix - Phase 1"* yet added a Hilt + `EntryPoints.get` call). 
Subject-grep had zero chance; diff-grep caught it instantly. +- **Secondary searches** — also pickaxe the crashing class (`-Symbol MfaAuthDialogActivity`) to catch the + rarer case where the Activity *was* directly edited, and **path-log the DI graph**, not just the + crashing file: `git -C authenticator log --oneline -- **/di/ **/dagger/ **/*Module.kt`. A + module that changes what it provides (a `ContextModule` returning the wrong `Context`) breaks consumers + without touching either the consumer or the victim. +- The Authenticator repo is **ADO**; merge commits read `Merged PR NNNNNNNN: ` (8-digit PR ids, + no `#`). The script parses that and emits + `https://msazure.visualstudio.com/One/_git/AD-MFA-phonefactor-phoneApp-android/pullrequest/NNNNNNNN` URLs. + +### 3. Emit a crash `code-attr` card (in `#auth-stability`) + +Same card shape as the eSTS one (Originator / Mechanism / Release range / Likely PRs with honest +confidence / Next step), placed in the **`#auth-stability`** section after the crash table: + +- **Originator** — tag the true source: `origin-app` (red) for a confirmed first-party Authenticator-code + regression; `origin-thirdparty` for a dep bump; `origin-broker`/`origin-common` if a broker/common frame + regressed; reserve `origin-android` + `origin-env` for a genuine OS × build-config interaction (the + *last* resort — see the caveat, and do not reach for it until §2's exception-token + diff-grep + DI path + log all come back empty). +- **Mechanism** — what the runtime is doing at the frame, *which caller* fed it the bad input, and why + *this release* introduced it. Name the victim and the culprit separately. +- **Release range** — the app tag range + how you found it (exception-token pickaxe + diff-grep + DI path + log). If a naïve crashing-class search missed the culprit, say so — it documents the method. +- **Likely PRs** — each with an honest `pr-conf` badge keyed to the **caller** PR. 
A confirmed + `EntryPoints.get`/module-provider change that the fix later reverts is `pr-conf-high`. +- **Next step** — the landed/queued fix (PR + work item) and the regression test that pins it. + +### The build-config / environment caveat — the LAST resort, not the first + +Only after the §2 exception-token pickaxe, the `-DiffGrep` diff scan, **and** the DI-graph path log all +come back empty should you consider an environmental cause — an **OS-major × build-config** interaction +(code-shrinker DexGuard/R8 stripping/renaming a generated class, a `targetSdk` bump, a Play Services / +Credential Manager change) surfacing as that OS's adoption grows. It is real but **rare**, and an +OS-concentration signal (e.g. "66.7% Android 16") is **not** sufficient evidence for it — a genuine +caller bug can concentrate on the newest OS too (the path that triggers it is simply exercised more +there). Tag `origin-android`+`origin-env`, confidence **low**, and explicitly note you ruled out a code +caller — never let an empty *crashing-class* search alone (the easiest search to get wrong) justify an +environmental verdict. + +> **Verified example (6.2606.4029) — a caller bug that first looked environmental.** The +> `dagger.hilt.EntryPoints.get:62` `IllegalStateException` ("Given component holder class +> `…MfaAuthDialogActivity` does not implement interface `dagger.hilt.internal.GeneratedComponent`") was +> genuinely-new (0 on 6.2606.3817), 66.7% Android 16. A `-Symbol MfaAuthDialogActivity` search found +> nothing, and the OS concentration tempted an "Android-16 × DexGuard shrinker" verdict — **which was +> wrong.** Re-running with the exception token, `-Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints'`, +> surfaced **PR 15896454** *("[MSRC] [110950] - TOTP Secret Fix - Phase 1"*, Cesar Acosta, `7d3da30b13`) +> on the first try. 
It added `OathSecretEncryptionUseCase`, which resolves its ECS dependency via +> `EntryPoints.get(applicationContext, SecureTotpEcsDependency::class.java)` — but `applicationContext` +> is bound from the **legacy Dagger `ContextModule.provideContext()`**, which returned *the context the +> component was built with*. The MFA dialog fragments build it with `requireContext()` — i.e. the +> **`MfaAuthDialogActivity`** — so `EntryPoints.get(activity, …)` ran against an Activity, which is not a +> Hilt generated component holder → the exact crash. The Activity was the **victim**; the culprit was the +> new `EntryPoints.get` caller in a different file under an unrelated PR title. Fix: **PR 16249408** +> (`023aec8abd`, "normalize ContextModule to application context", **AB#3677526**) → +> `provideContext(): Context = context.applicationContext`. This is `origin-app`, **high** confidence — a +> real release regression, not an OS/shrinker interaction. + +## App Center Analytics is RETIRED — no native crash-free % + +`analytics/crash_counts` → **410 Gone**; `analytics/crashfree_users` / `crashfree_devices` → +**404**; `session_counts` / `active_device_counts` respond but drain to ~0. So App Center cannot +give a crash-free percentage. The **only** way to a true rate is App Center crash counts +(numerator) ÷ Kusto active devices (denominator). Diagnostics/`errorGroups` itself remains alive. + +## Rate caveats (state these in the report) + +- **Population mismatch:** App Center counts devices whose App Center SDK reported a crash; + Kusto counts devices emitting product telemetry. The ratio is a directional rate, **not** an + exact crash-free %. +- **Early-rollout numerator lag:** on a freshly-rolling-out build, App Center crash uploads are + still accumulating, so a very low new-build rate can be partly an artifact — confirm the trend + as adoption grows (same caveat as the Broker early-rollout cohort). 
+- **"New" clusters need the union anti-join, not a single-baseline diff:** `diff`'s `status=new` + only means "absent from the one `--base` version" — a signature missing from the immediate baseline + but present two releases back is falsely flagged. Use `newcrashes --priors v1,v2,v3,v4` for a real + new-crash list, treat `frameKind=native` (`new-native?`) rows as unconfirmed (build-unique + signatures), and triage on `newDevices` (a 1-device crash-loop ≠ a fleet regression). Treat + java-frame `genuinely-new` with `newPer1k < ~0.02` as low-priority, not a HOLD. +- **Obfuscated frames** (e.g. `com.c.b.b.bSS.loadClass` → `ClassNotFoundException "Didn't find + class …"`) are typically **repackaged/tampered APKs**, not first-party bugs. Call them out as + such; their movement is about sideload/tamper prevalence, not app quality. + +## Phase 2 (deferred): Google Play Console crash export + +Adds a second store-side source. Requires a **GCP-created service account** with Play Console +access (centrally owned at Microsoft — request through the Play Console admin). Once granted, +pull either the Reporting API (`vitals.errors`) or the GCS/BigQuery crash export. Play still +gives weaker per-crash detail than App Center, so it would supplement, not replace, App Center. +Not implemented yet — leave a "Play Console: not yet wired" note in the report's appendix. diff --git a/.github/skills/release-monitoring-report/assets/docs/investigation-patterns.md b/.github/skills/release-monitoring-report/assets/docs/investigation-patterns.md new file mode 100644 index 00000000..e7bd2b5e --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/docs/investigation-patterns.md @@ -0,0 +1,228 @@ +# Investigation patterns — is the move real, and is it the release? + +The diagnostic methodology the report applies whenever a metric moves between two versions. +A KPI delta is a *question*, not a verdict. 
Before any table row turns into "regression" prose +(or a HOLD), run the relevant patterns below to separate a **real, release-caused** regression +from volume growth, population skew, a benign outcome, or an environment shift the app didn't cause. + +These complement the broker-specific drill-downs already in the workflow (Step 3c, host-app +device-share vs per-span, release-PR correlation). The patterns here generalize to **both** apps +and are the default lens for the Authenticator scenario/crash sections. + +## The one-line tests + +| # | Pattern | The test | Reads as "not the release" when… | +|---|---------|----------|----------------------------------| +| P1 | **Count vs rate** | Normalize every error/outcome count to its initiation denominator before reacting. | Raw count is up but **rate per initiation is flat**. | +| P2 | **Version attribution vs substitution** | Compare the **rate on the new build vs the old build**, not the new build's count over time. | New-build rate ≈ old-build rate — the new version is just absorbing rollout volume. | +| P3 | **Code-frozen control** | Recompute the *same* rate on the **previous/stable version** (its code is frozen). | The rate **also rose on the frozen build** → environmental (OS / Play Services / Credential Manager / server / eSTS), not this release. | +| P4 | **Rollout-cohort effect** | Re-check on **matched higher-volume days** and after the cohort broadens. | Gap shrinks as volume grows / cohort widens → early-adopter skew, not a defect. | +| P5 | **Benign-vs-real classification** | Decompose the failure numerator by reason; split benign/expected and abandonment from true defects. | The growth is **benign** (duplicate / user-cancelled), and the **defect-only** rate is flat. | +| P6 | **Dimensional decomposition** | Break the metric by `AppVersion × OsLevel × DeviceInfoMake`. | Concentrated on **one OS across all OEMs** → platform driver; tracks the **new version's ramp** → substitution. 
| +| P7 | **Drill to the sub-code** | Go from the MV `Error` reason down to the raw structured sub-code. | The movement is a **soft/user code** (cancel) not a **hard** one (lockout / hw / crypto). | +| P8 | **Telemetry ↔ code via release-tag diff** | `git diff <prevTag>..<newTag>` over the feature paths to see if the **decision/gate logic actually changed**. | Gate logic is **byte-for-byte unchanged** → the shift is funnel/population/environment, not logic. | +| P9 | **MV introspection** | Read the MV definition before drilling so you query the right raw events. | (enabler for P5–P8 — tells you exactly what each metric counts.) | + +--- + +## P1 — Count vs rate +Raw error/failure **counts rise with traffic**. A bigger day, a marketing push, or a new OS cohort +all inflate counts without anything regressing. **Always** divide by the matching denominator +(initiations for that scenario; active devices for crashes) and reason about the **rate**. +- The Authenticator `*_Errors_MV_V1` views give `ErrorCount` only — pair every Errors-MV pull with + the outcomes MV's `Initiated` (or `auth-scenario-initiates.kql`) for the same version/day grain. +- If you only have a count time series and it is climbing, that is the *start* of an investigation, + never the conclusion. + +## P2 — Version attribution vs substitution +During a rollout the new build's **count** climbs simply because it is taking over traffic; the old +build's count falls in lockstep. This is **substitution**, not regression. The honest comparison is +the **per-initiation rate on the new build vs the old build over the same window**. If +`rate(new) ≈ rate(old)`, the release did not move the metric — you are watching the denominator +move. Only a genuine **rate** gap on the new build is attributable to the release. 
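The P1/P2 comparison can be sketched in Node as below. This is a minimal illustration, not part of the skill's scripts: the field names (`initiated`, `failed`), the sample counts, and the 0.5 pp tolerance are all assumptions chosen for the example, so map them to the real MV columns and pick a tolerance that fits the scenario's volume.

```javascript
// Sketch only. `initiated` / `failed` are hypothetical field names standing in
// for the outcomes-MV columns; the counts below are made-up sample data.

// P1: always reason about the rate, never the raw count.
function ratePerInitiation(row) {
  if (!row.initiated) return null; // no denominator, no rate
  return row.failed / row.initiated;
}

// P2: compare the new build's rate with the old build's rate over the same
// window. `tolerancePp` (percentage points) is an arbitrary illustrative default.
function classifyMove(oldRow, newRow, tolerancePp = 0.5) {
  const oldRate = ratePerInitiation(oldRow);
  const newRate = ratePerInitiation(newRow);
  if (oldRate === null || newRate === null) {
    return { verdict: 'no-denominator' };
  }
  const gapPp = (newRate - oldRate) * 100;
  const verdict = Math.abs(gapPp) <= tolerancePp ? 'substitution' : 'rate-gap';
  return { oldRate, newRate, gapPp, verdict };
}

// The two builds fail at nearly the same per-initiation rate, so whatever the
// raw counts do during the rollout, this reads as substitution.
const oldBuild = { version: '6.2605.3042', initiated: 200000, failed: 4000 };
const newBuild = { version: '6.2606.3817', initiated: 50000, failed: 1010 };
console.log(classifyMove(oldBuild, newBuild).verdict);
```

A `rate-gap` verdict here is only the entry ticket to P3 and later patterns; it does not by itself attribute the move to the release.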
+ +## P3 — Code-frozen control (the most decisive test) +The previous stable version is a **natural control**: its code is frozen, so anything that moves *its* +rate over the same window is environmental. So when the new build's rate is up: +1. Recompute the identical rate for the **previous version** over the same days. +2. If the previous build's rate **also rose**, the cause is outside the app — an OS update wave, + Google Play Services / Credential Manager change, a server/eSTS change, or a fleet-wide condition. + Do **not** attribute it to the release. +3. If the previous build's rate is **flat** while the new build's is up, the release is now a real + suspect — proceed to P5–P8. + +## P4 — Rollout-cohort effect +A young version's users are **not representative**: early adopters skew toward engaged/power users and +specific device/OS segments, and early daily volume is small (high variance). Both inflate or distort +rates (e.g. already-enrolled users disproportionately hitting "already registered"; a thin day swinging +several points on a handful of events). +- Re-measure on **matched, higher-volume days**; widen the window so the baseline cohort grows. +- Expect an upgrade-driven spike to **decay** toward baseline as the cohort re-auths/re-syncs; a true + regression holds a **steady gap**. Report the **residual after decay**, not the headline-day delta. + +## P5 — Benign-vs-real classification +An MV's `Failed`/`Succeeded` split is a **funnel outcome**, not a defect count. `Failed` routinely +bundles outcomes that are *working as intended*. Decompose the failure numerator (via the +`*_Errors_MV_V1` companion, then raw) and bucket each reason: +- **Benign / expected** — e.g. a duplicate/"already registered" outcome (the user already has the + credential). Up because more users are *already enrolled*, not because anything broke. +- **Abandonment** — user/system **cancellation** (e.g. a biometric prompt dismissed). A UX/engagement + signal, not a code defect. 
+- **True defect** — keystore/ECC/crypto, network, server, serialization, null-state, timeout. + +Report a **defect-only rate** (exclude benign + abandonment from the numerator). A headline +success-rate "drop" that is **entirely** benign/abandonment is not a quality regression — say so, and +keep the raw rate in an appendix for transparency. + +## P6 — Dimensional decomposition +Break the moving metric by `AppVersion × OsLevel × DeviceInfoMake` (the Errors MVs carry all three). +The shape of the concentration is the attribution: +- **Concentrated on one `OsLevel`, broad across OEMs** → an **OS-platform** driver (new Android major, + Play Services, Credential Manager), often amplified by that OS version's **adoption growth** — so + confirm with P3 (frozen build rose too) and check whether the OS cohort itself is expanding. +- **Concentrated on one `DeviceInfoMake`/model** → a **device/vendor** bug (keystore, biometric HAL). +- **Tracks the new `AppVersion`'s ramp** (up on new, down on old, rate flat) → **substitution** (P2). +- **Broad and proportional everywhere** → a real, code-caused regression candidate → P7/P8. + +## P7 — Drill to the sub-code +The Errors MV's `Error` is often a coarse bucket (e.g. a single "device auth failed" reason) that hides +very different root causes. The raw `passkeyoperations` table carries the **structured sub-code** in +`AllProperties` (e.g. `DeviceUnauthenticatedErrorCode` = the Android `BiometricPrompt` code). Drill in +to tell a **soft** outcome from a **hard** one: + +| Android `BiometricPrompt` code | Meaning | Bucket | +|---|---|---| +| 10 `ERROR_USER_CANCELED`, 13 `ERROR_NEGATIVE_BUTTON` | user dismissed | abandonment (P5) | +| 14 `ERROR_NO_DEVICE_CREDENTIAL` | no screen lock (PIN/pattern/password) set up on the device | device/enrollment (P6) | +| 5 `ERROR_CANCELED` | system cancelled (e.g.
app backgrounded) | abandonment (P5) | +| 1 `ERROR_HW_UNAVAILABLE`, 2 `ERROR_UNABLE_TO_PROCESS`, 11 `ERROR_NO_BIOMETRICS` | hardware/enrollment | true defect/device (P6) | +| 7 `ERROR_LOCKOUT`, 9 `ERROR_LOCKOUT_PERMANENT` | too many attempts | true (often device/user) | + +If ~all of a reason's growth is the soft codes, the "failure" rise is abandonment, not a defect. + +## P8 — Telemetry ↔ code via release-tag diff +When P3 says the new build's rate genuinely moved and you suspect the app, **prove it against the +diff** before naming it a code regression. The Authenticator app is its **own git repo** +(`authenticator/`, base branch `working`; tags like `6.2606.3817` and `v6.2605.3042` — note the +inconsistent `v` prefix). Diff the two release tags scoped to the feature's source paths: + +```powershell +cd authenticator +git --no-pager diff v6.2605.3042..6.2606.3817 -- ` + PhoneFactor/app/src/main/java/com/microsoft/authenticator/passkeys/ +``` + +- If the **decision/gate logic is unchanged** (the file that emits the moving reason isn't in the + diff, or only unrelated lines changed — a flight rename, a validator swap), the shift is + **funnel/population/environment**, not this release. State that explicitly. +- If the gate logic **did** change, you now have a concrete suspect commit/PR to name with honest + confidence. (For broker/common-triggered eSTS codes, use the existing Step 3c PR-correlation flow — + `find-suspect-prs.ps1 -Range v<PREV>..v<NEW>` over `broker/`+`common/`.) + +## P9 — MV introspection +Before drilling, read what a metric actually counts so you query the right raw events: + +```kusto +.show materialized-view Passkey_WebAuthN_Registration_MV_V1 | project Query +``` + +This reveals the **source table**, the `OperationName`/`requestType`/`PasskeyFlow` filters, and how +`Initiated/Succeeded/Failed` are derived — e.g. the Registration MVs count only +`CreatePasskeyCredentialRequest`, the Authentication views only `GetPasskeyCredentialRequest`. 
Knowing +the exact filter prevents drilling into the wrong request family when you go to the raw table. + +## P10 — Crash version-attribution (is a crash NEW / caused by this release?) +The crash analogue of P2/P3. A crash that *looks* new on the rolling-out build is the #1 false alarm, +because App Center's per-version `firstOccurrence` is the version's **rollout date**, not the +signature's app-history first-seen — so a years-old crash shows a first-seen *inside* your window and +a per-1k rate that "rose." Climb this before any crash row becomes "new"/"regressed" prose: + +1. **Anti-join against a UNION of priors, not the single baseline (the crash P2).** `diff status=new` + only means "absent from `--base`"; a signature can skip the immediate baseline yet live two releases + back. Run `fetch-appcenter-crashes.js newcrashes --version <new> --priors v1,v2,v3,v4`. Only + `genuinely-new` (absent from **all** priors in the 27-day window) is a real new-crash candidate. +2. **Discount native/hex-frame signatures.** A raw-address frame (`0x… + …`) or bare signal + (`SIGABRT`/`SIGSEGV`/`minidump`) relocates per build, so it *always* anti-joins as new — `newcrashes` + tags these `new-native?`. Judge them by `enrich` (OS/model + count + trend), never the signature. +3. **Cross-version confirm (the crash P3 / code-frozen control).** For any suspect signature run + `signature --match <frame> --version <new> --priors …`. If it is present and still firing on the + **previous** versions, it is **pre-existing/environmental** (okhttp disk-cache, OEM/OS, tamper APK), + not this release. Real example: the okhttp `FileSystem$1.rename:87` `IOException` showed + `firstOccurrence=` rollout day and per-1k 0.016→0.043 on 6.2606, but `signature` found it on every + version back to 6.2601 (peaking at 8,300 crashes on 6.2602) — a denominator/upload-lag artifact, not + a regression. +4. 
**Read count + devices next to the rate.** A young build's small active-device base inflates per-1k + even when the raw **count** is in-family with priors; and a high `newCount` on `newDevices=1` is a + single device crash-looping, not a fleet regression. Lead with per-1k, but corroborate with both. +5. **If a java-frame `genuinely-new` survives, prove it (P8).** A real, high-impact new java signature + (e.g. a crypto-lib `NoSuchFieldError` cluster) should correlate to a dependency/code change in the + `authenticator/` `git diff <prevTag>..<newTag>` before it earns a WATCH/HOLD. +6. **Attribute it to the CALLER, searching the exception token — not the crashing class.** A crash + frame names the object the runtime inspected when it threw (the **victim**), which is usually *not* + the file that broke — some **caller** handed it to a failing API. So set `-Symbol` to the + **exception/API token from the stack** (`EntryPoints.get`, `GeneratedComponent`), not the crashing + class, and **always** add `-DiffGrep` (a `git log -G` over the diff text) because `--grep` sees only + the commit subject and a culprit PR rarely names the subsystem it broke. Map the frame's package to + its repo (`com.microsoft.authenticator.*`/`bastion.*`/`onlineid.*` → `authenticator/`; + `identity.common…` → `common/`; broker → `broker/`; a `dagger.*`/`androidx.*` framework frame → the + first-party caller in the app repo) and correlate over the **app's own tag range**: + ```powershell + & "$S\find-suspect-prs.ps1" -Repos authenticator -Range 6.2606.3817..6.2606.4029 ` + -Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints|GeneratedComponent' + # secondary: path-log the DI graph (a module changing what it provides breaks consumers silently): + git -C authenticator log --oneline 6.2606.3817..6.2606.4029 -- **/di/ **/dagger/ **/*Module.kt + ``` + Emit a crash `code-attr` card in `#auth-stability` (`origin-app` for a confirmed first-party + regression). 
**Environmental is the LAST resort, not the first:** only after the exception-token + pickaxe, the `-DiffGrep` scan, AND the DI path log all come back empty consider an OS × build-config + interaction (shrinker/`targetSdk`/Play Services) — and an OS-concentration signal alone is **not** + evidence for it (a real caller bug concentrates on the newest OS too). Verified the hard way: the new + `dagger.hilt.EntryPoints.get` crash on `MfaAuthDialogActivity` (6.2606.4029, 66.7% Android 16) *looked* + like an "Android-16 × shrinker" issue and a `-Symbol MfaAuthDialogActivity` search found nothing — but + `-Symbol 'EntryPoints.get' -DiffGrep` surfaced **PR 15896454** ("TOTP Secret Fix") on the first try: it + added an `EntryPoints.get(applicationContext, …)` call whose context, bound from a legacy `ContextModule`, + was actually the dialog **Activity** → the victim Activity isn't a Hilt component holder → crash (fix: + PR 16249408, AB#3677526). A real `origin-app` regression, not environmental. (Full recipe + victim-vs- + culprit rule in `crash-sources.md` § "Crash → PR / code attribution".) + +--- + +## The drill ladder (Authenticator) +Each scenario has three layers — climb down only as far as the question needs: + +1. **Outcomes MV** (`*_MV_V1`) — `Initiated/Succeeded/Failed (+DCount)` → the headline rate (P1/P2). +2. **Errors MV** (`*_Errors_MV_V1`) — `Failed` broken by `Error × OsLevel × AppVersion × DeviceInfoMake` + (`ErrorCount`/`ErrorDCount`) → reason + dimensional attribution (P5/P6). Pair with the outcomes MV + for the denominator. +3. **Raw `passkeyoperations`** — `OperationName`, `AppInfo_Version`, `DeviceInfo_OsVersion`, + `DeviceInfo_Make`, `DeviceInfo_Id`, and the `AllProperties` JSON (`RequestType`, `PasskeyFlow`, + `Error`, `ErrorSource`, `IsCrossDevice`, `DeviceUnauthenticatedErrorCode`, …) → the structured + sub-code (P7). `osLevel = tostring(split(DeviceInfo_OsVersion, " ")[0])`; `todynamic(AllProperties)` + to read keys. 
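For layer 3, a rough Node sketch of the same two reads (`osLevel` from `DeviceInfo_OsVersion`, keys out of `AllProperties`) applied to raw rows that were already exported to JSON. It assumes each row is an object keyed by the column names listed above with `AllProperties` still a JSON string; the export shape and the sample values in the usage lines are hypothetical.

```javascript
// Sketch only: assumes rows exported as objects keyed by the raw column names,
// with AllProperties kept as a JSON string. Sample values are hypothetical.

// Mirrors tostring(split(DeviceInfo_OsVersion, " ")[0]) from the ladder above.
function osLevel(osVersion) {
  return String(osVersion ?? '').split(' ')[0];
}

// Counts rows per "osLevel|subCode" so soft vs hard codes can be read per OS.
function tallySubCodes(rows) {
  const counts = new Map();
  for (const row of rows) {
    let props = {};
    try {
      // Equivalent of todynamic(AllProperties); fall back to {} on null.
      props = JSON.parse(row.AllProperties || '{}') ?? {};
    } catch {
      // Unparseable payload: keep props empty so the row is still counted.
    }
    const code = props.DeviceUnauthenticatedErrorCode ?? 'none';
    const key = `${osLevel(row.DeviceInfo_OsVersion)}|${code}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const sample = [
  { DeviceInfo_OsVersion: '16 (sample)', AllProperties: '{"DeviceUnauthenticatedErrorCode":10}' },
  { DeviceInfo_OsVersion: '16 (sample)', AllProperties: '{"DeviceUnauthenticatedErrorCode":10}' },
  { DeviceInfo_OsVersion: '15', AllProperties: '{"DeviceUnauthenticatedErrorCode":7}' },
];
console.log([...tallySubCodes(sample)]);
```

The tally deliberately stops short of bucketing codes into soft or hard; apply the P7 table to the resulting keys so the classification stays in one place.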
+ +## Decision flow (apply to any moving KPI) +``` +metric moved + → P1 normalize to a rate ─ rate flat? ─────────────────► volume only, not the release + → P2 new-build rate vs old-build rate ─ equal? ────────► substitution, not the release + → P3 did the frozen (previous) build's rate move too? ─ yes? ► environmental (OS/PlayServices/eSTS) + → P4 holds on higher-volume days / after broadening? ── no? ─► early-cohort skew, re-check later + → P5 decompose reasons ─ growth is benign/abandonment? ► not a quality regression (report defect-only) + → P6 one OS across OEMs? one OEM? tracks new version? ─► platform / device / substitution + → P7 drill to sub-code ─ soft (cancel) vs hard? ───────► classify + → P8 release-tag diff ─ gate logic unchanged? ─────────► funnel/population, not code + → still a steady, code-correlated, defect rate gap? ──► REAL regression → verdict WATCH/HOLD +``` +Only the bottom rung earns regression prose. Everything above it is a reason the headline delta is +**not** a release-caused quality regression — name which one in the verdict. + +**For a moving crash** (App Center), apply **P10** instead of P5–P7: anti-join `newcrashes` against a +union of priors, discount native/hex frames, cross-version-confirm with `signature`, and read +count+devices beside the per-1k rate. A spike-then-decay trend, a pre-existing cross-version signature, +a single-OEM/obfuscated frame, or a `newDevices=1` crash-loop is **not** a HOLD; an OS-concentrated, +broad-across-models, java-frame `genuinely-new` signature that holds a steady gap is. Once confirmed +new/rising, **attribute it** (P10 step 6): set `-Symbol` to the **exception token** (not the crashing +class — that's the victim), add `-DiffGrep`, correlate over the app's own tag range with +`find-suspect-prs.ps1 -Repos authenticator`, and emit an `origin-app` crash `code-attr` card. 
Treat an +`origin-android`+`origin-env` (OS × build-config) verdict as the **last** resort — only after the +exception-token, diff-grep, and DI path-log searches all come back empty — never off an OS-concentration +signal or an empty crashing-class search alone. diff --git a/.github/skills/release-monitoring-report/assets/docs/kusto-cheatsheet.md b/.github/skills/release-monitoring-report/assets/docs/kusto-cheatsheet.md new file mode 100644 index 00000000..d4eab9a8 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/docs/kusto-cheatsheet.md @@ -0,0 +1,132 @@ +# Kusto operational runbook (release monitoring) + +How to actually *run* the queries and feed the results into the report. The per-query +purpose + scenario→MV→column catalog lives in [`../queries/README.md`](../queries/README.md); +this file is the mechanics. + +## Table of contents +- [Auth + prerequisites](#auth--prerequisites) +- [Running a query (run-kql.ps1)](#running-a-query-run-kqlps1) +- [The output JSON shape](#the-output-json-shape) +- [Filling tokens in a .kql before running](#filling-tokens-in-a-kql-before-running) +- [End-to-end loop](#end-to-end-loop) +- [Version resolution recipes](#version-resolution-recipes) +- [Hard gotchas](#hard-gotchas) + +## Auth + prerequisites +- `az login` must be current (`az account show`). Kusto access is via the caller's Entra token. +- Node (for `compare-versions.js`) — `node -v`. Python is NOT assumed present. +- Both clusters are read with the **same** `run-kql.ps1`; only `-Cluster`/`-Database` differ. + +## Running a query (run-kql.ps1) +`run-kql.ps1` defaults to the **Broker** cluster/db. Pipe or pass a query; capture JSON. 
+ +```powershell +$S = ".github/skills/release-monitoring-report/assets/scripts" +$Q = ".github/skills/release-monitoring-report/assets/queries" + +# Broker (defaults). The query text lives in $kql, not $q: PowerShell variable +# names are case-insensitive, so assigning $q would overwrite the $Q folder path. +$kql = Get-Content "$Q\broker-adoption.kql" -Raw +& "$S\run-kql.ps1" -Query $kql -Out "$DATA\broker-adoption.json" + +# Authenticator (override cluster + db); load the Authenticator query into $kql first +& "$S\run-kql.ps1" -Query $kql -Out "$DATA\auth-mfa-pn.json" ` + -Cluster https://idsharedeus2.eastus2.kusto.windows.net ` + -Database d496be22d62a46b0a3cf67ea2e736fd8 +``` + +`run-kql.ps1` writes the JSON to the **mandatory `-Out` path** (it does not stream to stdout). +`$DATA` = the `_data/<slug>-<date>` folder `bootstrap-report.ps1` created. Keep raw payloads +there so the report is reproducible. + +## The output JSON shape +`run-kql.ps1` emits **array-form**, first row = column names: + +```json +{ "results": { "items": [ ["broker_version","devices"], ["16.1.0", 76400000], ["16.0.1", 585300000] ] } } +``` + +`compare-versions.js` reads exactly this. Do not reshape it. + +## Filling tokens in a .kql before running +Templates carry `<TOKENS>` (see queries/README → token convention). Substitute in PowerShell: + +```powershell +# $Q = the assets/queries folder; keep the query text in $kql ($q would clobber $Q) +$kql = (Get-Content "$Q\broker-top-errors-by-version.kql" -Raw). + Replace('<FIRST>','16.1.0').Replace('<SECOND>','16.0.1'). + Replace('<START>','2026-06-01').Replace('<END>','2026-06-15') +``` + +For Broker `<VERSIONS>` (multi-version filter) substitute a dynamic literal: +`.Replace('<VERSIONS>','dynamic(["16.1.0","16.0.1","16.2.0"])')`. +For `<DCOUNT>` use `true` (distinct-device columns) or `false` (raw counts). + +## End-to-end loop +1. **Resolve versions** — run `broker-adoption.kql` and the cheap Authenticator resolver + ([recipe below](#version-resolution-recipes)). Pick `<FIRST>` (rolling out) and `<SECOND>` + (previous, by volume) unless the user named them. "Baseline = all versions" → omit the + version filter / pass every version in `<VERSIONS>`. +2. **Pull** each query for both apps into `$DATA\*.json`.
**Compare** — feed the version-per-row payloads to `compare-versions.js rows`, and the + error-movers payload to `compare-versions.js movers --lower-is-better true` (error-share + growth is bad). See script header for flags. +4. **Fill** the bootstrapped HTML in place with the real numbers + the verdict the deltas imply. +5. **Validate** — `validate-report.ps1 -Path <file> -BrokerVersion <bv> -AuthVersion <av>`. + +## Version resolution recipes +**Authenticator (cheap, validated)** — avoid `union *`; read a high-volume MV: +```kusto +Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1 +| where EventDate >= datetime(<START>) and EventDate <= datetime(<END>) +| where isnotempty(AppVersion) +| summarize Devices = sum(NotificationInitiatedDCount) by AppVersion +| order by Devices desc +``` +Newest `6.YYMM.BUILD` (highest `YYMM`) = current train. The two highest-volume recent +versions are usually `<FIRST>`/`<SECOND>`. + +**Broker** — `broker-adoption.kql` already returns `dcount` devices by `broker_version`; +sort desc and read off the top two. + +## Hard gotchas +- **Distinct devices (Broker):** `dcount_hll(hll_merge(countDevicesHll))`. Never `sum(countDevices)`. +- **Percentiles (Broker):** `percentiles_array_tdigest(tdigest_merge(responseTimeTDigest), …)`. + Never average/sum percentiles across rows. +- **Raw dims (Broker):** wrap with `MergeAccountType()` / `MergeIsSharedDevice()` if you add + account-type / shared-device filters. +- **Broker per host app:** Broker MVs (`ErrorStatsMetrics`, `*AuthStats*Metrics`, + `BrokerAdoptionStatsUpdated`) carry `active_broker_package_name` (the host app acting as broker) + and `AppInfo_Version`. For a given host package, `AppInfo_Version` **is that host app's version** + — e.g. for `com.azure.authenticator` it equals the Authenticator `AppVersion` (`6.2606.3817`), + NOT `broker_version`. 
To isolate "the broker as it runs inside one app's release", filter + `active_broker_package_name == "<pkg>"` and compare by `AppInfo_Version`. Don't attribute a + fleet-wide `broker_version` delta to an app — it can be dominated by another host (Link to + Windows `com.microsoft.appmanager` ≈122 M devices). +- **Device-share masks per-span spikes:** `broker-top-errors-by-host-app.kql` is a device-share + (devices hitting code X anywhere ÷ devices on that version) — it dedups a device across all spans. + A code can read flat/down there while its **per-request** rate climbs inside one `span_name` (seen: + `invalid_grant` on `AcquireTokenSilent` rose +1.19 pp while its device-share fell). Re-slice with + `broker-errors-by-host-app-span.kql` (request-level rate per span) before writing "no regression", + and separate an early-rollout spike-that-decays from a steady gap with a daily trend (rate by + version by day). For eSTS-returned codes (`invalid_grant`/`interaction_required`) correlate the + trigger to a PR in the bundled broker version range: `git log v<PREV>..v<NEW>` in `broker/`+`common/`, + then `find-suspect-prs.ps1 -Range`; weight device-PoP/PRT/cache changes. +- **Authenticator outcomes:** Registration/Auth MVs have only `Initiated/Succeeded/Failed` + (+`…DCount`) — no `Cancelled`/`PartiallySucceeded`. PN completion needs the two-table join + (init MV ⋈ `_Results_MV_V1`). +- **MSA NGC vs SA:** both the MSA PN init MV and its results MV carry `IsNGC` + (`"true"`=NGC, `"false"`=SA) — filter both join sides. +- **Volume guard:** treat scenarios with < ~1K initiates as noise, not a regression + (`compare-versions.js` `--volume-floor`). Always pull initiate volume alongside rates. 
+- **A moved metric is a question, not a verdict:** before calling any version-over-version delta a + regression, run the diagnostic ladder in [`investigation-patterns.md`](investigation-patterns.md) — + normalize count→rate, compare new-build vs old-build rate (substitution), the **code-frozen control** + (did the previous version's rate move too → environmental), dimensional decomposition via the + `*_Errors_MV_V1` companions, benign-vs-defect classification, raw `passkeyoperations` sub-code drill, + and the `git diff <prevTag>..<newTag>` gate-logic check. +- **Know what an MV counts before drilling:** `.show materialized-view <Name> | project Query` prints + its source table + `OperationName`/`RequestType`/`PasskeyFlow` filters — so you drill the right raw + request family. Every Authenticator scenario also has a `*_Errors_MV_V1` companion (reason × OsLevel + × AppVersion × DeviceInfoMake) for the "why". +- **UTF-8 trap:** never write report HTML through a PowerShell `@'…'@` heredoc (strips + emoji/arrows). Use `node fs.writeFileSync` or + `[IO.File]::WriteAllText($p,$t,[System.Text.UTF8Encoding]::new($false))`. diff --git a/.github/skills/release-monitoring-report/assets/queries/README.md b/.github/skills/release-monitoring-report/assets/queries/README.md new file mode 100644 index 00000000..86e83555 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/queries/README.md @@ -0,0 +1,147 @@ +# Release-monitoring query catalog + +Templates for **version-over-version** release monitoring of the Android **Broker** and the +**Authenticator** app. Every `.kql` here is a placeholder template — substitute the +`<TOKENS>` before running. All were validated against live Kusto. + +## Clusters + +| App | Cluster | Database | Version dimension | Time column | +|-----|---------|----------|-------------------|-------------| +| Broker | `https://idsharedeus2.kusto.windows.net` | `ad-accounts-android-otel` | `broker_version` (e.g. 
`16.1.0`) | `EventInfo_Time` | +| Authenticator | `https://idsharedeus2.eastus2.kusto.windows.net` | `d496be22d62a46b0a3cf67ea2e736fd8` | `AppVersion` (e.g. `6.2606.3817`) | `EventDate` (MVs) / `EventInfo_Time` (raw `union *`) | + +`run-kql.ps1` defaults to the Broker cluster/db. For Authenticator pass +`-Cluster https://idsharedeus2.eastus2.kusto.windows.net -Database d496be22d62a46b0a3cf67ea2e736fd8`. + +## Shared token convention + +`<FIRST>` = version **rolling out** · `<SECOND>` = **previous / baseline** version · +`<START>` `<END>` = `yyyy-mm-dd` window bounds · `<DCOUNT>` = `true` → distinct-device +(`…DCount`) columns, `false` → raw event counts · `<VERSIONS>` = a `dynamic([...])` list +used by the Broker templates that filter many versions at once. + +## Broker queries + +| File | Purpose | Key tokens | +|------|---------|-----------| +| `broker-adoption.kql` | Distinct devices per `broker_version`. **Run first** to resolve exact version strings + pick `<FIRST>`/`<SECOND>` by volume. | `<START> <END>` | +| `broker-error-rate-by-version.kql` | Headline overall **device error rate** per version (devices hitting any non-success error ÷ total devices). | `<VERSIONS> <START> <END>` | +| `broker-reliability-by-version.kql` | Silent + Interactive reliability (request and device) per version from the canonical `*AllRequestsMetrics` / `*RequestsWithoutExpectedErrorMetrics` MVs. | `<VERSIONS> <START> <END>` | +| `broker-top-errors-by-version.kql` | **The "why".** Per-`error_code` device + request counts on `<FIRST>` vs `<SECOND>` with device-share delta (pp). Top regressions/improvements. | `<FIRST> <SECOND> <START> <END>` | +| `broker-latency-by-version.kql` | P50/P75/P90/P95/P99 of `responseTime` per version (optionally one `span_name`). | `<VERSIONS> <START> <END>` | +| `broker-by-host-app.kql` | **Broker scoped to ONE host app**, compared by that app's version. 
Headline device error rate + silent/interactive reliability for `active_broker_package_name == <PACKAGE>`, keyed on `AppInfo_Version` (= the host app's version). | `<PACKAGE> <FIRST> <SECOND> <START> <END>` | +| `broker-top-errors-by-host-app.kql` | The "why" for the host-scoped view: per-`error_code` device-share delta for one host app's two versions. | `<PACKAGE> <FIRST> <SECOND> <START> <END>` | +| `broker-errors-by-host-app-span.kql` | **Span drill-down — the complement of the device-share movers.** Per-`span_name` **request-level** rate (errored ÷ total in that span) for a specific code list, one host app, two versions. Surfaces a per-span spike that device-share dedup hides. | `<PACKAGE> <FIRST> <SECOND> <CODES> <START> <END>` | + +**Broker attributed to a host app (e.g. Authenticator):** the Broker runs *inside* a host app, +and the Broker MVs also carry `active_broker_package_name` (the host) and `AppInfo_Version` +(which, for a given host package, **is that host app's version** — e.g. for +`com.azure.authenticator`, `AppInfo_Version == 6.2606.3817` is the Authenticator AppVersion, not +`broker_version`). Use the two `*-by-host-app.kql` templates to answer *"did the Authenticator +rollout move the broker?"* without contamination from other hosts. This matters because +fleet-wide `broker_version` deltas can be dominated by a host you are **not** shipping — e.g. Link +to Windows (`com.microsoft.appmanager`, ≈122 M devices) can swing an aggregate `io_error` figure +that has nothing to do with the Authenticator release. Top hosts by volume: `com.microsoft.appmanager`, +`com.azure.authenticator`, `com.microsoft.windowsintune.companyportal`. + +**Broker gotchas:** distinct devices = `dcount_hll(hll_merge(countDevicesHll))` — never +`sum(countDevices)`. Never sum percentiles — `percentiles_array_tdigest(tdigest_merge(...))`. +`MergeAccountType()` / `MergeIsSharedDevice()` normalize the raw dimensions if you add filters. 
+**Device-share masks per-span request spikes:** `broker-top-errors-by-host-app.kql` dedups a device
+across all spans, so a code can read flat/down there while its per-request rate climbs inside one
+span (e.g. `invalid_grant` on `AcquireTokenSilent`). When an eSTS code is suspected, re-slice with
+`broker-errors-by-host-app-span.kql` and separate early-rollout decay from a steady gap via a daily
+trend before concluding. The trigger PR lives in the bundled broker version range — correlate with
+`assets/scripts/find-suspect-prs.ps1 -Range v<PREV>..v<NEW>`.
+
+## Authenticator queries
+
+| File | Purpose | Applies to |
+|------|---------|-----------|
+| `auth-version-resolve.kql` | Resolve candidate `AppVersion`s (newest `yymm` = current train). Auto-detect `<FIRST>`/`<SECOND>`. Uses `union *` (heavy) — prefer the cheap fallback below if it is slow. | all |
+| `auth-scenario-success-rate.kql` | Per-version Initiated/Succeeded/Failed + SuccessRate. The headline per scenario. | single-MV **Registration / Authentication** scenarios |
+| `auth-scenario-initiates.kql` | Per-version initiate volume (guards against reading noise as a regression). | any scenario (swap `<INIT_COL>`) |
+| `auth-pn-checkforauth-completion.kql` | Two-table join: notifications initiated vs results reaching a terminal `FinalResult`. CompletionRate / DropRate. | **PN + CheckForAuth** families (MFA / PSI / MSA) |
+| `auth-reacted-notification-split.kql` | Approved / Denied / Error split of reacted notifications. | **PN + CheckForAuth Results** families |
+| `auth-stats.kql` | Fleet/adoption stats: total devices, adoption-over-time, DAU, version share, OEM/OS/Country. Raw `union *`. | app-wide |
+| `authenticator-crash-denominator.kql` | Active devices for `<FIRST>` **and** `<SECOND>` in one query — the denominator for crashes-per-1k-active-devices (numerator from App Center). | crash/stability layer |
+
+### Crash / stability (Authenticator)
+
+Crash clusters are **not** in Kusto — pull them from **App Center** with
+`assets/scripts/fetch-appcenter-crashes.js`, then divide by the device counts from
+`authenticator-crash-denominator.kql` for an honest crashes-per-1k rate. Read
+`assets/docs/crash-sources.md` first; it covers auth/token, the `errorGroupId`-is-version-scoped and
+share-vs-rate gotchas, the App Center Analytics retirement, secret handling, and Play Console Phase 2.
+
+### Cheap version-resolution fallback
+
+`union *` in `auth-version-resolve.kql` scans every table. If it is slow, resolve versions
+from a high-volume MV instead (validated):
+
+```kusto
+Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1
+| where EventDate >= datetime(<START>) and EventDate <= datetime(<END>)
+| where isnotempty(AppVersion)
+| summarize Devices = sum(NotificationInitiatedDCount) by AppVersion
+| order by Devices desc
+```
+
+### Authenticator scenario → MV → column catalog
+
+Outcome columns each have a `…DCount` distinct-device twin. **Registration / Authentication
+MVs expose only `Initiated / Succeeded / Failed (+DCount)` and `TotalUniqueDevices` — there is
+NO `Cancelled` / `PartiallySucceeded` column.** PN MVs carry only an initiated counter; the
+terminal outcome lives in the paired `_Results_MV_V1`.
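
Because the Registration / Authentication MVs carry only `Initiated`, `Succeeded`, and `Failed`, any shortfall has to be derived rather than read from a column. The sketch below mirrors, in JavaScript, the arithmetic that `auth-scenario-success-rate.kql` applies to those counters. The helper name `outcomeRates` and the sample counts are hypothetical and shown only for illustration.

```javascript
// Hypothetical helper; the field names mirror the Initiated / Succeeded / Failed columns.
function outcomeRates({ initiated, succeeded, failed }) {
  // Initiates with no terminal outcome, clamped at zero like the KQL case() expression.
  const unknown = Math.max(initiated - (succeeded + failed), 0);
  // Percentage of initiates, rounded to two decimals; 0 when nothing was initiated.
  const pct = (n) => (initiated > 0 ? Math.round((n / initiated) * 10000) / 100 : 0);
  return {
    successRate: pct(succeeded),
    failureRate: pct(failed),
    unknownRate: pct(unknown),
  };
}

// Made-up counts for illustration.
const rates = outcomeRates({ initiated: 2000, succeeded: 1800, failed: 150 });
console.log(rates); // { successRate: 90, failureRate: 7.5, unknownRate: 2.5 }
```

Guarding the zero-initiate case keeps a version with no traffic from producing a division by zero in the comparison.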
+
+| Scenario | Registration/Auth MV (success-rate) | Initiate column | PN init MV | PN init column | PN results MV (`FinalResult`) | results init column |
+|----------|-------------------------------------|-----------------|-----------|----------------|------------------------------|---------------------|
+| Passkey WebAuthN Reg | `Passkey_WebAuthN_Registration_MV_V1` | `Initiated` | — | — | — | — |
+| Passkey InApp Reg | `Passkey_InApp_Registration_MV_V1` | `Initiated` | — | — | — | — |
+| Passkey WebAuthN Auth | `Passkey_WebAuthN_Authentication_MV_V1` | `Initiated` | — | — | — | — |
+| Entra MFA Reg (QR) | `Entra_MFA_Registration_QR_Code_Flow_MV_V1` | `Initiated` | — | — | — | — |
+| Entra MFA Reg (Manual/Non-QR) | `Entra_MFA_Registration_Manual_Flow_MV_V1` + `Entra_MFA_Registration_Non_QR_Code_Flow_MV_V1` | `Initiated` | — | — | — | — |
+| Entra MFA PN+CFA | — | — | `Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1` | `NotificationInitiated` | `Entra_MFA_Push_Notification_And_CheckForAuth_Results_MV_V1` | `RequestTimeInitiated` |
+| Entra PSI Reg | `Entra_PSI_Registration_MV_V1` | `Initiated` | — | — | — | — |
+| Entra PSI PN-Reg | `Entra_PSI_Push_Notification_Registration_MV_V1` | `RegistrationStarted` | — | — | — | — |
+| Entra PSI PN+CFA | — | — | `Entra_PSI_Push_Notification_And_CheckForAuth_MV_V1` | `NotificationInitiated` | `Entra_PSI_Push_Notification_And_CheckForAuth_Results_MV_V1` | `RequestTimeInitiated` |
+| MSA NGC Reg | `Entra_MSA_NGC_Registration_MV_V1` | `Initiated` | — | — | — | — |
+| MSA SA Reg | `Entra_MSA_SA_Registration_MV_V1` | `Initiated` | — | — | — | — |
+| MSA NGC/SA PN+CFA | — | — | `Entra_MSA_Push_Notification_And_CheckForAuth_MV_V1` | `NotificationReceivedInitiated` | `Entra_MSA_Push_Notification_And_CheckForAuth_Results_MV_V1` | `SessionTimeInitiated` |
+
+**MSA NGC vs SA split:** the MSA PN init MV **and** its results MV both carry `IsNGC`
+(`"true"` → NGC, `"false"` → SA). Apply the same `| where IsNGC == "..."` filter on both
+sides of the join.
+
+`FinalResult` ∈ {`Approved`, `Denied`, `Error`}. Completion = (Approved + Denied) ÷ initiated.
+
+### Drilling below the outcome MVs (the "why" behind a moved metric)
+
+The outcome MVs answer *what* (rate up/down); they do **not** explain *why*. Two layers sit beneath
+them — climb down per [`../docs/investigation-patterns.md`](../docs/investigation-patterns.md):
+
+1. **`*_Errors_MV_V1` companion (reason + dimension).** Essentially every scenario has one, named by
+   inserting `Errors` into the outcome MV name — e.g. `Passkey_WebAuthN_Registration_MV_V1` →
+   `Passkey_WebAuthN_Registration_Errors_MV_V1`, `Passkey_WebAuthN_Authentication_MV_V1` →
+   `Passkey_WebAuthN_Authentication_Errors_MV_V1` (PN families use `…_And_CheckForAuth_Errors_MV_V1`).
+   Schema is uniform: `EventDate, Error, OsLevel, AppVersion, DeviceInfoMake, ErrorCount, ErrorDCount,
+   TotalUniqueDevices`. This is the **reason breakdown of `Failed`**, already sliced by OS major and
+   OEM — exactly the dimensional decomposition (P6) and benign-vs-real classification (P5) the patterns
+   need. It carries **counts only**, so always pair it with the outcome MV's `Initiated` for the rate.
+
+2. **Raw `passkeyoperations` (structured sub-code).** When `Error` is a coarse bucket, the raw table
+   has the finer code. Key fields: `OperationName` (`PasskeyCredentialRequest{Initiated,Succeeded,
+   Failed}`, plus sub-operations like `PasskeyBeginGetCredential*`), `AppInfo_Version`,
+   `DeviceInfo_OsVersion` (`osLevel = tostring(split(DeviceInfo_OsVersion," ")[0])`), `DeviceInfo_Make`,
+   `DeviceInfo_Id`, `EventInfo_Time`, and `AllProperties` (JSON string — `todynamic()` it). Useful
+   `AllProperties` keys: `RequestType` (`CreatePasskeyCredentialRequest` = registration,
+   `GetPasskeyCredentialRequest` = authentication), `PasskeyFlow`
+   (`WEB_AUTH_N_REGISTRATION`/`WEB_AUTH_N_AUTHENTICATION`/`IN_APP_REGISTRATION`), `Error`,
+   `ErrorSource`, `IsCrossDevice`, `DeviceUnauthenticatedErrorCode` (Android `BiometricPrompt` code —
+   5/10/13/14 = abandonment, 1/7/9 = device/hard), `DeviceUnauthenticatedErrorMessage`, `Source`.
+
+3. **Know what a metric counts before you drill (P9):** `.show materialized-view <Name> | project Query`
+   reveals the source table and the `OperationName`/`RequestType`/`PasskeyFlow` filters — e.g.
+   Registration MVs count only `CreatePasskeyCredentialRequest`, Authentication only
+   `GetPasskeyCredentialRequest` — so you query the right request family in the raw table.
diff --git a/.github/skills/release-monitoring-report/assets/queries/auth-pn-checkforauth-completion.kql b/.github/skills/release-monitoring-report/assets/queries/auth-pn-checkforauth-completion.kql
new file mode 100644
index 00000000..b736a340
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/auth-pn-checkforauth-completion.kql
@@ -0,0 +1,41 @@
+// Authenticator — per-version PUSH-NOTIFICATION COMPLETION RATE (two-table join).
+// Generalizes the dashboard's "PN + CheckForAuth Success Rate" / "Push Notification
+// Completion Rate" tiles for Entra MFA, Entra PSI, and MSA NGC/SA.
+// Completion = (results where FinalResult in Approved|Denied) / (notifications initiated).
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8
+//
+// Per-family table + column wiring (set <INIT_MV> <INIT_COL> <RESULT_MV> <RESULT_COL> <FAMILY_FILTER>):
+//   Entra MFA : <INIT_MV>=Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1
+//               <INIT_COL>=NotificationInitiated <RESULT_MV>=Entra_MFA_Push_Notification_And_CheckForAuth_Results_MV_V1
+//               <RESULT_COL>=RequestTimeInitiated <FAMILY_FILTER>=(no filter)
+//   Entra PSI : <INIT_MV>=Entra_PSI_Push_Notification_And_CheckForAuth_MV_V1
+//               <INIT_COL>=NotificationInitiated <RESULT_MV>=Entra_PSI_Push_Notification_And_CheckForAuth_Results_MV_V1
+//               <RESULT_COL>=RequestTimeInitiated <FAMILY_FILTER>=(no filter)
+//   MSA NGC   : <INIT_MV>=Entra_MSA_Push_Notification_And_CheckForAuth_MV_V1
+//               <INIT_COL>=NotificationReceivedInitiated <RESULT_MV>=Entra_MSA_Push_Notification_And_CheckForAuth_Results_MV_V1
+//               <RESULT_COL>=SessionTimeInitiated <FAMILY_FILTER>=| where IsNGC == "true"
+//   MSA SA    : same as NGC but <FAMILY_FILTER>=| where IsNGC == "false"
+//
+// Placeholders: <FIRST> <SECOND> <START> <END> <DCOUNT> and the DCount column names
+// follow the "<col>DCount" convention.
+let dc = <DCOUNT>;
+let initiated = <INIT_MV>
+    <FAMILY_FILTER>
+    | where EventDate >= startofday(datetime(<START>)) and EventDate <= endofday(datetime(<END>))
+    | where AppVersion in ("<FIRST>", "<SECOND>")
+    | summarize NotificationInitiated = iff(dc, sum(<INIT_COL>DCount), sum(<INIT_COL>)) by AppVersion;
+let completed = <RESULT_MV>
+    <FAMILY_FILTER>
+    | where EventDate >= startofday(datetime(<START>)) and EventDate <= endofday(datetime(<END>))
+    | where AppVersion in ("<FIRST>", "<SECOND>")
+    | where FinalResult == "Approved" or FinalResult == "Denied"
+    | summarize Completed = iff(dc, sum(<RESULT_COL>DCount), sum(<RESULT_COL>)) by AppVersion;
+initiated
+| join kind=inner completed on AppVersion
+| extend
+    CompletionRate = round(todouble(Completed) / todouble(NotificationInitiated) * 100.0, 2),
+    DropRate = round((todouble(NotificationInitiated) - todouble(Completed)) / todouble(NotificationInitiated) * 100.0, 2)
+| project AppVersion, NotificationInitiated, Completed, CompletionRate, DropRate
+| order by AppVersion desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/auth-reacted-notification-split.kql b/.github/skills/release-monitoring-report/assets/queries/auth-reacted-notification-split.kql
new file mode 100644
index 00000000..e055478a
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/auth-reacted-notification-split.kql
@@ -0,0 +1,34 @@
+// Authenticator — per-version REACTED-NOTIFICATION outcome split (Approved/Denied/Error).
+// Generalizes the dashboard's "Reacted Notification Success Rate" / "PN Reacted
+// Notification Success Rate" tiles (Entra MFA, Entra PSI, MSA NGC/SA). Splits the
+// notifications the user reacted to into Approved / Denied / Error percentages.
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8
+//
+// Per-family wiring (set <RESULT_MV> <REACT_COL> <FAMILY_FILTER>):
+//   Entra MFA : <RESULT_MV>=Entra_MFA_Push_Notification_And_CheckForAuth_Results_MV_V1 <REACT_COL>=RequestTimeInitiated <FAMILY_FILTER>=(none)
+//   Entra PSI : <RESULT_MV>=Entra_PSI_Push_Notification_And_CheckForAuth_Results_MV_V1 <REACT_COL>=RequestTimeInitiated <FAMILY_FILTER>=(none)
+//   MSA NGC   : <RESULT_MV>=Entra_MSA_Push_Notification_And_CheckForAuth_Results_MV_V1 <REACT_COL>=SessionTimeInitiated <FAMILY_FILTER>=| where IsNGC == "true"
+//   MSA SA    : same as NGC but <FAMILY_FILTER>=| where IsNGC == "false"
+//
+// Placeholders: <FIRST> <SECOND> <START> <END> <DCOUNT>.
+let dc = <DCOUNT>;
+<RESULT_MV>
+| where FinalResult in ("Approved", "Denied", "Error")
+<FAMILY_FILTER>
+| where EventDate >= startofday(datetime(<START>)) and EventDate <= endofday(datetime(<END>))
+| where AppVersion in ("<FIRST>", "<SECOND>")
+| summarize Reacted = iff(dc, sum(<REACT_COL>DCount), sum(<REACT_COL>)) by AppVersion, FinalResult
+| summarize
+    Approved = sumif(Reacted, FinalResult == "Approved"),
+    Denied = sumif(Reacted, FinalResult == "Denied"),
+    Error = sumif(Reacted, FinalResult == "Error")
+    by AppVersion
+| extend Total = Approved + Denied + Error
+| extend
+    ApprovedRate = round(todouble(Approved) / todouble(Total) * 100, 2),
+    DeniedRate = round(todouble(Denied) / todouble(Total) * 100, 2),
+    ErrorRate = round(todouble(Error) / todouble(Total) * 100, 2)
+| project AppVersion, Total, Approved, Denied, Error, ApprovedRate, DeniedRate, ErrorRate
+| order by AppVersion desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/auth-scenario-initiates.kql b/.github/skills/release-monitoring-report/assets/queries/auth-scenario-initiates.kql
new file mode 100644
index 00000000..c2df3cf6
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/auth-scenario-initiates.kql
@@ -0,0 +1,21 @@
+// Authenticator — per-version INITIATES (volume) for a single-MV scenario.
+// Generalizes the dashboard's "<Scenario> Initiates" tiles. Use alongside the
+// success-rate query so the report can show whether a success-rate move is real or a
+// volume artifact (a regression on 5 initiates is noise; on 50k it is real).
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8
+//
+// Placeholders: <SCENARIO_MV> <FIRST> <SECOND> <START> <END> <DCOUNT> (see README catalog).
+// The initiate column differs per scenario family:
+//   most Registration/Auth MVs: Initiated / InitiatedDCount
+//   PSI PN-Registration MV: RegistrationStarted / RegistrationStartedDCount
+//   *_Push_Notification_And_CheckForAuth_MV_V1 (MFA/PSI): NotificationInitiated / NotificationInitiatedDCount
+//   MSA *_Push_Notification_And_CheckForAuth_MV_V1: NotificationReceivedInitiated / NotificationReceivedInitiatedDCount
+// Swap <INIT_COL>/<INIT_DCOUNT_COL> accordingly.
+let dc = <DCOUNT>;
+<SCENARIO_MV>
+| where EventDate >= startofday(datetime(<START>)) and EventDate <= endofday(datetime(<END>))
+| where AppVersion in ("<FIRST>", "<SECOND>")
+| summarize Initiates = iff(dc, sum(<INIT_DCOUNT_COL>), sum(<INIT_COL>)) by AppVersion
+| order by AppVersion desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/auth-scenario-success-rate.kql b/.github/skills/release-monitoring-report/assets/queries/auth-scenario-success-rate.kql
new file mode 100644
index 00000000..d3445020
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/auth-scenario-success-rate.kql
@@ -0,0 +1,38 @@
+// Authenticator — per-version SUCCESS RATE for a single-MV scenario.
+// Generalizes the dashboard's outcome-distribution tiles (Passkey WebAuthN / InApp /
+// Authentication, MFA QR + No-QR, PSI Registration + PN-Registration, MSA NGC + SA
+// Registration). Produces ONE aggregate row per version over the whole window — the
+// headline number the release report compares (first_target vs second_target).
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8
+//
+// Placeholders:
+//   <SCENARIO_MV>  one of the scenario materialized views (see queries/README.md catalog).
+//                  For "No QR" MFA pass: union Entra_MFA_Registration_Manual_Flow_MV_V1, Entra_MFA_Registration_Non_QR_Code_Flow_MV_V1
+//   <FIRST>        rolling-out version (e.g. 6.2606.3817)
+//   <SECOND>       previous/baseline version (e.g. 6.2605.3042)
+//   <START><END>   datetime bounds (yyyy-mm-dd)
+//   <DCOUNT>       true -> distinct-device counts (…DCount columns)
+//                  false -> raw event counts (default the dashboard uses for success rate)
+//
+// NOTE: registration MVs expose exactly Initiated / Succeeded / Failed (+ …DCount twins)
+// and TotalUniqueDevices — there is no Cancelled / PartiallySucceeded column, so
+// SuccessRate = Succeeded / Initiated and any shortfall is bucketed as "Unknown"
+// (notification shown but no terminal result in-window). See queries/README.md catalog.
+let dc = <DCOUNT>;
+<SCENARIO_MV>
+| where EventDate >= startofday(datetime(<START>)) and EventDate <= endofday(datetime(<END>))
+| where AppVersion in ("<FIRST>", "<SECOND>")
+| summarize
+    Initiated = iff(dc, sum(InitiatedDCount), sum(Initiated)),
+    Succeeded = iff(dc, sum(SucceededDCount), sum(Succeeded)),
+    Failed = iff(dc, sum(FailedDCount), sum(Failed))
+    by AppVersion
+| extend Unknown = case(Initiated > (Succeeded + Failed), Initiated - (Succeeded + Failed), 0)
+| extend
+    SuccessRate = round(case(Initiated > 0, todouble(Succeeded) / todouble(Initiated) * 100, 0.0), 2),
+    FailureRate = round(case(Initiated > 0, todouble(Failed) / todouble(Initiated) * 100, 0.0), 2),
+    UnknownRate = round(case(Initiated > 0, todouble(Unknown) / todouble(Initiated) * 100, 0.0), 2)
+| project AppVersion, Initiated, Succeeded, Failed, SuccessRate, FailureRate, UnknownRate
+| order by AppVersion desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/auth-stats.kql b/.github/skills/release-monitoring-report/assets/queries/auth-stats.kql
new file mode 100644
index 00000000..31ce3545
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/auth-stats.kql
@@ -0,0 +1,43 @@
+// Authenticator — release adoption + fleet stats.
+// Mirrors the dashboard "Stats (Non MV Based)" page. All run over raw `union *` events
+// (columns: EventInfo_Time, DeviceInfo_Id, AppInfo_Version, DeviceInfo_Make,
+// DeviceInfo_OsVersion, PipelineInfo_ClientCountry). Use to gauge how far the
+// rolling-out release has been picked up and on which devices/OS/geo.
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8
+//
+// Placeholders: <FIRST> (rolling-out version), <START>, <END>.
+
+// --- (a) Total devices on the rolling-out release over the window ------------------
+union *
+| where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+| where AppInfo_Version == "<FIRST>"
+| summarize TotalDevices = dcount(DeviceInfo_Id)
+
+// --- (b) Rolling-out release adoption over time (daily device dcount) --------------
+// union *
+// | where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+// | where AppInfo_Version == "<FIRST>"
+// | summarize Devices = dcount(DeviceInfo_Id) by bin(EventInfo_Time, 1d)
+// | order by EventInfo_Time asc
+
+// --- (c) Daily active users, all versions (context denominator) --------------------
+// union *
+// | where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+// | summarize DAU = dcount(DeviceInfo_Id) by bin(EventInfo_Time, 1d)
+// | order by EventInfo_Time asc
+
+// --- (d) Top app-version share (which versions hold the fleet right now) ------------
+// union *
+// | where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+// | summarize Devices = dcount(DeviceInfo_Id) by AppVersion = AppInfo_Version
+// | top 15 by Devices
+
+// --- (e) OS slice for the rolling-out release (skew detection); group by DeviceInfo_Make or PipelineInfo_ClientCountry for OEM / Country ---
+// union *
+// | where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+// | where AppInfo_Version == "<FIRST>"
+// | extend OsLevel = tostring(split(DeviceInfo_OsVersion, " ")[0])
+// | summarize Devices = dcount(DeviceInfo_Id) by OsLevel
+// | top 10 by Devices
diff --git a/.github/skills/release-monitoring-report/assets/queries/auth-version-resolve.kql b/.github/skills/release-monitoring-report/assets/queries/auth-version-resolve.kql
new file mode 100644
index 00000000..499b4bcd
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/auth-version-resolve.kql
@@ -0,0 +1,22 @@
+// Authenticator — resolve candidate release versions (newest train first).
+// Use this to AUTO-DETECT the rolling-out version + previous release when the user
+// did not supply explicit versions. Derived from the dashboard's proven
+// "Top 15 App Version User DCount" tile (union * over AppInfo_Version), so the
+// columns are guaranteed to exist.
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8 (Authenticator OTEL)
+//
+// Authenticator versions are "6.YYMM.BUILD". Newest YYMM = current train; prior YYMM
+// = previous release. A staged rollout / hotfix can put two BUILDs in the same YYMM —
+// keep both. Pick the two newest YYMM groups for first_target / second_target.
+union *
+| where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+| where isnotempty(AppInfo_Version)
+| extend parts = split(AppInfo_Version, ".")
+| where array_length(parts) == 3 and isnotnull(toint(parts[1]))
+| summarize Devices = dcount(DeviceInfo_Id) by AppInfo_Version,
+    yymm = toint(parts[1]), build = toint(parts[2])
+| where Devices > 100 // drop dev/test/noise builds
+| order by yymm desc, build desc
+| take 12
diff --git a/.github/skills/release-monitoring-report/assets/queries/authenticator-crash-denominator.kql b/.github/skills/release-monitoring-report/assets/queries/authenticator-crash-denominator.kql
new file mode 100644
index 00000000..5b078805
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/authenticator-crash-denominator.kql
@@ -0,0 +1,22 @@
+// Authenticator — CRASH-RATE DENOMINATOR. Distinct active devices for the rolling-out
+// version and the baseline, in one shot, so a crashes-per-1k-active-devices rate can be
+// computed against the App Center crash NUMERATOR (fetch-appcenter-crashes.js).
+//
+// crashesPer1k = 1000.0 * appCenterCrashCount(version) / ActiveDevices(version)
+//
+// Cluster:  https://idsharedeus2.eastus2.kusto.windows.net
+// Database: d496be22d62a46b0a3cf67ea2e736fd8
+//
+// Placeholders: <FIRST> (rolling-out AppVersion), <SECOND> (previous/baseline), <START>, <END>.
+//
+// CAVEAT — the two populations are NOT identical: App Center counts devices whose App Center
+// SDK reported a crash; Kusto counts devices emitting product telemetry. The ratio is a
+// useful directional rate, not an exact crash-free percentage. Prefer the within-version
+// crash-SHARE delta (already cohort-normalized) as the headline, and use this rate only to
+// sanity-check that a share move is not just a cohort-size artifact. App Center's own
+// crash-free metric is RETIRED, which is why the denominator must come from here.
+union *
+| where EventInfo_Time >= startofday(datetime(<START>)) and EventInfo_Time <= endofday(datetime(<END>))
+| where AppInfo_Version in ("<FIRST>", "<SECOND>")
+| summarize ActiveDevices = dcount(DeviceInfo_Id) by AppVersion = AppInfo_Version
+| order by ActiveDevices desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-adoption.kql b/.github/skills/release-monitoring-report/assets/queries/broker-adoption.kql
new file mode 100644
index 00000000..d9a5da71
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/broker-adoption.kql
@@ -0,0 +1,15 @@
+// Broker — ADOPTION / version resolution. Distinct devices per broker_version over the
+// window. Doubles as the Broker version-resolution query: run it first to discover the
+// exact broker_version strings and pick the rolling-out version + previous version by
+// device volume. Also feeds the Broker adoption section of the report.
+//
+// Cluster:  https://idsharedeus2.kusto.windows.net
+// Database: ad-accounts-android-otel
+//
+// Placeholders: <START>, <END>.
+// HLL gotcha: dcount_hll(hll_merge(countDevicesHll)) — never sum countDevices.
+materialized_view('BrokerAdoptionStatsUpdated')
+| where EventInfo_Time between (datetime(<START>) .. datetime(<END>))
+| summarize Devices = dcount_hll(hll_merge(countDevicesHll)) by broker_version
+| where Devices > 50
+| order by Devices desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-by-host-app.kql b/.github/skills/release-monitoring-report/assets/queries/broker-by-host-app.kql
new file mode 100644
index 00000000..1fcdfaaa
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/broker-by-host-app.kql
@@ -0,0 +1,56 @@
+// Broker — HEADLINE health for ONE host app, compared by that app's version.
+// Answers "is THIS app's rollout moving the broker?". The Broker runs inside a host
+// app (Authenticator / Company Portal / Link to Windows); the canonical Broker section
+// pools every host and keys on broker_version, which can be dominated by a host you are
+// NOT shipping. Here we pin active_broker_package_name to one host and compare by
+// AppInfo_Version — which, for a given host package, IS that host app's version
+// (e.g. for com.azure.authenticator, AppInfo_Version == the Authenticator AppVersion).
+//
+// Cluster:  https://idsharedeus2.kusto.windows.net
+// Database: ad-accounts-android-otel
+//
+// Placeholders:
+//   <PACKAGE>      active broker package, e.g. "com.azure.authenticator"
+//                  (others: "com.microsoft.windowsintune.companyportal", "com.microsoft.appmanager").
+//   <FIRST>        rolling-out host-app version (AppInfo_Version), e.g. "6.2606.3817".
+//   <SECOND>       previous/baseline host-app version, e.g. "6.2605.3042".
+//   <START><END>   datetime bounds (yyyy-mm-dd).
+//
+// Every MV queried below is filtered on active_broker_package_name + AppInfo_Version and read via countDevicesHll.
+// HLL gotcha: merge the sketch, then dcount_hll — NEVER sum countDevices.
+let pkg = "<PACKAGE>";
+let vers = dynamic(["<FIRST>", "<SECOND>"]);
+let s = datetime(<START>); let e = datetime(<END>);
+let total = materialized_view('BrokerAdoptionStatsUpdated')
+    | where EventInfo_Time between (s .. e) | where active_broker_package_name == pkg | where AppInfo_Version in (vers)
+    | summarize totalDev = dcount_hll(hll_merge(countDevicesHll)) by AppInfo_Version;
+let errored = materialized_view('ErrorStatsMetrics')
+    | where EventInfo_Time between (s .. e) | where active_broker_package_name == pkg | where AppInfo_Version in (vers)
+    | where isnotempty(error_code) and tolower(error_code) != "success"
+    | summarize errDev = dcount_hll(hll_merge(countDevicesHll)) by AppInfo_Version;
+let s_all = materialized_view('SilentAuthStatsAllRequestsMetrics')
+    | where EventInfo_Time between (s .. e) | where active_broker_package_name == pkg | where AppInfo_Version in (vers)
+    | summarize sAllReq = sum(countRequests), sAllDev = dcount_hll(hll_merge(countDevicesHll)) by AppInfo_Version;
+let s_ok = materialized_view('SilentAuthStatsRequestsWithoutExpectedErrorMetrics')
+    | where EventInfo_Time between (s .. e) | where active_broker_package_name == pkg | where AppInfo_Version in (vers)
+    | summarize sOkReq = sum(countRequests), sOkDev = dcount_hll(hll_merge(countDevicesHll)) by AppInfo_Version;
+let i_all = materialized_view('InteractiveAuthStatsAllRequestsMetrics')
+    | where EventInfo_Time between (s .. e) | where active_broker_package_name == pkg | where AppInfo_Version in (vers)
+    | summarize iAllReq = sum(countRequests), iAllDev = dcount_hll(hll_merge(countDevicesHll)) by AppInfo_Version;
+let i_ok = materialized_view('InteractiveAuthStatsRequestsWithoutExpectedErrorMetrics')
+    | where EventInfo_Time between (s .. e) | where active_broker_package_name == pkg | where AppInfo_Version in (vers)
+    | summarize iOkReq = sum(countRequests), iOkDev = dcount_hll(hll_merge(countDevicesHll)) by AppInfo_Version;
+total
+| join kind=leftouter errored on AppInfo_Version
+| join kind=leftouter s_all on AppInfo_Version | join kind=leftouter s_ok on AppInfo_Version
+| join kind=leftouter i_all on AppInfo_Version | join kind=leftouter i_ok on AppInfo_Version
+| extend errDev = coalesce(errDev, 0)
+| project AppInfo_Version,
+    TotalDevices = totalDev,
+    DevicesWithError = errDev,
+    DeviceErrorRate = round(100.0 * errDev / totalDev, 3),
+    SilentReqReliability = round(100.0 * sOkReq / sAllReq, 3),
+    SilentDevReliability = round(100.0 * sOkDev / sAllDev, 3),
+    InteractiveReqReliability = round(100.0 * iOkReq / iAllReq, 3),
+    InteractiveDevReliability = round(100.0 * iOkDev / iAllDev, 3)
+| order by AppInfo_Version desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-error-rate-by-version.kql b/.github/skills/release-monitoring-report/assets/queries/broker-error-rate-by-version.kql
new file mode 100644
index 00000000..91a56bd3
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/broker-error-rate-by-version.kql
@@ -0,0 +1,27 @@
+// Broker — HEADLINE device error rate per version. For each version: total distinct
+// devices (from BrokerAdoptionStatsUpdated) and distinct devices that hit ANY non-success
+// error (from ErrorStatsMetrics), giving an overall device error rate. This is the single
+// "is the new release worse?" headline number for the Broker section.
+//
+// Cluster:  https://idsharedeus2.kusto.windows.net
+// Database: ad-accounts-android-otel
+//
+// Placeholders: <VERSIONS> bare array literal, e.g. ['16.1.0','16.0.1'] (wrapped in dynamic(...) below);
+//               <START>, <END>.
+// Both MVs expose broker_version + countDevicesHll. Merge HLL; never sum countDevices.
+let vers = dynamic(<VERSIONS>);
+let total = materialized_view('BrokerAdoptionStatsUpdated')
+    | where EventInfo_Time between (datetime(<START>) .. datetime(<END>))
+    | where broker_version in (vers)
+    | summarize totalDev = dcount_hll(hll_merge(countDevicesHll)) by broker_version;
+let errored = materialized_view('ErrorStatsMetrics')
+    | where EventInfo_Time between (datetime(<START>) .. datetime(<END>))
+    | where broker_version in (vers)
+    | where isnotempty(error_code) and tolower(error_code) != "success"
+    | summarize errDev = dcount_hll(hll_merge(countDevicesHll)) by broker_version;
+total
+| join kind=leftouter errored on broker_version
+| extend errDev = coalesce(errDev, 0)
+| extend DeviceErrorRate = round(100.0 * errDev / totalDev, 3)
+| project broker_version, TotalDevices = totalDev, DevicesWithError = errDev, DeviceErrorRate
+| order by broker_version desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-errors-by-host-app-span.kql b/.github/skills/release-monitoring-report/assets/queries/broker-errors-by-host-app-span.kql
new file mode 100644
index 00000000..41abf0db
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/broker-errors-by-host-app-span.kql
@@ -0,0 +1,48 @@
+// Broker — PER-SPAN request-rate for specific error codes, one host app, two host-app versions.
+// The COMPLEMENT to broker-top-errors-by-host-app.kql. That query measures DEVICE-SHARE
+// (fraction of all auth-hosted-broker devices that hit code X *anywhere*, deduped across spans).
+// Device-share MASKS a per-span request-rate rise: invalid_grant can climb inside
+// AcquireTokenSilent / ATISilently while overall device-share falls, because the device-dedup +
+// the broad denominator (dominated by the low-invalid_grant silent path) drowns it out, and an
+// early-rollout cohort is "healthier" on the dominant path. When a code looks flat/down in the
+// device-share table but you suspect a span-local spike, run THIS to slice by span_name.
+//
+// Cluster:  https://idsharedeus2.kusto.windows.net
+// Database: ad-accounts-android-otel
+//
+// Placeholders:
+//   <PACKAGE>      active broker package, e.g. "com.azure.authenticator".
+//   <FIRST>        rolling-out host-app version (AppInfo_Version), e.g. "6.2606.3817".
+//   <SECOND>       previous/baseline host-app version, e.g. "6.2605.3042".
+//   <CODES>        lower-cased error_code list to drill into, e.g. "invalid_grant","interaction_required".
+//   <START><END>   datetime bounds (yyyy-mm-dd).
+//
+// Rate = errored requests for the code in span S / TOTAL requests in span S, per version
+// (request-level, NOT device-level). ErrorStatsMetrics carries BOTH success and failure rows
+// (error_code == "success" exists), so the per-span denominator is the full request volume.
+// countOverall is additive — sum() is safe. deltaPp = firstRate - secondRate (lower is better).
+// The (firstReq+secondReq) > 1000 floor drops tiny, noisy span/code cells.
+let pkg = "<PACKAGE>";
+let firstVer = "<FIRST>"; let secondVer = "<SECOND>";
+let win = materialized_view('ErrorStatsMetrics')
+    | where EventInfo_Time between (datetime(<START>) .. datetime(<END>))
+    | where active_broker_package_name == pkg
+    | where AppInfo_Version in (firstVer, secondVer)
+    | where isnotempty(span_name);
+let totals = win | summarize totReq = sum(countOverall) by AppInfo_Version, span_name;
+win
+| where tolower(error_code) in (<CODES>)
+| summarize codeReq = sum(countOverall) by AppInfo_Version, span_name, error_code
+| join kind=inner totals on AppInfo_Version, span_name
+| extend ratePct = 100.0 * codeReq / totReq
+| summarize firstRate = sumif(ratePct, AppInfo_Version == firstVer),
+    secondRate = sumif(ratePct, AppInfo_Version == secondVer),
+    firstReq = sumif(codeReq, AppInfo_Version == firstVer),
+    secondReq = sumif(codeReq, AppInfo_Version == secondVer)
+    by error_code, span_name
+| extend deltaPp = round(firstRate - secondRate, 4)
+| where (firstReq + secondReq) > 1000
+| project error_code, span_name,
+    secondRate = round(secondRate, 4), firstRate = round(firstRate, 4),
+    deltaPp, firstReq, secondReq
+| order by error_code asc, deltaPp desc
diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-latency-by-version.kql b/.github/skills/release-monitoring-report/assets/queries/broker-latency-by-version.kql
new file mode 100644
index 00000000..14ec0575
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/queries/broker-latency-by-version.kql
@@ -0,0 +1,22 @@
+// Broker — LATENCY percentiles per version. P50/P75/P90/P95/P99 of responseTime for the
+// rolling-out version vs the previous version. Optionally narrow to a single span_name
+// (e.g. an AcquireToken span) to compare a hot path; leave the span filter line out to
+// compare overall broker latency.
+//
+// Cluster:  https://idsharedeus2.kusto.windows.net
+// Database: ad-accounts-android-otel
+//
+// Placeholders: <VERSIONS> bare array literal, e.g. ['16.1.0','16.0.1'] (wrapped in dynamic(...) below);
+//               <START>, <END>. Optional: replace <SPAN_FILTER> or delete that line.
+// NEVER sum percentiles — always tdigest_merge then percentiles_array_tdigest.
+let vers = dynamic(<VERSIONS>); +materialized_view('PerfStatsUpdated') +| where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) +| where broker_version in (vers) +// | where span_name == "<SPAN_FILTER>" // optional: focus a single span/hot path +| summarize LatencyValues = percentiles_array_tdigest(tdigest_merge(responseTimeTDigest), 50, 75, 90, 95, 99) + by broker_version +| extend LatencyLabels = dynamic(['P50','P75','P90','P95','P99']) +| mv-expand LatencyLabels to typeof(string), LatencyMs = LatencyValues to typeof(long) +| project broker_version, Percentile = LatencyLabels, LatencyMs +| order by broker_version desc, Percentile asc diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-reliability-by-version.kql b/.github/skills/release-monitoring-report/assets/queries/broker-reliability-by-version.kql new file mode 100644 index 00000000..cce25994 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/queries/broker-reliability-by-version.kql @@ -0,0 +1,44 @@ +// Broker — per-version RELIABILITY (silent + interactive, request + device). +// Computes reliability for the rolling-out version vs the previous/baseline version +// directly from the canonical Metrics MVs (all expose broker_version, countRequests, +// countDevicesHll). Reliability = without-expected-error / all-requests. +// +// Cluster: https://idsharedeus2.kusto.windows.net +// Database: ad-accounts-android-otel +// +// Placeholders: +// <VERSIONS> bare array literal of the versions to compare, e.g. ['16.1.0','16.0.1'] +// (the query wraps it in dynamic()). Use the FULL broker_version strings (resolve via broker-adoption.kql first). +// <START><END> datetime bounds (yyyy-mm-dd). +// +// HLL gotcha: NEVER sum countDevicesHll across rows. Merge the sketch, then dcount_hll. 
+// +// ---- Silent auth ----------------------------------------------------------------- +let vers = dynamic(<VERSIONS>); +let s_all = materialized_view('SilentAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) + | where broker_version in (vers) + | summarize allReq = sum(countRequests), allDev = dcount_hll(hll_merge(countDevicesHll)) by broker_version; +let s_ok = materialized_view('SilentAuthStatsRequestsWithoutExpectedErrorMetrics') + | where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) + | where broker_version in (vers) + | summarize okReq = sum(countRequests), okDev = dcount_hll(hll_merge(countDevicesHll)) by broker_version; +let i_all = materialized_view('InteractiveAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) + | where broker_version in (vers) + | summarize iAllReq = sum(countRequests), iAllDev = dcount_hll(hll_merge(countDevicesHll)) by broker_version; +let i_ok = materialized_view('InteractiveAuthStatsRequestsWithoutExpectedErrorMetrics') + | where EventInfo_Time between (datetime(<START>) .. 
datetime(<END>)) + | where broker_version in (vers) + | summarize iOkReq = sum(countRequests), iOkDev = dcount_hll(hll_merge(countDevicesHll)) by broker_version; +s_all +| join kind=leftouter s_ok on broker_version +| join kind=leftouter i_all on broker_version +| join kind=leftouter i_ok on broker_version +| project broker_version, + SilentReqReliability = round(100.0 * okReq / allReq, 3), + SilentDevReliability = round(100.0 * okDev / allDev, 3), + InteractiveReqReliability = round(100.0 * iOkReq / iAllReq, 3), + InteractiveDevReliability = round(100.0 * iOkDev / iAllDev, 3), + SilentRequests = allReq, SilentDevices = allDev +| order by broker_version desc diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-host-app.kql b/.github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-host-app.kql new file mode 100644 index 00000000..6f768e3e --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-host-app.kql @@ -0,0 +1,42 @@ +// Broker — TOP ERROR CODES for ONE host app, comparing two host-app versions. The "why" +// for broker-by-host-app.kql: is the new app release adding broker errors, and which ones? +// Same shape as broker-top-errors-by-version.kql but scoped to active_broker_package_name +// and keyed on AppInfo_Version (the host app's version) instead of broker_version. +// +// Cluster: https://idsharedeus2.kusto.windows.net +// Database: ad-accounts-android-otel +// +// Placeholders: +// <PACKAGE> active broker package, e.g. "com.azure.authenticator". +// <FIRST> rolling-out host-app version (AppInfo_Version), e.g. "6.2606.3817". +// <SECOND> previous/baseline host-app version, e.g. "6.2605.3042". +// <START><END> datetime bounds (yyyy-mm-dd). +// +// Device share = devicesHittingError / totalDevicesOnThatAppVersion, so a code that grew +// only because the version has more devices does NOT show up as a regression. 
+// Feed to compare-versions.js movers --lower-is-better true (share growth is bad). +// Output is ordered by firstShare desc to read the dominant codes; re-sort by shareDeltaPp desc for risers. +// HLL gotcha: merge sketches, never sum countDevicesHll. +let pkg = "<PACKAGE>"; +let firstVer = "<FIRST>"; let secondVer = "<SECOND>"; +let win = materialized_view('ErrorStatsMetrics') + | where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) + | where active_broker_package_name == pkg + | where AppInfo_Version in (firstVer, secondVer); +let firstTotalDev = toscalar(win | where AppInfo_Version == firstVer | summarize dcount_hll(hll_merge(countDevicesHll))); +let secondTotalDev = toscalar(win | where AppInfo_Version == secondVer | summarize dcount_hll(hll_merge(countDevicesHll))); +win +| where isnotempty(error_code) and tolower(error_code) != "success" +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), reqs = sum(countOverall) + by AppInfo_Version, error_code +| summarize firstDevs = sumif(devs, AppInfo_Version == firstVer), + secondDevs = sumif(devs, AppInfo_Version == secondVer), + firstReqs = sumif(reqs, AppInfo_Version == firstVer), + secondReqs = sumif(reqs, AppInfo_Version == secondVer) + by error_code +| extend firstShare = round(100.0 * firstDevs / firstTotalDev, 4), + secondShare = round(100.0 * secondDevs / secondTotalDev, 4) +| extend shareDeltaPp = round(firstShare - secondShare, 4) +| project error_code, firstDevs, secondDevs, firstReqs, secondReqs, + firstShare, secondShare, shareDeltaPp +| order by firstShare desc diff --git a/.github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-version.kql b/.github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-version.kql new file mode 100644 index 00000000..f666b9cb --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/queries/broker-top-errors-by-version.kql @@ -0,0 +1,36 @@ +// Broker — TOP ERROR CODES per version, with cross-version delta. 
The core "why is +// this release different?" query. One row per error_code with explicit first/second +// columns (device + request counts and device-share %), plus the device-share delta in +// percentage points. Feeds compare-versions.js and the Broker "Top regressions / +// improvements" report section. +// +// Cluster: https://idsharedeus2.kusto.windows.net +// Database: ad-accounts-android-otel +// +// Placeholders: <FIRST> (rolling-out broker_version), <SECOND> (previous), <START>, <END>. +// +// Device share = devicesHittingError / totalDevicesOnThatVersion, so a code that grew +// only because the version has more devices does NOT show up as a regression. +// HLL gotcha: merge sketches, never sum countDevices. +let firstVer = "<FIRST>"; +let secondVer = "<SECOND>"; +let win = materialized_view('ErrorStatsMetrics') + | where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) + | where broker_version in (firstVer, secondVer); +let firstTotalDev = toscalar(win | where broker_version == firstVer | summarize dcount_hll(hll_merge(countDevicesHll))); +let secondTotalDev = toscalar(win | where broker_version == secondVer | summarize dcount_hll(hll_merge(countDevicesHll))); +win +| where isnotempty(error_code) and tolower(error_code) != "success" +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), reqs = sum(countOverall) + by broker_version, error_code +| summarize firstDevs = sumif(devs, broker_version == firstVer), + secondDevs = sumif(devs, broker_version == secondVer), + firstReqs = sumif(reqs, broker_version == firstVer), + secondReqs = sumif(reqs, broker_version == secondVer) + by error_code +| extend firstShare = round(100.0 * firstDevs / firstTotalDev, 4), + secondShare = round(100.0 * secondDevs / secondTotalDev, 4) +| extend shareDeltaPp = round(firstShare - secondShare, 4) +| project error_code, firstDevs, secondDevs, firstReqs, secondReqs, + firstShare, secondShare, shareDeltaPp +| order by shareDeltaPp desc \ No newline at end of 
file diff --git a/.github/skills/release-monitoring-report/assets/scripts/bootstrap-report.ps1 b/.github/skills/release-monitoring-report/assets/scripts/bootstrap-report.ps1 new file mode 100644 index 00000000..b9641758 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/scripts/bootstrap-report.ps1 @@ -0,0 +1,140 @@ +<# +.SYNOPSIS + Bootstrap a new release-monitoring report file from the canonical template. + +.DESCRIPTION + Implements SKILL.md Step 1 as a script so the workflow doesn't drift: + 1. Builds the output filename from the version(s) under test: + release-report-broker-<bv>-auth-<av>-<yyyy-MM-dd>.html + Omitted apps are dropped from the name (broker-only or auth-only runs are fine). + 2. Creates ~/android-release-reports/_data/<stamp>/ for raw query payloads, where + <stamp> is the version-pair slug (so two different releases on the same day get + separate data folders). + 3. Collision rule: if the target file already exists and is an UNFILLED template stub + (its fingerprint markers still match the canonical template AND size within 5%), + re-bootstrap silently. Otherwise HALT — a populated report must be explicitly + deleted/renamed or regenerated with -Force. + 4. Prunes _data/* folders older than -DataRetentionDays (default 60). + 5. Stamps today's date into the "Generated <strong>...</strong>" banner. + + At least one of -BrokerVersion / -AuthVersion is REQUIRED. + +.PARAMETER BrokerVersion + Broker version rolling out (e.g. 16.1.0). Optional if -AuthVersion is given. + +.PARAMETER AuthVersion + Authenticator version rolling out (e.g. 6.2606.3817). Optional if -BrokerVersion is given. + +.PARAMETER Force + Skip the collision check and overwrite any existing file. + +.PARAMETER DataRetentionDays + How many days of _data/* folders to keep before pruning. Default 60. + +.PARAMETER SkillRoot + Path to the skill's assets folder. Defaults to two levels up from this script. 
+ +.EXAMPLE + .\bootstrap-report.ps1 -BrokerVersion 16.1.0 -AuthVersion 6.2606.3817 + +.EXAMPLE + .\bootstrap-report.ps1 -BrokerVersion 16.1.0 -Force # broker-only + +.OUTPUTS + Prints the absolute path of the newly created report file (last line). +#> +[CmdletBinding()] +param( + [string]$BrokerVersion, + [string]$AuthVersion, + [switch]$Force, + [int]$DataRetentionDays = 60, + [string]$SkillRoot +) +$ErrorActionPreference = 'Stop' + +if (-not $BrokerVersion -and -not $AuthVersion) { + throw "Provide at least one of -BrokerVersion / -AuthVersion." +} + +# Locate the skill's assets folder + canonical template +if (-not $SkillRoot) { + # This script lives at <skill>/assets/scripts/bootstrap-report.ps1 -> go up 2 to <skill>/assets + $SkillRoot = Split-Path -Parent (Split-Path -Parent $PSCommandPath) +} +$template = Join-Path $SkillRoot 'templates\report-template.html' +if (-not (Test-Path $template)) { + throw "Canonical template not found at $template. Pass -SkillRoot if running outside the skill folder." +} + +# Build filename + data-folder slug +$today = (Get-Date).ToString('yyyy-MM-dd') +$parts = @() +$slugParts = @() +if ($BrokerVersion) { $parts += "broker-$BrokerVersion"; $slugParts += "b$BrokerVersion" } +if ($AuthVersion) { $parts += "auth-$AuthVersion"; $slugParts += "a$AuthVersion" } +$nameCore = ($parts -join '-') +$slug = (($slugParts -join '-') -replace '[^0-9A-Za-z\.\-]', '_') + +$reportDir = Join-Path $env:USERPROFILE 'android-release-reports' +$dataDir = Join-Path $reportDir "_data\$slug-$today" +$out = Join-Path $reportDir "release-report-$nameCore-$today.html" +New-Item -ItemType Directory -Force $reportDir | Out-Null +New-Item -ItemType Directory -Force $dataDir | Out-Null + +# Fingerprint markers to detect an unfilled stub +$templateText = [IO.File]::ReadAllText($template) +function Get-FingerprintMarkers([string]$text) { + $m = @{} + if ($text -match '<title>([^<]+?)</title>') { $m['title'] = $Matches[1].Trim() } + if ($text -match '
\s*([^<]+)') { $m['metaVer'] = $Matches[1].Trim() } + if ($text -match '
\s*
[^<]+
\s*
([^<]+?)
') { $m['firstKpi'] = $Matches[1].Trim() } + return $m +} +$templateMarkers = Get-FingerprintMarkers $templateText + +if ((Test-Path $out) -and -not $Force) { + $existingText = [IO.File]::ReadAllText($out) + $existingMarkers = Get-FingerprintMarkers $existingText + $allMatch = $true + foreach ($k in $templateMarkers.Keys) { + if ($existingMarkers[$k] -ne $templateMarkers[$k]) { $allMatch = $false; break } + } + $sizeRatio = (Get-Item $out).Length / [Math]::Max(1, (Get-Item $template).Length) + $isUnfilledStub = $allMatch -and ($sizeRatio -ge 0.95) -and ($sizeRatio -le 1.05) + if ($isUnfilledStub) { + Write-Warning "Existing $out is an unfilled template stub. Re-bootstrapping silently." + } else { + Write-Error -ErrorAction Continue @" +A populated report already exists at: + $out +Per the filename-collision rule, do NOT silently overwrite. Either: + 1. Open it, confirm what changed vs the new data, then re-run with -Force. + 2. Rename / delete it and re-run. +"@ + exit 2 + } +} + +Copy-Item $template $out -Force +Write-Host "Bootstrapped $out" +Write-Host "Data folder: $dataDir" + +# Stamp the actual run date (UTF8-no-BOM to preserve emoji/arrows) +$outText = [IO.File]::ReadAllText($out) +$outText = [regex]::Replace($outText, 'Generated\s+<strong>[^<]*</strong>', "Generated <strong>$today</strong>") +[IO.File]::WriteAllText($out, $outText, [System.Text.UTF8Encoding]::new($false)) +Write-Host "Stamped Generated date: $today" + +# Prune old _data folders +$dataRoot = Join-Path $reportDir '_data' +if (Test-Path $dataRoot) { + $cutoff = (Get-Date).AddDays(-$DataRetentionDays) + $old = Get-ChildItem $dataRoot -Directory | Where-Object { $_.FullName -ne $dataDir -and $_.LastWriteTime -lt $cutoff } + if ($old) { + Write-Host "Pruning $($old.Count) _data folder(s) older than $DataRetentionDays days." 
+ $old | ForEach-Object { Remove-Item -Recurse -Force $_.FullName } + } +} + +Write-Output $out diff --git a/.github/skills/release-monitoring-report/assets/scripts/compare-versions.js b/.github/skills/release-monitoring-report/assets/scripts/compare-versions.js new file mode 100644 index 00000000..1e69f5ed --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/scripts/compare-versions.js @@ -0,0 +1,134 @@ +#!/usr/bin/env node +/* + * compare-versions.js — release delta + classification engine. + * + * Reads the array-form JSON that run-kql.ps1 emits: + * { "results": { "items": [ [col0,col1,...], [row...], ... ] } } // row 0 = column names + * (Also tolerates the columns/rows object form just in case.) + * + * Two modes: + * + * 1) rows — one row PER VERSION, metrics in columns. Computes first−second delta for + * each metric and classifies regression / improvement / flat with a pp + * threshold and a volume guard. + * node compare-versions.js rows --file r.json \ + * --version-col broker_version --first 16.1.0 --second 16.0.1 \ + * --metrics SilentDevReliability,InteractiveDevReliability \ + * --lower-is-better DeviceErrorRate \ + * --volume-col SilentDevices --volume-floor 1000 --threshold 1.0 + * + * 2) movers — rows ALREADY paired (one row per error_code/scenario with first/second + * columns). Ranks by the delta column and tags direction. + * node compare-versions.js movers --file e.json \ + * --key-col error_code --first-col firstShare --second-col secondShare \ + * --delta-col shareDeltaPp --top 10 --threshold 0.5 + * + * Output: JSON to stdout — a structured verdict array the report author pastes/reasons over. + * Higher-is-better by default; pass --lower-is-better for latency/error-rate metrics. + */ + +function parseArgs(argv) { + const a = { _: [] }; + for (let i = 0; i < argv.length; i++) { + const t = argv[i]; + if (t.startsWith('--')) { const k = t.slice(2); const v = (argv[i+1] && !argv[i+1].startsWith('--')) ? 
argv[++i] : true; a[k] = v; } + else a._.push(t); + } + return a; +} + +function loadItems(file) { + const raw = JSON.parse(require('fs').readFileSync(file, 'utf8')); + let items = raw && raw.results && raw.results.items; + if (!items) throw new Error('No results.items in ' + file); + // object-form fallback: items = { columns:[{ColumnName}], rows:[[...]] } + if (!Array.isArray(items)) { + const cols = (items.columns || []).map(c => c.ColumnName || c.name || c); + items = [cols, ...(items.rows || [])]; + } + const cols = items[0]; + const rows = items.slice(1); + return { cols, rows }; +} + +function idx(cols, name, label) { + const i = cols.indexOf(name); + if (i < 0) throw new Error('Column "' + name + '" not found (' + (label||'') + '). Available: ' + cols.join(', ')); + return i; +} + +const num = v => { const n = parseFloat(String(v).replace(/[, %]/g, '')); return Number.isFinite(n) ? n : null; }; + +function classify(deltaPp, threshold, lowerIsBetter, lowVolume) { + if (lowVolume) return 'low-volume'; + if (Math.abs(deltaPp) < threshold) return 'flat'; + const improved = lowerIsBetter ? deltaPp < 0 : deltaPp > 0; + return improved ? 'improvement' : 'regression'; +} + +function rowsMode(a) { + const { cols, rows } = loadItems(a.file); + const vc = idx(cols, a['version-col'], 'version-col'); + const findRow = v => rows.find(r => String(r[vc]) === String(v)); + const r1 = findRow(a.first), r2 = findRow(a.second); + if (!r1) throw new Error('first version "' + a.first + '" not in data'); + if (!r2) throw new Error('second version "' + a.second + '" not in data'); + const metrics = String(a.metrics || '').split(',').map(s => s.trim()).filter(Boolean); + const lower = new Set(String(a['lower-is-better'] || '').split(',').map(s => s.trim()).filter(Boolean)); + const threshold = parseFloat(a.threshold || '1.0'); + const volFloor = a['volume-col'] ? parseFloat(a['volume-floor'] || '0') : null; + const volIdx = a['volume-col'] ? 
idx(cols, a['volume-col'], 'volume-col') : -1; + const firstVol = volIdx >= 0 ? num(r1[volIdx]) : null; + const out = metrics.map(m => { + const mi = idx(cols, m, 'metric'); + const f = num(r1[mi]), s = num(r2[mi]); + const delta = (f != null && s != null) ? +(f - s).toFixed(4) : null; + const lowVol = volFloor != null && firstVol != null && firstVol < volFloor; + return { + metric: m, first: f, second: s, deltaPp: delta, + lowerIsBetter: lower.has(m), + verdict: delta == null ? 'no-data' : classify(delta, threshold, lower.has(m), lowVol) + }; + }); + return { mode: 'rows', first: a.first, second: a.second, firstVolume: firstVol, threshold, metrics: out }; +} + +function moversMode(a) { + const { cols, rows } = loadItems(a.file); + const kc = idx(cols, a['key-col'], 'key-col'); + const dc = idx(cols, a['delta-col'], 'delta-col'); + const fc = a['first-col'] ? idx(cols, a['first-col'], 'first-col') : -1; + const sc = a['second-col'] ? idx(cols, a['second-col'], 'second-col') : -1; + const threshold = parseFloat(a.threshold || '0.5'); + const top = parseInt(a.top || '10', 10); + const lowerIsBetter = a['lower-is-better'] === true || a['lower-is-better'] === 'true'; + const all = rows.map(r => { + const delta = num(r[dc]); + return { + key: r[kc], + first: fc >= 0 ? num(r[fc]) : null, + second: sc >= 0 ? num(r[sc]) : null, + deltaPp: delta, + verdict: delta == null ? 'no-data' : classify(delta, threshold, lowerIsBetter, false) + }; + }).filter(x => x.deltaPp != null); + all.sort((x, y) => Math.abs(y.deltaPp) - Math.abs(x.deltaPp)); + const regressions = all.filter(x => x.verdict === 'regression').slice(0, top); + const improvements = all.filter(x => x.verdict === 'improvement').slice(0, top); + return { mode: 'movers', threshold, top, regressions, improvements }; +} + +function main() { + const a = parseArgs(process.argv.slice(2)); + const mode = a._[0]; + if (!a.file || !mode) { + console.error('usage: compare-versions.js <rows|movers> --file ... 
(see header)'); + process.exit(2); + } + let res; + if (mode === 'rows') res = rowsMode(a); + else if (mode === 'movers') res = moversMode(a); + else { console.error('unknown mode: ' + mode); process.exit(2); } + process.stdout.write(JSON.stringify(res, null, 2) + '\n'); +} +main(); diff --git a/.github/skills/release-monitoring-report/assets/scripts/fetch-appcenter-crashes.js b/.github/skills/release-monitoring-report/assets/scripts/fetch-appcenter-crashes.js new file mode 100644 index 00000000..4b1f2619 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/scripts/fetch-appcenter-crashes.js @@ -0,0 +1,482 @@ +#!/usr/bin/env node +/* + * fetch-appcenter-crashes.js — pull Authenticator crash data from App Center Diagnostics + * and emit the SAME array-form JSON that run-kql.ps1 produces, so compare-versions.js and + * the report-fill flow consume it unchanged: + * { "results": { "items": [ [col0,col1,...], [row...], ... ] } } // row 0 = column names + * + * Why App Center (not Play Console): App Center's Diagnostics/errorGroups API is the only + * source that returns DETAILED crash clusters (exception type, crashing class/method/line, + * per-version counts, device counts) filterable by app version — Play Console exports only + * aggregate numbers, never per-crash detail. App Center *Analytics* (crash_counts / + * crashfree_users / sessions) is RETIRED (410/404 or drained to ~0), so DO NOT use it; a true + * crash-free rate must take its denominator from Kusto telemetry (active devices per AppVersion) + * — see assets/docs/crash-sources.md. Scope is Authenticator ONLY (Broker is not a store app). + * + * Auth: an App Center read-only User API token. Resolution order: + * 1) --token-file 2) $APPCENTER_API_TOKEN 3) ~/.android-release-reports/appcenter.token + * The token is a SECRET — keep it out of the repo and never echo it. + * + * Modes: + * + * groups — top crash clusters for ONE version. 
Each row's crashSharePct is its share of that + * version's total crashes (over all fetched groups), mirroring the Broker error-movers + * "device-share" idea so growth is normalized for cohort size. + * node fetch-appcenter-crashes.js groups --owner authapp-t7qc \ + * --app Microsoft-Authenticator-Android-Prod-App-Center \ + * --version 6.2606.3817 --days 14 --top 15 --out groups-new.json + * + * enrich — for the top-N crash signatures on ONE version, pull what the list view can't show: + * the per-group DAILY TREND (errorCountsPerDay → rising / decaying / spike-then-decay / + * steady — separates an early-rollout spike from a sustained regression, pattern P4) and + * an instance-sampled OS-major + device-model CONCENTRATION (errorGroups/{id}/errors → + * is the crash one OS / one OEM, pattern P6). App Center's aggregate operatingSystemCounts + * / modelCounts / affectedDeviceCounts endpoints 404 for this app, so the per-group daily + * series and a capped instance sample are the only routes to trend + dimensions. + * node fetch-appcenter-crashes.js enrich --owner authapp-t7qc \ + * --app Microsoft-Authenticator-Android-Prod-App-Center \ + * --version 6.2606.3817 --days 14 --top 8 --out crash-enrich.json + * + * diff — pair TWO versions (--version = rolling out, --base = previous) by crash SIGNATURE + * (codeRaw/label, aggregating sub-groups). NOTE: App Center's errorGroupId is + * version-scoped (0 id overlap across versions), so the cross-version join MUST be on + * the signature, not the id. Computes each cluster's crash-share on each version + the + * delta (pp), and tags status (new / regressed / improved / flat / gone). 
Output is + * already paired, so compare-versions.js movers ranks it directly: + * node fetch-appcenter-crashes.js diff --owner authapp-t7qc \ + * --app Microsoft-Authenticator-Android-Prod-App-Center \ + * --version 6.2606.3817 --base 6.2605.3042 --days 14 \ + * --devices-new 12619500 --devices-base 74905873 --out crash-diff.json + * node compare-versions.js movers --file crash-diff.json \ + * --key-col label --first-col newPer1k --second-col basePer1k \ + * --delta-col rateDeltaPer1k --lower-is-better true --top 10 + * + * --devices-new / --devices-base are OPTIONAL Kusto active-device denominators (run + * authenticator-crash-denominator.kql). When supplied, diff computes the honest + * crashes-per-1k-active-devices RATE and derives status/ranking from it instead of from + * crash-share. Without them, status falls back to crash-share deltas (less reliable when + * the two versions' total crash pools differ greatly in size). + * + * newcrashes — "what crashes are GENUINELY NEW in this release?" Anti-joins the new version's + * signatures against the UNION of several recent PRIOR versions (NOT just the immediate + * baseline). This exists because App Center's per-version firstOccurrence is the version's + * ROLLOUT date, NOT the signature's app-history first-seen — so `diff` can mark a crash + * "new" (absent from the single base) when it actually predates that base by months + * (verified: the okhttp http-cache journal IOException shows firstOccurrence = rollout day + * on a young build yet exists on every version back many releases). A signature earns + * "genuinely-new" only when it is absent from ALL listed priors within the 27-day API + * window AND present on the new build. Defaults to --page-cap 0 (exhaust) so a "new" + * verdict can't miss a low-count prior occurrence. 
+ * node fetch-appcenter-crashes.js newcrashes --owner authapp-t7qc \ + * --app Microsoft-Authenticator-Android-Prod-App-Center \ + * --version 6.2606.3817 --priors 6.2605.3042,6.2605.2973,6.2604.2550,6.2603.1485 \ + * --days 14 --min-count 5 --devices-new 34580000 --out new-crashes.json + * + * signature — cross-version presence of ONE crash SIGNATURE (+ optional daily trend on the primary + * version): "is crash X specific to this release, or pre-existing across versions?" Pages + * each version's groups and aggregates every group whose codeRaw / class.method:line / + * exceptionMessage / exceptionType contains --match. Pass --trend to tag the primary + * version's daily series (rising / decaying / spike-then-decay / steady). + * node fetch-appcenter-crashes.js signature --owner authapp-t7qc \ + * --app Microsoft-Authenticator-Android-Prod-App-Center \ + * --version 6.2606.3817 --priors 6.2605.3042,6.2605.2973,6.2604.2550,6.2603.1485 \ + * --match "FileSystem$1.rename" --days 27 --trend --out sig.json + * + * --out is optional; without it the JSON goes to stdout. Totals/diagnostics go to stderr. + */ + +const https = require('https'); +const fs = require('fs'); +const path = require('path'); +const os = require('os'); + +function parseArgs(argv) { + const a = { _: [] }; + for (let i = 0; i < argv.length; i++) { + const t = argv[i]; + if (t.startsWith('--')) { const k = t.slice(2); const v = (argv[i + 1] && !argv[i + 1].startsWith('--')) ? argv[++i] : true; a[k] = v; } + else a._.push(t); + } + return a; +} + +function resolveToken(a) { + if (a['token-file']) return fs.readFileSync(a['token-file'], 'utf8').trim(); + if (process.env.APPCENTER_API_TOKEN) return process.env.APPCENTER_API_TOKEN.trim(); + const def = path.join(os.homedir(), '.android-release-reports', 'appcenter.token'); + if (fs.existsSync(def)) return fs.readFileSync(def, 'utf8').trim(); + throw new Error('No App Center token. 
Pass --token-file, set $APPCENTER_API_TOKEN, or place it at ' + def); +} + +function getJson(url, token) { + return new Promise((resolve, reject) => { + https.get(url, { headers: { 'X-API-Token': token, 'Accept': 'application/json' } }, res => { + let buf = ''; + res.on('data', d => buf += d); + res.on('end', () => { + if (res.statusCode < 200 || res.statusCode >= 300) { + return reject(new Error('HTTP ' + res.statusCode + ' for ' + url + ' :: ' + buf.slice(0, 300))); + } + try { resolve(JSON.parse(buf)); } catch (e) { reject(new Error('Bad JSON from ' + url + ': ' + e.message)); } + }); + }).on('error', reject); + }); +} + +const API = 'https://api.appcenter.ms/v0.1'; + +// Normalize App Center's relative nextLink (it comes back with an extra "/api" prefix that 404s +// against the public host) to an absolute URL. +const absLink = next => next ? (next.startsWith('http') ? next : 'https://api.appcenter.ms' + next.replace(/^\/api\//, '/')) : null; + +// Fetch ALL error groups for a version (follows nextLink) so the crash TOTAL — the denominator for +// crash-share and the numerator for the per-1k rate — is over the full set, not just the first page. +// App Center has NO working aggregate total endpoint for this app (errorCounts / affectedDeviceCounts +// 404; version-level errorCountsPerDay drains to 0), so summing every page is the only accurate total. +// pageCap === 0 (or 'all') ⇒ exhaust (hard safety stop at 100 pages ≈ 10k groups). Groups the team has +// triaged as noise (hidden, or state === "Ignored") are dropped by default so they don't inflate the +// rate — pass includeHidden to keep them. Returns the filtered groups; logs drops + truncation. 
+async function fetchGroups(owner, app, version, startIso, token, pageCap, includeHidden) { + const base = `${API}/apps/${owner}/${app}/errors/errorGroups`; + let url = `${base}?version=${encodeURIComponent(version)}&start=${encodeURIComponent(startIso)}&$top=100&$orderby=${encodeURIComponent('count desc')}`; + const cap = (pageCap === 0 || pageCap === 'all') ? 100 : (pageCap || 12); + const raw = []; + let page = 0; + for (; page < cap && url; page++) { + const r = await getJson(url, token); + for (const g of (r.errorGroups || [])) raw.push(g); + url = absLink(r.nextLink); + } + const groups = includeHidden ? raw : raw.filter(g => g.hidden !== true && g.state !== 'Ignored'); + const dropped = raw.length - groups.length; + if (dropped) process.stderr.write(` (${version}) dropped ${dropped} hidden/ignored group(s) from totals\n`); + if (url) process.stderr.write(` (${version}) WARNING: hit page cap ${cap} with more pages remaining — total is UNDERCOUNTED; pass --page-cap 0 to exhaust\n`); + return groups; +} + +const labelOf = g => g.codeRaw || g.exceptionMethod || g.exceptionType || g.errorGroupId; +// The first-party crash site as class.method:line — the most actionable attribution detail in the +// list view. NOTE App Center's exceptionClassMethod / exceptionAppCode are BOOLEAN flags (not frame +// strings); the frame lives in exceptionClassName + exceptionMethod + exceptionLine. +const appFrameOf = g => { + const cls = (typeof g.exceptionClassName === 'string' && g.exceptionClassName) ? g.exceptionClassName : ''; + const mth = (typeof g.exceptionMethod === 'string' && g.exceptionMethod) ? g.exceptionMethod : ''; + const frame = (cls && mth) ? `${cls}.${mth}` : (cls || mth || (typeof g.codeRaw === 'string' ? g.codeRaw : '')); + return (frame && g.exceptionLine) ? `${frame}:${g.exceptionLine}` : frame; +}; +const pct = (n, d) => d > 0 ? 
+(100 * n / d).toFixed(2) : 0; + +// Aggregate version-scoped errorGroups into ONE entry per crash SIGNATURE (codeRaw/label), since the +// same crash gets a different errorGroupId on every version. Sums count/devices across sub-groups that +// share a crashing frame; keeps the earliest firstOccurrence + latest lastOccurrence + the sub-group ids. +function aggBySig(groups) { + const m = new Map(); + for (const g of groups) { + const key = labelOf(g); + let e = m.get(key); + if (!e) { e = { label: key, exceptionType: g.exceptionType || '', appCodeFrame: appFrameOf(g), exceptionMessage: (g.exceptionMessage || '').slice(0, 160), firstOccurrence: '', lastOccurrence: '', count: 0, devices: 0, ids: [] }; m.set(key, e); } + e.count += g.count || 0; + e.devices += g.deviceCount || 0; // sum is an upper bound (a device can hit >1 sub-group) + if (!e.exceptionType) e.exceptionType = g.exceptionType || ''; + if (!e.appCodeFrame) e.appCodeFrame = appFrameOf(g); + if (!e.exceptionMessage) e.exceptionMessage = (g.exceptionMessage || '').slice(0, 160); + if (g.firstOccurrence && (!e.firstOccurrence || g.firstOccurrence < e.firstOccurrence)) e.firstOccurrence = g.firstOccurrence; + if (g.lastOccurrence && (!e.lastOccurrence || g.lastOccurrence > e.lastOccurrence)) e.lastOccurrence = g.lastOccurrence; + if (g.errorGroupId) e.ids.push(g.errorGroupId); + } + return m; +} + +function startIsoFromArgs(a) { + if (a.start) return a.start; + const days = parseInt(a.days || '14', 10); + return new Date(Date.now() - days * 86400000).toISOString().replace(/\.\d+Z$/, 'Z'); +} + +// --page-cap N (default 12) or 0/"all" to exhaust. Higher = more accurate total (App Center has no +// working aggregate-total endpoint for this app). 
+function pageCapFromArgs(a, dflt) { + const v = a['page-cap']; + if (v === undefined) return dflt; + if (v === 'all' || v === '0' || v === 0) return 0; + return parseInt(v, 10); +} + +function emit(obj, out) { + const json = JSON.stringify(obj); + if (out) { fs.writeFileSync(out, json, 'utf8'); process.stderr.write('Saved -> ' + out + '\n'); } + else process.stdout.write(json + '\n'); +} + +async function groupsMode(a, token) { + const startIso = startIsoFromArgs(a); + const top = parseInt(a.top || '15', 10); + const groups = await fetchGroups(a.owner, a.app, a.version, startIso, token, pageCapFromArgs(a, 12), !!a['include-hidden']); + const total = groups.reduce((s, g) => s + (g.count || 0), 0); + process.stderr.write(`version ${a.version}: ${groups.length} crash groups, ${total} total crashes since ${startIso}\n`); + const cols = ['errorGroupId', 'exceptionType', 'label', 'appCodeFrame', 'exceptionMessage', 'count', 'deviceCount', 'crashSharePct', 'firstOccurrence', 'lastOccurrence', 'appBuild', 'state']; + const rows = groups.slice(0, top).map(g => [ + g.errorGroupId, g.exceptionType || '', labelOf(g), appFrameOf(g), (g.exceptionMessage || '').slice(0, 160), + g.count || 0, g.deviceCount || 0, pct(g.count || 0, total), + g.firstOccurrence || '', g.lastOccurrence || '', g.appBuild || '', g.state || '' + ]); + emit({ meta: { version: a.version, totalCrashes: total, groupCount: groups.length, start: startIso }, results: { items: [cols, ...rows] } }, a.out); +} + +async function diffMode(a, token) { + if (!a.base) throw new Error('diff mode needs --base '); + const startIso = startIsoFromArgs(a); + const top = parseInt(a.top || '20', 10); + const cap = pageCapFromArgs(a, 12); + // OPTIONAL Kusto denominators (active devices per version). 
When supplied, the honest + // crashes-per-1k-active-devices RATE is computed and drives status/ranking — crash-SHARE + // alone is misleading when the two versions' total crash pools differ in size (a signature + // can take a bigger SHARE of a much smaller pool while its per-device rate actually drops). + const devNew = a['devices-new'] ? parseFloat(a['devices-new']) : null; + const devBase = a['devices-base'] ? parseFloat(a['devices-base']) : null; + const haveRate = devNew > 0 && devBase > 0; + const [gNew, gBase] = await Promise.all([ + fetchGroups(a.owner, a.app, a.version, startIso, token, cap, !!a['include-hidden']), + fetchGroups(a.owner, a.app, a.base, startIso, token, cap, !!a['include-hidden']), + ]); + const totNew = gNew.reduce((s, g) => s + (g.count || 0), 0); + const totBase = gBase.reduce((s, g) => s + (g.count || 0), 0); + process.stderr.write(`new ${a.version}: ${totNew} crashes / ${gNew.length} groups ; base ${a.base}: ${totBase} crashes / ${gBase.length} groups\n`); + if (haveRate) { + process.stderr.write(`rate: new ${(1000 * totNew / devNew).toFixed(2)}/1k (${devNew} dev) vs base ${(1000 * totBase / devBase).toFixed(2)}/1k (${devBase} dev)\n`); + } + + // App Center's errorGroupId is VERSION-SCOPED (verified: 0 id overlap across versions, 116 + // codeRaw/label overlap), so join cross-version on the crash SIGNATURE (codeRaw/label), + // aggregating sub-groups that share a crashing frame. 
+ const agg = groups => { + const m = new Map(); + for (const g of groups) { + const key = labelOf(g); + let e = m.get(key); + if (!e) { e = { label: key, exceptionType: g.exceptionType || '', appCodeFrame: '', exceptionMessage: '', firstOccurrence: '', count: 0, devices: 0 }; m.set(key, e); } + e.count += g.count || 0; + e.devices += g.deviceCount || 0; // sum is an upper bound (a device can hit >1 sub-group) + if (!e.exceptionType) e.exceptionType = g.exceptionType || ''; + if (!e.appCodeFrame) e.appCodeFrame = appFrameOf(g); + if (!e.exceptionMessage) e.exceptionMessage = (g.exceptionMessage || '').slice(0, 160); + // earliest first-seen across sub-groups — lets the report tell a genuinely-new signature + // from one that merely fell out of the (capped) base list. + if (g.firstOccurrence && (!e.firstOccurrence || g.firstOccurrence < e.firstOccurrence)) e.firstOccurrence = g.firstOccurrence; + } + return m; + }; + const mNew = agg(gNew), mBase = agg(gBase); + const keys = new Set([...mNew.keys(), ...mBase.keys()]); + const rows = [...keys].map(k => { + const n = mNew.get(k), b = mBase.get(k); + const newCount = n ? n.count : 0, baseCount = b ? b.count : 0; + const newShare = pct(newCount, totNew), baseShare = pct(baseCount, totBase); + const shareDeltaPp = +(newShare - baseShare).toFixed(2); + const newPer1k = haveRate ? +(1000 * newCount / devNew).toFixed(3) : null; + const basePer1k = haveRate ? +(1000 * baseCount / devBase).toFixed(3) : null; + const rateDeltaPer1k = haveRate ? +(newPer1k - basePer1k).toFixed(3) : null; + let status; + if (haveRate) { + // Status from the per-device RATE (honest), not share. + const rel = basePer1k > 0 ? (newPer1k - basePer1k) / basePer1k : (newPer1k > 0 ? 
Infinity : 0); + if (baseCount === 0 && newCount > 0) status = 'new'; + else if (newCount === 0 && baseCount > 0) status = 'gone'; + else if (rel >= 0.15 && rateDeltaPer1k >= 0.02) status = 'regressed'; + else if (rel <= -0.15 && rateDeltaPer1k <= -0.02) status = 'improved'; + else status = 'flat'; + } else { + if (baseCount === 0 && newCount > 0) status = 'new'; + else if (newCount === 0 && baseCount > 0) status = 'gone'; + else if (shareDeltaPp >= 0.5) status = 'regressed'; + else if (shareDeltaPp <= -0.5) status = 'improved'; + else status = 'flat'; + } + return { label: k, exceptionType: (n && n.exceptionType) || (b && b.exceptionType) || '', appCodeFrame: (n && n.appCodeFrame) || (b && b.appCodeFrame) || '', exceptionMessage: (n && n.exceptionMessage) || (b && b.exceptionMessage) || '', firstOccurrenceNew: (n && n.firstOccurrence) || '', baseCount, newCount, basePer1k, newPer1k, rateDeltaPer1k, baseShare, newShare, shareDeltaPp, newDevices: n ? n.devices : 0, status }; + }); + // Sort by prevalence on the NEW build (per-1k rate if known, else crash-share); movers + // re-ranks by its own delta-col internally, so this governs only the human-readable file. + rows.sort((x, y) => (haveRate ? (y.newPer1k - x.newPer1k) : (y.newShare - x.newShare))); + + const cols = ['label', 'exceptionType', 'appCodeFrame', 'exceptionMessage', 'firstOccurrenceNew', 'baseCount', 'newCount', 'basePer1k', 'newPer1k', 'rateDeltaPer1k', 'baseSharePct', 'newSharePct', 'shareDeltaPp', 'newDevices', 'status']; + const items = rows.slice(0, top).map(r => [ + r.label, r.exceptionType, r.appCodeFrame, r.exceptionMessage, r.firstOccurrenceNew, + r.baseCount, r.newCount, r.basePer1k, r.newPer1k, r.rateDeltaPer1k, + r.baseShare, r.newShare, r.shareDeltaPp, r.newDevices, r.status + ]); + emit({ meta: { version: a.version, base: a.base, totalCrashesNew: totNew, totalCrashesBase: totBase, devicesNew: devNew, devicesBase: devBase, newRatePer1k: haveRate ? 
+(1000 * totNew / devNew).toFixed(3) : null, baseRatePer1k: haveRate ? +(1000 * totBase / devBase).toFixed(3) : null, start: startIso }, results: { items: [cols, ...items] } }, a.out); +} + +// Classify a per-group daily series into a trend tag (pattern P4: tell an early-rollout spike that +// decays from a sustained regression). Compares the first vs second half of the window and where the +// peak day sits. +function trendOf(days) { + const nz = days.filter(d => d.count > 0); + const total = days.reduce((s, d) => s + d.count, 0); + if (total === 0) return { trend: 'none', total: 0, peakDay: '', lastDay: '', half1: 0, half2: 0 }; + const mid = Math.floor(days.length / 2); + const half1 = days.slice(0, mid).reduce((s, d) => s + d.count, 0); + const half2 = days.slice(mid).reduce((s, d) => s + d.count, 0); + const peak = days.reduce((p, d) => d.count > p.count ? d : p, days[0]); + const peakIdx = days.indexOf(peak); + const tail = days.slice(-3).reduce((s, d) => s + d.count, 0) / Math.min(3, days.length); + let trend; + if (total < 30) trend = 'low-volume'; + else if (peakIdx < days.length - 3 && tail < peak.count * 0.4) trend = 'spike-then-decay'; + else if (half2 >= half1 * 1.5) trend = 'rising'; + else if (half1 >= half2 * 1.5) trend = 'decaying'; + else trend = 'steady'; + const dt = d => (d.datetime || '').slice(0, 10); + return { trend, total, peakDay: dt(peak), lastDay: dt(nz.length ? nz[nz.length - 1] : peak), half1, half2 }; +} + +// For the top-N signatures on ONE version, pull the per-group daily TREND (P4) and an instance-sampled +// OS-major + device-model CONCENTRATION (P6) — the diagnostics the list view can't give. Aggregate +// endpoints (operatingSystemCounts/modelCounts) 404 for this app, so dimensions come from a capped +// sample of errorGroups/{id}/errors instances (osVersion / deviceName / country). 
+async function enrichMode(a, token) { + const startIso = startIsoFromArgs(a); + const top = parseInt(a.top || '8', 10); + const instPages = parseInt(a['instance-pages'] || '4', 10); // 4 pages ≈ 400 instances sampled / group + const base = `${API}/apps/${a.owner}/${a.app}/errors`; + const groups = await fetchGroups(a.owner, a.app, a.version, startIso, token, pageCapFromArgs(a, 12), !!a['include-hidden']); + const picked = groups.slice(0, top); + process.stderr.write(`enrich ${a.version}: ${picked.length} top signatures (trend + instance-sampled OS/model)\n`); + const items = []; + for (const g of picked) { + const id = g.errorGroupId; + let trend = { trend: 'n/a', peakDay: '', lastDay: '', half1: 0, half2: 0 }; + try { + const d = await getJson(`${base}/errorGroups/${id}/errorCountsPerDay?version=${encodeURIComponent(a.version)}&start=${encodeURIComponent(startIso)}`, token); + trend = trendOf(d.errors || []); + } catch (e) { process.stderr.write(` trend ${id}: ${e.message}\n`); } + // instance sample → OS-major + model concentration + const oss = {}, mods = {}; let sampled = 0; + let url = `${base}/errorGroups/${id}/errors?version=${encodeURIComponent(a.version)}&start=${encodeURIComponent(startIso)}&$top=100`; + for (let p = 0; p < instPages && url; p++) { + let r; try { r = await getJson(url, token); } catch (e) { break; } + for (const ev of (r.errors || [])) { sampled++; const om = String(ev.osVersion || '').split('.')[0] || '?'; oss[om] = (oss[om] || 0) + 1; const dn = ev.deviceName || '?'; mods[dn] = (mods[dn] || 0) + 1; } + url = absLink(r.nextLink); + } + const top1 = o => { const e = Object.entries(o).sort((x, y) => y[1] - x[1])[0]; return e ? { k: e[0], pct: sampled ? 
+(100 * e[1] / sampled).toFixed(1) : 0 } : { k: '', pct: 0 }; }; + const o1 = top1(oss), m1 = top1(mods); + items.push([ + labelOf(g), g.exceptionType || '', appFrameOf(g), g.count || 0, g.deviceCount || 0, + trend.trend, trend.peakDay, trend.lastDay, + o1.k, o1.pct, m1.k, m1.pct, sampled + ]); + } + const cols = ['label', 'exceptionType', 'appCodeFrame', 'count', 'deviceCount', 'trend', 'peakDay', 'lastDay', 'topOsMajor', 'osConcentrationPct', 'topModel', 'modelConcentrationPct', 'sampleN']; + emit({ meta: { version: a.version, start: startIso, signatures: picked.length, note: 'OS/model are instance-sampled concentrations (capped), not exact totals' }, results: { items: [cols, ...items] } }, a.out); +} + +// "What crashes are GENUINELY NEW in this release?" — anti-join the new version's signatures against +// the UNION of several recent PRIOR versions (not just the immediate baseline). App Center's per-version +// firstOccurrence is the version's ROLLOUT date, NOT the signature's app-history first-seen, so a crash +// can show firstOccurrence INSIDE the window yet be many releases old (verified live: the okhttp +// http-cache journal IOException had firstOccurrence = rollout day on the new build but exists on every +// recent version). Only "absent from ALL listed priors within the 27-day API window AND present on the +// new build" earns "genuinely-new". Still-active prior versions keep throwing structural/environmental +// crashes, so the 27-day anti-join catches them; a defect introduced THIS release is absent from priors. +async function newCrashesMode(a, token) { + if (!a.priors) throw new Error('newcrashes needs --priors (recent prior versions to anti-join against)'); + const startIso = startIsoFromArgs(a); + const cap = pageCapFromArgs(a, 0); // exhaust by default — a "new" verdict must not miss a prior occurrence + const top = parseInt(a.top || '40', 10); + const minCount = parseInt(a['min-count'] || '5', 10); + const devNew = a['devices-new'] ? 
parseFloat(a['devices-new']) : null; + const priors = String(a.priors).split(',').map(s => s.trim()).filter(Boolean); + const [gNew, ...gPriorsArr] = await Promise.all([ + fetchGroups(a.owner, a.app, a.version, startIso, token, cap, !!a['include-hidden']), + ...priors.map(v => fetchGroups(a.owner, a.app, v, startIso, token, cap, !!a['include-hidden'])), + ]); + const mNew = aggBySig(gNew); + const priorHits = new Map(); // signature -> { versions:[], maxCount } + priors.forEach((v, i) => { + for (const [sig, e] of aggBySig(gPriorsArr[i])) { + let p = priorHits.get(sig); if (!p) { p = { versions: [], maxCount: 0 }; priorHits.set(sig, p); } + p.versions.push(v); p.maxCount = Math.max(p.maxCount, e.count); + } + }); + process.stderr.write(`new ${a.version}: ${mNew.size} signatures; priors [${priors.join(', ')}] contribute ${priorHits.size} signatures since ${startIso}\n`); + // A native crash whose only frame is a raw address (e.g. "0x1d0c37a8 + 481192") or a bare signal + // (SIGABRT/SIGSEGV/minidump) has a signature that DIFFERS across builds (the address is relocated + // per binary), so it ALWAYS anti-joins as "absent from priors" — a false genuinely-new. Tag those + // native/unsymbolized so the actionable JAVA-frame new crashes stand out; a native row needs OS/model + // + count corroboration (enrich), not the signature anti-join, to judge whether it is truly new. + const isNative = (label, type) => /^0x[0-9a-f]+\b/i.test(String(label || '')) || /SIG|minidump|native|SEGV|SIGABRT|ABRT|ILL_|BUS_|TRAP/i.test(String(type || '')); + const rows = [...mNew.values()].map(e => { + const hit = priorHits.get(e.label); + const newPer1k = devNew > 0 ? +(1000 * e.count / devNew).toFixed(3) : null; + const frameKind = isNative(e.label, e.exceptionType) ? 'native' : 'java'; + let verdict; + if (e.count < minCount) verdict = 'low-volume'; + else if (hit) verdict = 'pre-existing'; + else verdict = frameKind === 'native' ? 'new-native?' 
: 'genuinely-new'; + return { ...e, frameKind, newPer1k, priorVersionsHit: hit ? hit.versions.join(',') : '', maxPriorCount: hit ? hit.maxCount : 0, verdict }; + }); + // Actionable java-frame new crashes first, then native-suspect, then pre-existing, then low-volume. + const rank = v => v === 'genuinely-new' ? 0 : v === 'new-native?' ? 1 : v === 'pre-existing' ? 2 : 3; + rows.sort((x, y) => rank(x.verdict) - rank(y.verdict) || y.count - x.count); + const nNew = rows.filter(r => r.verdict === 'genuinely-new').length; + const nNative = rows.filter(r => r.verdict === 'new-native?').length; + process.stderr.write(` => ${nNew} genuinely-new java-frame signature(s) + ${nNative} native-unsymbolized suspect(s), >= ${minCount} crashes\n`); + const cols = ['label', 'exceptionType', 'appCodeFrame', 'exceptionMessage', 'firstSeenNew', 'newCount', 'newDevices', 'newPer1k', 'frameKind', 'priorVersionsHit', 'maxPriorCount', 'verdict']; + const items = rows.slice(0, top).map(r => [r.label, r.exceptionType, r.appCodeFrame, r.exceptionMessage, r.firstOccurrence, r.count, r.devices, r.newPer1k, r.frameKind, r.priorVersionsHit, r.maxPriorCount, r.verdict]); + emit({ meta: { version: a.version, priors, start: startIso, devicesNew: devNew, genuinelyNew: nNew, newNativeSuspect: nNative, note: 'genuinely-new = JAVA-frame signature absent from ALL listed priors within the 27-day API window. native/hex-frame rows (verdict new-native?) have build-unique signatures and ALWAYS anti-join as new — corroborate with enrich (OS/model + count), not the signature. firstSeenNew is the version ROLLOUT date, not app-history first-seen.' }, results: { items: [cols, ...items] } }, a.out); +} + +// Cross-version presence of ONE crash SIGNATURE (+ optional daily trend on the primary version): +// "is crash X specific to this release, or pre-existing across versions?" 
Pages each version's groups +// and aggregates every group whose codeRaw / class.method:line / exceptionMessage / exceptionType +// contains --match (case-insensitive). The primary version is --version; --priors is the comparison set. +async function signatureMode(a, token) { + if (!a.match) throw new Error('signature mode needs --match '); + const startIso = startIsoFromArgs(a); + const cap = pageCapFromArgs(a, 0); + const needle = String(a.match).toLowerCase(); + const hit = g => [g.codeRaw, appFrameOf(g), g.exceptionMessage, g.exceptionType].some(s => String(s || '').toLowerCase().includes(needle)); + const versions = [a.version, ...String(a.priors || '').split(',').map(s => s.trim()).filter(Boolean)]; + process.stderr.write(`signature "${a.match}" across ${versions.length} version(s) since ${startIso}\n`); + const rows = []; + let primaryId = null; + for (const v of versions) { + const gs = await fetchGroups(a.owner, a.app, v, startIso, token, cap, !!a['include-hidden']); + const matched = gs.filter(hit); + const count = matched.reduce((s, g) => s + (g.count || 0), 0); + const devices = matched.reduce((s, g) => s + (g.deviceCount || 0), 0); + const first = matched.map(g => g.firstOccurrence).filter(Boolean).sort()[0] || ''; + const last = matched.map(g => g.lastOccurrence).filter(Boolean).sort().slice(-1)[0] || ''; + const id = matched[0] ? matched[0].errorGroupId : ''; + if (v === a.version) primaryId = id; + rows.push([v, matched.length > 0 ? 
'YES' : 'no', matched.length, count, devices, first, last, id]); + process.stderr.write(` ${v}: ${matched.length} group(s), ${count} crashes, ${devices} devices\n`); + } + let trend = null; + if ((a.trend === true || a.trend === 'true') && primaryId) { + try { + const d = await getJson(`${API}/apps/${a.owner}/${a.app}/errors/errorGroups/${primaryId}/errorCountsPerDay?version=${encodeURIComponent(a.version)}&start=${encodeURIComponent(startIso)}`, token); + trend = { ...trendOf(d.errors || []), series: (d.errors || []).map(e => [String(e.datetime).slice(0, 10), e.count]) }; + process.stderr.write(` trend(${a.version}): ${trend.trend} (peak ${trend.peakDay}, last ${trend.lastDay})\n`); + } catch (e) { process.stderr.write(` trend err: ${e.message}\n`); } + } + const cols = ['version', 'found', 'matchedGroups', 'count', 'devices', 'firstOccurrence', 'lastOccurrence', 'errorGroupId']; + emit({ meta: { match: a.match, primaryVersion: a.version, start: startIso, trend, note: 'firstOccurrence is the version ROLLOUT date, not app-history first-seen' }, results: { items: [cols, ...rows] } }, a.out); +} + +async function main() { + const a = parseArgs(process.argv.slice(2)); + const mode = a._[0]; + if (!mode || !a.owner || !a.app || !a.version) { + console.error('usage: fetch-appcenter-crashes.js --owner --app --version [--base ] [--priors ] [--match ] [--trend] [--days 14] [--start ISO] [--top N] [--min-count 5] [--page-cap N|0] [--instance-pages 4] [--devices-new N] [--devices-base N] [--include-hidden] [--out f.json]'); + process.exit(2); + } + const token = resolveToken(a); + if (mode === 'groups') await groupsMode(a, token); + else if (mode === 'diff') await diffMode(a, token); + else if (mode === 'enrich') await enrichMode(a, token); + else if (mode === 'newcrashes') await newCrashesMode(a, token); + else if (mode === 'signature') await signatureMode(a, token); + else { console.error('unknown mode: ' + mode); process.exit(2); } +} +main().catch(e => { console.error('ERROR: ' + e.message); process.exit(1); }); diff --git 
a/.github/skills/release-monitoring-report/assets/scripts/find-suspect-prs.ps1 b/.github/skills/release-monitoring-report/assets/scripts/find-suspect-prs.ps1 new file mode 100644 index 00000000..6a263353 --- /dev/null +++ b/.github/skills/release-monitoring-report/assets/scripts/find-suspect-prs.ps1 @@ -0,0 +1,260 @@ +<# +.SYNOPSIS + Find candidate PRs touching a class / file / method, across broker/, common/ and the + authenticator/ app repo in parallel. Supports BOTH a date window (weekly-style) AND a + release version range (release-style). + +.DESCRIPTION + Given a symbol (or arbitrary regex), runs `git log -S` (pickaxe, diff-content), optionally + `git log -G` (diff-text regex) AND `git log --grep` (commit subject) against the selected + repos, then prints a unified table sorted by date. + + Two windowing modes: + * Date window — default; -Since / -Until (weekly on-call use). + * Version range — pass -Range 'v16.1.0..v16.2.0' (broker tags) or '6.2606.3817..6.2606.4029' + (authenticator tags) to correlate exactly the commits that shipped between two releases + (release-monitoring use). When -Range is set, -Since / -Until are ignored. Each repo is + resolved by trying the range endpoints as its OWN refs first (works for broker and the + authenticator app, which each carry the relevant tags); a repo that lacks the tags but is + pinned as a broker submodule (common) is mapped via the broker submodule pointer; otherwise + it is skipped with a warning (never silent). + + Two attribution flavours: + * Server-returned auth errors (invalid_grant / interaction_required) — the trigger is in + broker/ + common/ (the default repo set). Weight device-PoP / PRT / cache-path changes. + * Authenticator crashes (new or rising App Center signatures) — search the owning repo with + -Repos authenticator and a Range of the bundled app tags. CRITICAL: a crash frame names the + object being inspected (the VICTIM), which is often NOT the file that broke. 
The culprit is + usually a CALLER that passes that object into a failing API. So set -Symbol to the + exception/API token from the stack (e.g. 'EntryPoints.get', 'GeneratedComponent'), NOT the + crashing class — the pickaxe then finds the caller that introduced the bad call. ALWAYS also + pass -DiffGrep with the same token, because a culprit PR's SUBJECT almost never mentions the + subsystem it broke (e.g. a "TOTP Secret Fix" PR that added a Hilt EntryPoints.get call) so + --grep alone misses it. Verified: searching -Symbol MfaAuthDialogActivity found nothing, but + -Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints' surfaced the real culprit on the first try. + + Use this AFTER reading the full `git log ` (ranges between two releases are small) + and identifying the suspect code path. + +.PARAMETER Symbol + String to search for in commit diffs (passed to `git log -S`). Typically the class / method on + the suspect path, e.g. 'AbstractDevicePopManager', 'generateAsymmetricKey'. For a CRASH, use the + exception/API token from the stack (e.g. 'EntryPoints.get'), not the crashing class — the culprit + is the caller that passes the crashing class into the failing API. + +.PARAMETER GrepRegex + Optional regex for `git log --grep` (commit SUBJECT only — low recall; a PR rarely names the + subsystem it breaks). When omitted, defaults to the regex-escaped -Symbol. Prefer -DiffGrep for crashes. + +.PARAMETER DiffGrep + Optional regex for `git log -G` (matches the DIFF TEXT, not just the subject). Use this for crash + attribution so a culprit whose subject never mentions the broken subsystem is still found. + +.PARAMETER Range + Git revision range, e.g. 'v16.1.0..v16.2.0'. When set, overrides -Since / -Until. + +.PARAMETER Since + Inclusive start date (yyyy-MM-dd). Defaults to 28 days ago. Ignored if -Range is set. + +.PARAMETER Until + Inclusive end date. Defaults to today. Ignored if -Range is set. 
+ +.PARAMETER RepoRoot + Root folder containing `broker/`, `common/` and `authenticator/` subfolders. Defaults to the + git top-level of the current working directory (so running from any clone of android-complete + works). + +.PARAMETER Repos + Which repos to search. Defaults to broker + common (the auth-code attribution set). For crash + attribution pass -Repos authenticator (the crashing frame's owning repo). Accepts any subset of + broker, common, authenticator. + +.EXAMPLE + .\find-suspect-prs.ps1 -Symbol AbstractDevicePopManager -Range v16.1.0..v16.2.0 + +.EXAMPLE + .\find-suspect-prs.ps1 -Symbol generateAsymmetricKey -Since 2026-05-01 -Until 2026-06-19 + +.EXAMPLE + # Authenticator crash attribution — search the EXCEPTION TOKEN from the stack (not the crashing + # class) and use -DiffGrep so a culprit with an unrelated subject is still found. This exact + # search surfaced PR 15896454 ("TOTP Secret Fix") as the culprit for the dagger.hilt crash on + # MfaAuthDialogActivity, which a -Symbol MfaAuthDialogActivity search had completely missed: + .\find-suspect-prs.ps1 -Repos authenticator -Range 6.2606.3817..6.2606.4029 ` + -Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints|GeneratedComponent' + +.NOTES + Cites repos with the URL patterns: broker -> ad-accounts-for-android (GitHub PR), + common -> microsoft-authentication-library-common-for-android (GitHub PR), + authenticator -> AD-MFA-phonefactor-phoneApp-android (ADO pullrequest; PR # parsed from the + "Merged PR NNNNNNNN:" commit-subject convention). +#> +[CmdletBinding()] +param( + [Parameter(Mandatory=$true)][string]$Symbol, + [string]$GrepRegex, + [string]$DiffGrep, + [string]$Range, + [string]$Since = (Get-Date).AddDays(-28).ToString('yyyy-MM-dd'), + [string]$Until = (Get-Date).ToString('yyyy-MM-dd'), + [string]$RepoRoot, + [string[]]$Repos +) + +# Resolve repo root: explicit -RepoRoot wins; otherwise discover via `git rev-parse --show-toplevel`. 
+if (-not $RepoRoot) { + $gitRoot = (git rev-parse --show-toplevel 2>$null) + if ($gitRoot) { + $RepoRoot = $gitRoot.Trim() + } else { + $RepoRoot = (Join-Path $env:USERPROFILE 'Repos\android-complete') + Write-Warning "Not inside a git working tree. Falling back to legacy default: $RepoRoot. Pass -RepoRoot explicitly to silence this." + } +} + +if (-not $GrepRegex) { $GrepRegex = [regex]::Escape($Symbol) } + +$repoDefs = @( + @{ Name='broker'; Path=(Join-Path $RepoRoot 'broker'); UrlBase='https://github.com/identity-authnz-teams/ad-accounts-for-android/pull/' } + @{ Name='common'; Path=(Join-Path $RepoRoot 'common'); UrlBase='https://github.com/AzureAD/microsoft-authentication-library-common-for-android/pull/' } + @{ Name='authenticator'; Path=(Join-Path $RepoRoot 'authenticator'); UrlBase='https://msazure.visualstudio.com/One/_git/AD-MFA-phonefactor-phoneApp-android/pullrequest/'; Ado=$true } +) + +# Which repos to search. Default = broker + common (auth-code attribution); authenticator is +# only searched when explicitly requested (-Repos authenticator) so the broker/common auth-code +# flow never noisily probes the app repo with a broker-tag range. +if ($Repos -and $Repos.Count -gt 0) { + $unknown = @($Repos | Where-Object { $_ -notin $repoDefs.Name }) + if ($unknown.Count -gt 0) { + Write-Error "Unknown -Repos value(s): $($unknown -join ', '). Known: $($repoDefs.Name -join ', ')." + exit 2 + } + $repoDefs = @($repoDefs | Where-Object { $Repos -contains $_.Name }) +} else { + $repoDefs = @($repoDefs | Where-Object { $_.Name -in @('broker','common') }) +} + +# FAIL LOUDLY if none of the requested subrepos exist under the resolved root. +$availableRepos = @($repoDefs | Where-Object { Test-Path $_.Path }) +if ($availableRepos.Count -eq 0) { + Write-Error @" +None of the requested repos ($($repoDefs.Name -join ', ')) found under -RepoRoot $RepoRoot. 
+ +Expected layout: + $RepoRoot\broker\ (clone of identity-authnz-teams/ad-accounts-for-android) + $RepoRoot\common\ (clone of AzureAD/microsoft-authentication-library-common-for-android) + $RepoRoot\authenticator\ (clone of msazure One/AD-MFA-phonefactor-phoneApp-android) + +Pass -RepoRoot pointing at the parent of those clones. The android-complete mono-repo +at the repo root works because broker/, common/ and authenticator/ are submodules there. +"@ + exit 2 +} +if ($availableRepos.Count -lt $repoDefs.Count) { + $missing = $repoDefs | Where-Object { -not (Test-Path $_.Path) } | ForEach-Object { $_.Name } + Write-Warning "Skipping $($missing -join ', ') — not found under $RepoRoot. Results will be incomplete." +} + +# Build the per-repo windowing args. For -Range, a repo is resolved by trying the range +# endpoints as its OWN refs first: broker (v16.x tags) and the authenticator app +# (6.xxxx.xxxx tags) both carry the relevant release tags, so they use the range directly. +# A repo that lacks the tags but is pinned as a broker submodule (common — which reuses the +# v16.x numbers for an unrelated older namespace) is mapped to the SHA the broker tree pins +# for it at each endpoint, so it scans exactly the commits that shipped in that broker release. +$brokerPath = (Join-Path $RepoRoot 'broker') +function Get-WindowArgs($repo) { + if (-not $Range) { return @("--since=$Since", "--until=$Until") } + + $ends = $Range -split '\.\.', 2 + if ($ends.Count -ne 2) { Write-Warning "Malformed -Range '$Range'."; return @() } + + # 1) If BOTH endpoints resolve as this repo's own refs (tags/commits), use the range directly. + $bothLocal = $true + foreach ($e in $ends) { + if ($e -and -not (git -C $repo.Path rev-parse --verify --quiet "$e^{commit}" 2>$null)) { $bothLocal = $false; break } + } + if ($bothLocal) { return @($Range) } + + # 2) Otherwise map each broker tag -> the SHA the broker tree pins for this repo as a submodule. 
+ if (-not (Test-Path $brokerPath)) { + Write-Warning "broker/ not found at $brokerPath; cannot translate '$Range' to the $($repo.Name) submodule range. Skipping $($repo.Name)." + return @() + } + $subName = $repo.Name # submodule directory name inside the broker tree (e.g. 'common') + $sha = @() + foreach ($e in $ends) { + $entry = git -C $brokerPath ls-tree $e $subName 2>$null + if ($entry -match '160000 commit ([0-9a-f]{40})') { + $s = $Matches[1] + if (-not (git -C $repo.Path rev-parse --verify --quiet "$s^{commit}" 2>$null)) { + Write-Warning "Submodule SHA $s (from broker $e) not present in $($repo.Path) — run 'git -C $($repo.Path) fetch'. Skipping $($repo.Name)." + return @() + } + $sha += $s + } else { + Write-Warning "Could not read '$subName' submodule pointer at broker '$e'. Skipping $($repo.Name)." + return @() + } + } + return @("$($sha[0])..$($sha[1])") +} + +$results = @() +foreach ($r in $availableRepos) { + $winArgs = @(Get-WindowArgs $r) + if ($winArgs.Count -eq 0) { continue } + Push-Location $r.Path + try { + # Pickaxe: PRs whose diff added or removed the symbol (content-level — finds the CALLER + # that introduced/removed a reference, which for a crash is usually the culprit, not the + # crashing class itself). Set -Symbol to the exception/API token from the stack. + $pickaxeRaw = git log @winArgs -S $Symbol --pretty=format:'%h|%ai|%an|%s' 2>$null + # Diff-content grep (-G): PRs whose DIFF text matches the regex anywhere — catches a culprit + # whose subject never mentions the broken subsystem (the #1 reason --grep misses a crash PR). + $diffGrepRaw = if ($DiffGrep) { git log @winArgs -G $DiffGrep --pretty=format:'%h|%ai|%an|%s' 2>$null } else { @() } + # Grep: PRs whose SUBJECT mentions the regex (case-insensitive) — subject-only, low recall. 
+ $grepRaw = if ($GrepRegex) { git log @winArgs --pretty=format:'%h|%ai|%an|%s' --grep=$GrepRegex -i 2>$null } else { @() } + + $seen = @{} + foreach ($line in @($pickaxeRaw, $diffGrepRaw, $grepRaw | Where-Object { $_ })) { + foreach ($l in @($line)) { + if (-not $l) { continue } + $parts = $l -split '\|', 4 + if ($parts.Count -lt 4) { continue } + $sha = $parts[0] + if ($seen.ContainsKey($sha)) { continue } + $seen[$sha] = $true + # Pull the PR number from the subject. GitHub squash-merges read "...(#NNN)"; + # ADO merges read "Merged PR NNNNNNNN: " (8-digit PR ids, no '#'). + $prNum = $null + if ($r.Ado) { + if ($parts[3] -match 'Merged PR (\d+)') { $prNum = $Matches[1] } + } else { + if ($parts[3] -match '#(\d{2,5})\b') { $prNum = $Matches[1] } + } + $prLabel = if (-not $prNum) { '' } elseif ($r.Ado) { 'PR ' + $prNum } else { '#' + $prNum } + $results += [pscustomobject]@{ + Repo = $r.Name + Date = $parts[1].Substring(0, 10) + Author = $parts[2] + Sha = $sha + PR = $prLabel + Url = if ($prNum) { $r.UrlBase + $prNum } else { '' } + Subject = $parts[3] + } + } + } + } finally { Pop-Location } +} + +$windowLabel = if ($Range) { "range $Range" } else { "window $Since .. $Until" } +if ($results.Count -eq 0) { + Write-Host "No PRs match in $windowLabel for symbol '$Symbol'." + Write-Host " Tip: try a shorter symbol (just the class name), or widen the window/range." 
+    exit 0
+}
+
+$results | Sort-Object Date -Descending | Format-Table Repo, Date, Author, Sha, PR, @{n='Subject';e={$_.Subject.Substring(0, [Math]::Min(80, $_.Subject.Length))}} -AutoSize
+Write-Host ""
+Write-Host "PR URLs for citation in attribution cards ($windowLabel):"
+$results | Where-Object Url | Sort-Object Date -Descending | ForEach-Object { Write-Host " $($_.Repo) $($_.PR): $($_.Url)" }
diff --git a/.github/skills/release-monitoring-report/assets/scripts/run-kql.ps1 b/.github/skills/release-monitoring-report/assets/scripts/run-kql.ps1
new file mode 100644
index 00000000..38f9980a
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/scripts/run-kql.ps1
@@ -0,0 +1,101 @@
+<#
+.SYNOPSIS
+    Direct-REST Kusto query helper. Drop-in fallback for the Azure Kusto MCP server
+    when the MCP times out (the MCP has a 240 s budget and frequently exceeds it on
+    the per-error-code queries this skill needs).
+
+.DESCRIPTION
+    Acquires an Entra token via the local `az` CLI for the Kusto cluster, POSTs the
+    query to /v2/rest/query, and writes a JSON file whose schema matches what the
+    other helpers in this skill (compare-versions.js) expect:
+
+      { "results": { "items": [
+          [colName0, colName1, ...], // first row = column-name list
+          [row0col0, row0col1, ...],
+          [row1col0, row1col1, ...],
+          ...
+      ] } }
+
+    `compare-versions.js` reads this array-form schema directly — no transformer step needed.
+
+.PARAMETER Query
+    KQL query text. Pass via single-quoted PowerShell here-string for safety.
+
+.PARAMETER Out
+    Output JSON file path.
+
+.PARAMETER Cluster
+    Kusto cluster URI (default: idsharedeus2 — the production Android Broker cluster).
+
+.PARAMETER Database
+    Database name (default: ad-accounts-android-otel).
+
+.PARAMETER TimeoutSec
+    HTTP timeout (default 300 s — Kusto itself has a 5-minute server-side query budget).
+
+.EXAMPLE
+    # Sanity check
+    .\run-kql.ps1 -Query 'print x=1' -Out test.json
+
+.EXAMPLE
+    # Pull the 60-day per-error-code trend
+    $q = @'
+materialized_view('ErrorStatsMetrics')
+| where EventInfo_Time between (datetime(2026-04-12) .. datetime(2026-06-07))
+| where isnotempty(error_code) and error_code != 'success'
+| summarize errs = sum(countOverall),
+            devs = dcount_hll(hll_merge(countDevicesHll))
+        by week = startofweek(EventInfo_Time), error_code
+| where week < datetime(2026-06-07)
+| order by error_code asc, week asc
+'@
+    .\run-kql.ps1 -Query $q -Out 60d-codes.json
+
+.NOTES
+    * Requires `az login` to have been run beforehand and the caller to have read
+      access to the cluster (Android Auth Client SDK security group).
+    * Runs one query per invocation; it can be run in parallel from PowerShell jobs.
+      See SKILL.md for the "pull-many-in-parallel" pattern.
+    * If the query result is large (>50 KB returned), the JSON file may itself
+      be large; pass it to compare-versions.js rather than viewing in-band.
+#>
+[CmdletBinding()]
+param(
+    [Parameter(Mandatory=$true)][string]$Query,
+    [Parameter(Mandatory=$true)][string]$Out,
+    [string]$Cluster = 'https://idsharedeus2.kusto.windows.net',
+    [string]$Database = 'ad-accounts-android-otel',
+    [int]$TimeoutSec = 300
+)
+$ErrorActionPreference = 'Stop'
+
+# Acquire token via az CLI (works for users + managed identity)
+$tok = az account get-access-token --resource $Cluster --query accessToken -o tsv 2>$null
+if (-not $tok) {
+    throw "Failed to acquire token for $Cluster. Run 'az login' first and verify membership in the Android Auth Client SDK security group."
+}
+
+$body = @{ csl = $Query; db = $Database } | ConvertTo-Json -Compress
+$resp = Invoke-RestMethod -Uri "$Cluster/v2/rest/query" -Method Post `
+    -Headers @{ Authorization = "Bearer $tok"; 'Content-Type' = 'application/json' } `
+    -Body $body -TimeoutSec $TimeoutSec
+
+# Find the PrimaryResult table (Kusto returns multiple frame types; we want the data)
+$primary = $resp | Where-Object { $_.FrameType -eq 'DataTable' -and $_.TableKind -eq 'PrimaryResult' } | Select-Object -First 1
+if (-not $primary) {
+    # Surface any error frames so the caller can see what went wrong
+    $err = $resp | Where-Object { $_.FrameType -eq 'DataSetCompletion' -and $_.HasErrors } | Select-Object -First 1
+    if ($err) { throw "Kusto query failed with errors. Full response:`n$($resp | ConvertTo-Json -Depth 6)" }
+    throw 'No PrimaryResult table in response'
+}
+
+# Convert to the canonical schema the JS helpers expect
+$colNames = @($primary.Columns | ForEach-Object { $_.ColumnName })
+$items = New-Object System.Collections.ArrayList
+[void]$items.Add($colNames)
+foreach ($r in $primary.Rows) { [void]$items.Add($r) }
+
+$obj = @{ results = @{ items = $items } }
+# UTF-8 without BOM. $Out is resolved against the PowerShell location because .NET file APIs use the process working directory.
+[IO.File]::WriteAllText($ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Out), ($obj | ConvertTo-Json -Depth 12 -Compress), [System.Text.UTF8Encoding]::new($false))
+Write-Host ("Saved {0} rows -> {1}" -f ($primary.Rows.Count), $Out)
diff --git a/.github/skills/release-monitoring-report/assets/scripts/validate-report.ps1 b/.github/skills/release-monitoring-report/assets/scripts/validate-report.ps1
new file mode 100644
index 00000000..0a27185c
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/scripts/validate-report.ps1
@@ -0,0 +1,116 @@
+<#
+.SYNOPSIS
+    Validate a filled release-monitoring report before it ships.
+
+.DESCRIPTION
+    A LEAN structural/QA check (the oncall validator is heavier and tuned to its attribution
+    cards). Checks that catch the mistakes that actually happen when an agent edits the
+    template in place:
+
+      ERRORS (exit 1):
+        - Leftover template tokens: '{{', 'EXAMPLE CONTENT', literal '<FIRST>' / '<SECOND>'
+          / '<TOKEN>' / '<bv>' / '<av>' / '<date>' placeholders OUTSIDE the leading HTML comment.
+        - U+FFFD replacement char (mojibake from a bad heredoc round-trip).
+        - Raw device/request counts leaking into user-facing prose
+          ("585300000 devices" style) — should be humanized (585.3M).
+        - A version under test was supplied but its string never appears in the body.
+        - A leaked App Center secret: an 'X-API-Token' header, 'APPCENTER_API_TOKEN', or a
+          reference to the 'appcenter.token' file must never appear in a shipped report.
+
+      WARNINGS (exit 0, but printed):
+        - A version was supplied but its app section anchor (id/class 'broker' or 'auth') is missing.
+        - KPI tiles with no data-spark / data-trend attribute (soft — release KPIs are
+          2-point compares, not series, so this is informational only).
+        - 'Generated <strong>...</strong>' still at the template's placeholder date.
+        - Fewer than 2 verdict callouts (.callout) — a release report should state a verdict.
+        - AuthVersion given but no #auth-stability section (crash layer skipped — fine if no
+          App Center token was available).
+        - AuthVersion given but no #auth-broker section (broker-via-Authenticator health skipped).
+        - A 40-hex-char string that could be an App Center token (verify it is not a secret).
+
+.PARAMETER Path
+    Path to the report HTML to validate. Required.
+
+.PARAMETER BrokerVersion
+    If provided, asserts the string appears in the body and warns if the broker section anchor is missing.
+
+.PARAMETER AuthVersion
+    If provided, asserts the string appears in the body and warns if the authenticator section anchor is missing.
+
+.EXAMPLE
+    .\validate-report.ps1 -Path ~/android-release-reports/release-report-broker-16.1.0-auth-6.2606.3817-2026-01-15.html `
+        -BrokerVersion 16.1.0 -AuthVersion 6.2606.3817
+#>
+[CmdletBinding()]
+param(
+    [Parameter(Mandatory)][string]$Path,
+    [string]$BrokerVersion,
+    [string]$AuthVersion
+)
+$ErrorActionPreference = 'Stop'
+if (-not (Test-Path $Path)) { throw "Report not found: $Path" }
+
+# Resolve first: .NET file APIs do not expand '~' or honor the PowerShell location.
+$text = [IO.File]::ReadAllText((Resolve-Path $Path).ProviderPath)
+$errors = New-Object System.Collections.Generic.List[string]
+$warns = New-Object System.Collections.Generic.List[string]
+
+# Strip the leading template HTML comment so its <bv>/<av>/<date> mentions don't false-positive.
+# Instance Replace takes a count; the static overload would read the 1 as RegexOptions.
+$body = ([regex]'(?s)^.*?-->').Replace($text, '', 1)
+
+# --- ERRORS ---
+foreach ($tok in @('{{', 'EXAMPLE CONTENT', '<FIRST>', '<SECOND>', '<TOKEN>', '<bv>', '<av>', '<date>')) {
+    if ($body.Contains($tok)) { $errors.Add("Leftover template token '$tok' in body.") }
+}
+if ($body.Contains([char]0xFFFD)) { $errors.Add("U+FFFD replacement char present (mojibake) — re-write file as UTF-8 no BOM.") }
+
+# Raw long counts in prose: a run of >=7 digits immediately followed by ' device' / ' request' / ' req'
+$rawCount = [regex]::Matches($body, '(?<![\d.])\d{7,}(?=\s*(?:devices?|requests?|reqs?)\b)', 'IgnoreCase')
+if ($rawCount.Count -gt 0) {
+    $sample = ($rawCount | Select-Object -First 3 | ForEach-Object { $_.Value }) -join ', '
+    $errors.Add("Raw un-humanized count(s) in prose ($sample ...). Use M/K (e.g. 585.3M).")
+}
+
+if ($BrokerVersion) {
+    if ($body -notmatch [regex]::Escape($BrokerVersion)) { $errors.Add("BrokerVersion '$BrokerVersion' never appears in body.") }
+    if ($body -notmatch '(?i)id="broker|class="[^"]*broker') { $warns.Add("No broker section anchor (id/class 'broker') found.") }
+}
+if ($AuthVersion) {
+    if ($body -notmatch [regex]::Escape($AuthVersion)) { $errors.Add("AuthVersion '$AuthVersion' never appears in body.") }
+    if ($body -notmatch '(?i)id="auth|class="[^"]*auth') { $warns.Add("No authenticator section anchor (id/class 'auth') found.") }
+    if ($body -notmatch 'id="auth-stability"') { $warns.Add("No #auth-stability section — Authenticator crash/stability layer not included (OK only if no App Center token was available).") }
+    if ($body -notmatch 'id="auth-broker"') { $warns.Add("No #auth-broker section — broker-via-Authenticator (active-broker) health not included; add it to attribute broker movement to this app rollout.") }
+}
+
+# Leaked App Center secret — must never ship inside a report.
+if ($body -match '(?i)X-API-Token') { $errors.Add("'X-API-Token' header text present — an App Center secret may have leaked into the report.") }
+if ($body -match '(?i)APPCENTER_API_TOKEN'){ $errors.Add("'APPCENTER_API_TOKEN' present — remove any token/secret references from the report.") }
+if ($body -match '(?i)appcenter\.token') { $errors.Add("Reference to the secret token file 'appcenter.token' present — remove it from the report.") }
+
+# --- WARNINGS ---
+$kpiCount = ([regex]::Matches($body, 'class="kpi"')).Count
+$sparkCount = ([regex]::Matches($body, 'data-spark|data-trend')).Count
+if ($kpiCount -gt 0 -and $sparkCount -eq 0) {
+    $warns.Add("$kpiCount KPI tile(s) but no data-spark/data-trend attributes (soft — OK for 2-point release compares).")
+}
+if ($body -match 'Generated\s+<strong>\s*2026-01-01\s*</strong>') {
+    $warns.Add("Generated date still at the template placeholder (2026-01-01) — bootstrap should have stamped it.")
+}
+$callouts = ([regex]::Matches($body, 'class="callout')).Count
+if ($callouts -lt 2) { $warns.Add("Only $callouts verdict callout(s) — a release report should state a clear verdict per app.") }
+
+# Possible leaked token: a standalone 40-hex-char run (App Center tokens are 40 chars). Advisory —
+# could also be a git SHA, so warn rather than fail.
+$hex40 = [regex]::Matches($body, '(?<![A-Fa-f0-9])[A-Fa-f0-9]{40}(?![A-Fa-f0-9])')
+if ($hex40.Count -gt 0) { $warns.Add("$($hex40.Count) 40-hex-char string(s) present — verify none is an App Center API token (secret).") }
+
+# --- report ---
+"Validating: $Path"
+" KPI tiles: $kpiCount callouts: $callouts spark/trend attrs: $sparkCount"
+if ($warns.Count) { ""; "WARNINGS:"; $warns | ForEach-Object { " ! $_" } }
+if ($errors.Count) {
+    ""; "ERRORS:"; $errors | ForEach-Object { " X $_" }
+    ""; "FAILED with $($errors.Count) error(s)."
+    exit 1
+}
+""; "PASSED" + $(if ($warns.Count) { " with $($warns.Count) warning(s)." } else { "." })
+exit 0
diff --git a/.github/skills/release-monitoring-report/assets/templates/report-template.html b/.github/skills/release-monitoring-report/assets/templates/report-template.html
new file mode 100644
index 00000000..8c757061
--- /dev/null
+++ b/.github/skills/release-monitoring-report/assets/templates/report-template.html
@@ -0,0 +1,924 @@
+<!DOCTYPE html>
+<!-- TEMPLATE - see assets/templates/template-readme.md for usage. This file IS a real
+     prior release-comparison report kept as a structural + visual reference. To produce
+     a new report, run assets/scripts/bootstrap-report.ps1 (it copies this file to
+     ~/android-release-reports/release-report-broker-<bv>-auth-<av>-<date>.html), then edit
+     IN PLACE: replace the version strings, window dates, KPI values, table rows, verdict
+     callout prose, and the appendix. The CSS in <head> is canonical - do not restyle.
+     Numbers below are REAL (pulled while authoring the skill) and illustrate the intended
+     narrative: a Broker release with an error-share regression but a latency win, and a
+     clean flat Authenticator release. Validate with assets/scripts/validate-report.ps1. -->
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>Android Release Monitoring · Broker 16.1.0 · Authenticator 6.2606.3817
+
+
+
+
+ +
+
+

Android Release Monitoring

+
+ Broker 16.1.0 vs 16.0.1 +  ·  Authenticator 6.2606.3817 vs 6.2605.3042 +  ·  Window: last 14 days  ·  + Source: ad-accounts-android-otel + Authenticator OTEL  ·  + Generated 2026-01-01 +
+
+ Template · Live data +
+ +
+ + +

🚦 Release verdict

+ +
+
⚠️ Broker 16.1.0 — HOLD wide rollout
+

16.1.0 is in early rollout (76.4M devices vs 585.3M on 16.0.1). Two signals disagree:

+
    +
  • Regression: overall device error-share is elevated — io_error reaches 82.1% of 16.1.0 devices vs 41.8% on 16.0.1 (+40.3 pp). This dominates the headline device-error-rate gap and must be understood before widening.
  • +
  • Win: latency improved at every percentile through P95 (P95 3,117 ms vs 4,712 ms, −33.8%). Only P99 is slightly worse (+2,064 ms).
  • +
+

Early-rollout cohorts skew toward upgrade/network churn, so part of the io_error lift may be transient — confirm the trend as adoption grows before shipping or rolling back.

+
+ +
+
Authenticator 6.2606.3817 — SAFE to proceed
+

6.2606.3817 is flat against 6.2605.3042 across every measured scenario: every success or completion delta is within ±1 pp except Passkey InApp registration (−1.61 pp on a low-volume 11.2 K) and No-QR MFA registration (+1.24 pp, an improvement), so there is no regression. Volumes on the new build are small but representative (5.69M devices). Stability is a clear win: the crash rate fell 6.49 → 0.86 crashes per 1k active devices (−87%), driven mostly by the near-elimination of a tampered/sideloaded-APK ClassNotFoundException (−1.65/1k alone). No genuine per-device crash regression exists. The broker hosted inside this app (16.2.0) shows no broad regression versus the broker in 6.2605 (16.1.0) at the device level, but a span drill-down finds per-request invalid_grant / interaction_required on the silent path running ~0.4 pp above baseline (mostly a decaying upgrade spike). That is a watch item, not a blocker. The app rollout itself is safe to proceed; monitor the silent path as the cohort matures.

+
+ + +

📊 Broker health — 16.1.0 vs 16.0.1

+
+
+
Device error rate
+
58.71%
+
+36.62 pp vs 16.0.1 (22.09%)
+
+
+
Silent reliability (dev)
+
93.63%
+
−3.66 pp vs 16.0.1 (97.28%)
+
+
+
Interactive reliability (dev)
+
69.73%
+
+16.18 pp vs 16.0.1 (53.55%)
+
+
+
Silent reliability (req)
+
44.97%
+
−42.14 pp vs 16.0.1 (87.11%)
+
+
+
p95 latency
+
3,117 ms
+
−1,595 ms (−33.8%) vs 16.0.1
+
+
+
Devices on 16.1.0
+
76.4 M
+
13.1% of 16.0.1 fleet (early rollout)
+
+
+
+ Reading the deltas: red = worse on the new release, green = better. "pp" = percentage points. Reliability = requests/devices without an expected error. Device error rate = devices hitting any non-success error_code ÷ total devices on that version (a broad denominator — use the error-movers table below to see which code drives it). +
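The arithmetic behind these tiles can be sketched as follows. This is a minimal illustration, assuming the rates are already expressed in percent; the function names are hypothetical and are not taken from compare-versions.js. A percentage-point delta subtracts two rates directly, while a relative change divides by the baseline.

```javascript
// Hypothetical helpers for illustration only; not part of the skill's scripts.

// Percentage-point delta between two rates that are already in percent.
function ppDelta(newPct, basePct) {
  return newPct - basePct;
}

// Relative change of a raw value (for example latency in ms), in percent of the baseline.
function relativeChangePct(newValue, baseValue) {
  if (baseValue === 0) throw new RangeError('baseline must be non-zero');
  return ((newValue - baseValue) / baseValue) * 100;
}

// Device error rate: devices hitting any non-success error_code / all devices on the version.
function deviceErrorRatePct(erroringDevices, totalDevices) {
  if (totalDevices <= 0) throw new RangeError('totalDevices must be positive');
  return (erroringDevices / totalDevices) * 100;
}

// Inputs copied from the KPI tiles above.
console.log('device error rate delta (pp):', ppDelta(58.71, 22.09).toFixed(2));
console.log('p95 latency change (%):', relativeChangePct(3117, 4712).toFixed(1));
```

Keeping the two functions separate avoids the common mistake of reporting a percentage-point gap as a percent change.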
+ + +

🔎 Broker error movers — device-share by error_code

+

Per error_code, the share of each version's devices that hit it. Sorted by the 16.1.0−16.0.1 delta — the codes "changing this release". Device share normalizes for the rollout size difference, so growth here is real, not just a bigger cohort.

+
+ + + + + + + + + + + + + + + + + + +
error_code | 16.0.1 dev-share | 16.1.0 dev-share | Δ pp | 16.1.0 devices
io_error | 41.79% | 82.09% | +40.30 | 36.8 M
invalid_grant | 25.17% | 31.98% | +6.82 | 14.3 M
timed_out_execution | 21.11% | 26.22% | +5.11 | 11.8 M
device_network_not_available_doze_mode | 0.62% | 4.16% | +3.55 | 1.9 M
no_tokens_found | 2.56% | 5.60% | +3.04 | 2.5 M
interaction_required | 1.08% | 2.64% | +1.57 | 1.2 M
unauthorized_client | 0.88% | 2.11% | +1.23 | 0.9 M
timed_out | 0.26% | 0.82% | +0.56 | 0.4 M
+
+
Why this matters: io_error alone accounts for the bulk of the headline device-error-rate gap. Drill into its error_message / span_name distribution on 16.1.0 next (filter ErrorStatsMetrics by broker_version == "16.1.0" and error_code == "io_error") to attribute it to a code path or a transient network class.
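The ranking used by the error-movers table can be sketched like this. It is a minimal example, assuming per-version device shares are already available as plain objects keyed by error_code; `rankMovers` is a hypothetical name, and the three sample rows are copied from the table above.

```javascript
// Hypothetical helper; the real skill may compute this differently.
// Takes two maps of error_code -> device share (percent) and returns rows
// sorted by the new-minus-baseline delta, largest riser first.
function rankMovers(baseShare, newShare) {
  const codes = new Set([...Object.keys(baseShare), ...Object.keys(newShare)]);
  return [...codes]
    .map((code) => {
      const base = baseShare[code] ?? 0; // code absent on a version counts as 0%
      const next = newShare[code] ?? 0;
      return { code, base, next, deltaPp: next - base };
    })
    .sort((a, b) => b.deltaPp - a.deltaPp);
}

// Shares copied from the table (16.0.1 vs 16.1.0), deliberately unsorted.
const movers = rankMovers(
  { timed_out: 0.26, io_error: 41.79, invalid_grant: 25.17 },
  { timed_out: 0.82, io_error: 82.09, invalid_grant: 31.98 },
);

for (const m of movers) {
  console.log(`${m.code}: ${m.base}% -> ${m.next}% (${m.deltaPp.toFixed(2)} pp)`);
}
```

Because the inputs are shares rather than raw device counts, a smaller rollout cohort does not shrink its own deltas.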
+ + +

⏱️ Broker latency — responseTime percentiles

+
+ + + + + + + + + + + + + + +
Percentile | 16.0.1 | 16.1.0 | Δ
P50 | 462 ms | 272 ms | −190 ms
P75 | 1,328 ms | 760 ms | −568 ms
P90 | 2,825 ms | 1,781 ms | −1,044 ms
P95 | 4,712 ms | 3,117 ms | −1,595 ms
P99 | 28,035 ms | 30,099 ms | +2,064 ms
+
+
Latency improved across the board through P95; only the P99 tail is worse on 16.1.0. Percentiles are merged via tdigest_merge — never summed.
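Why percentiles must be merged rather than summed or averaged can be shown with a small sketch. It assumes an exact nearest-rank percentile over pooled samples as a stand-in for a merged t-digest (Kusto's tdigest_merge works on approximate sketches, not raw samples); the shard data is invented for illustration.

```javascript
// Exact nearest-rank percentile; a stand-in for a t-digest estimate.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Invented latencies (ms): a large fast partition and a small slow one.
const shardA = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190];
const shardB = [2000, 9000];

// Correct: pool the distributions first, then take the percentile.
const pooledP50 = percentile([...shardA, ...shardB], 50);

// Wrong: combine per-shard percentiles arithmetically.
const averagedP50 = (percentile(shardA, 50) + percentile(shardB, 50)) / 2;
const summedP50 = percentile(shardA, 50) + percentile(shardB, 50);

console.log({ pooledP50, averagedP50, summedP50 });
```

The two-sample slow shard drags the averaged and summed figures far away from the pooled median, which is the failure the note above warns about.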
+ + +

🚚 Broker adoption — distinct devices by version

+
+ + + + + + + + + +
broker_version | Devices | Note
16.0.1 | 585.3 M | prior release (baseline)
16.1.0 | 76.4 M | rolling out
16.2.0 | 11.9 M | next train (canary)
14.2.0 | 9.9 M | long tail
15.0.0 | 2.8 M | long tail
+
+ + +

📊 Authenticator health — 6.2606.3817 vs 6.2605.3042

+
+
+
MFA PN completion
+
76.73%
+
+0.06 pp vs 6.2605 (76.67%)
+
+
+
MFA QR reg success
+
74.73%
+
−0.27 pp vs 6.2605 (75.00%)
+
+
+
PSI reg success
+
68.03%
+
−0.84 pp vs 6.2605 (68.87%)
+
+
+
Passkey WebAuthN reg
+
80.63%
+
−0.31 pp vs 6.2605 (80.94%)
+
+
+
MSA NGC reg success
+
20.15%
+
+0.20 pp vs 6.2605 (19.95%)
+
+
+
Devices on 6.2606.3817
+
5.69 M
+
newest train build
+
+
+ + +

🔎 Authenticator scenarios — per-scenario success / completion

+
+ + + + + + + + + + + + + + + + + + + + + +
Scenario | Metric | 6.2605.3042 | 6.2606.3817 | Δ pp | 6.2606 volume | Verdict
Entra MFA · PN + CheckForAuth | completion | 76.67% | 76.73% | +0.06 | 9.9 M | flat
Entra MFA · QR registration | success | 75.00% | 74.73% | −0.27 | 135.9 K | flat
Entra PSI · registration | success | 68.87% | 68.03% | −0.84 | 42.1 K | flat
Passkey · WebAuthN registration | success | 80.94% | 80.63% | −0.31 | 852 | low-volume
MSA NGC · registration | success | 19.95% | 20.15% | +0.20 | 99.2 K | flat
MSA SA · registration | success | 15.45% | 14.74% | −0.71 | 6.5 K | flat
Passkey · InApp registration | success | 75.11% | 73.50% | −1.61 | 11.2 K | low-volume
Passkey · WebAuthN sign-in | success | 94.87% | 94.50% | −0.37 | 181.4 K | flat
Entra MFA · No-QR registration | success | 47.39% | 48.63% | +1.24 | 64.9 K | flat
+
+
Low MSA registration success (≈15–20%) is a known steady-state characteristic of that scenario, not a 6.2606 regression — the point of a release report is the delta, which is flat. The No-QR MFA flow is similarly low-success by nature (manual entry). Verify volume before trusting a small-percentage move (e.g. the 852-initiate Passkey reg and 11.2K InApp rows are noise-prone).
+ +

Reacted-notification outcome split — Entra MFA. Of the pushes a user actually reacted to, the share they Approved / Denied / hit an Error on. A release that broke the approve path or spiked fraud-denials would surface here even when headline completion looks flat.

+
+ + + + + + + + + + + + +
App version | Reacted devices | Approved | Denied | Error
6.2605.3042 (prev) | 144.0 M | 97.12% | 1.24% | 1.64%
6.2606.3817 (new) | 6.68 M | 97.23% | 1.18% | 1.60%
+
+
Approve quality is stable: on 6.2606 the approval rate edged up (+0.11 pp) while both denials (−0.06 pp) and reaction errors (−0.04 pp) edged down — no sign the new build degraded the MFA approve path or triggered more fraud-denials. Counts are device-deduped reactions; 6.68 M on the new build is early-rollout volume.
+ + +

🐞 Authenticator stability — crashes per 1k active devices

+

Crash clusters from App Center Diagnostics (numerator), normalized by Kusto active devices per version (denominator). The headline is the per-device rate, not crash-share: a signature can take a bigger share of a much smaller crash pool while its real per-device rate drops — so share alone would invent regressions that don't exist.

+
+
+
Crash rate / 1k active dev
+
0.86
+
−87% vs 6.49 on 6.2605
+
+
+
Crashes (App Center · 14d)
+
10.8 K
+
over 12.62M active devices
+
+
+
Distinct crash signatures
+
683
+
App Center error groups
+
+
+
Genuine rate regressions
+
0
+
no signature ≥ 0.02/1k worse
+
+
+
Top driver of the drop
+
−1.65/1k
+
tampered-APK ClassNotFoundException
+
+
+
New long-tail clusters
+
344
+
all < 0.02/1k (early-window noise)
+
+
+
+ + + + + + + + + + + + + + + + + + + + +
Crashing frame | Exception | 6.2605 /1k | 6.2606 /1k | Δ /1k | 6.2606 share | Status
appcenter.crashes.Crashes.saveUncaughtException | RemoteException (SDK bucket) | 0.905 | 0.140 | −0.765 | 16.36% | improved
bastion.internal.ValidationCheckType$5.resetCache | ClassNotFoundException | 0.388 | 0.106 | −0.282 | 12.35% | improved
onlineid.internal.ApiRequest.sendResult | NotSerializableException | 0.649 | 0.073 | −0.576 | 8.54% | improved
com.c.b.b.bSS.loadClass (tampered APK) | ClassNotFoundException | 1.715 | 0.062 | −1.653 | 7.25% | improved
core.common.CoroutineTimer$startTimer$1 | OutOfMemoryError | 0.093 | 0.048 | −0.045 | 5.60% | improved
core.app.ActivityCompat$Api28Impl.requireViewById | IllegalArgumentException | 0.153 | 0.028 | −0.125 | 3.26% | improved
identity.common…WebViewUtil.getCookieManager | MissingWebViewPackageException | 0.088 | 0.025 | −0.063 | 2.89% | improved
core.app.JobIntentService…WrapperWorkItem.complete | IllegalArgumentException | 0.163 | 0.022 | −0.141 | 2.55% | improved
+
+
Reading this table: rows are the most-hit crash frames on the rolling-out build, ranked by 6.2606 per-1k rate. "Δ /1k" is the per-device rate change — green = fewer crashes per device. Share ≠ rate: ValidationCheckType takes a bigger share on 6.2606 (5.98% → 12.35%) yet its rate fell 73% — it is only a larger slice of a much smaller pie. The dominant legacy crash, an obfuscated ClassNotFoundException in a repackaged class (com.c.b.b.bSS.loadClass), is a tampered/sideloaded-APK signature, not a first-party bug — its near-disappearance (−1.65/1k) drives most of the overall improvement. Caveat: App Center crash uploads on an early-rollout build are still accumulating, so confirm the low rate holds as adoption grows; the 344 "new" long-tail clusters are each < 0.02/1k (one-window noise), not regressions.
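The "share ≠ rate" point can be checked with a short calculation. This is a sketch using the rounded figures printed in this section (10.8 K crashes, 12.62M active devices, and the ValidationCheckType row), so shares derived here differ slightly from the table's values, which presumably come from unrounded counts; the function names are hypothetical.

```javascript
// Hypothetical helpers for illustration only.

// Crashes per 1,000 active devices.
function crashRatePer1k(crashes, activeDevices) {
  if (activeDevices <= 0) throw new RangeError('activeDevices must be positive');
  return (crashes / activeDevices) * 1000;
}

// A signature's share of the crash pool, derived from per-1k rates.
function sharePct(signatureRate, totalRate) {
  return (signatureRate / totalRate) * 100;
}

// Headline rate on 6.2606 from the KPI tiles (rounded inputs).
const headline = crashRatePer1k(10800, 12.62e6);

// ValidationCheckType: per-1k rate and overall rate on each build.
const shareOld = sharePct(0.388, 6.49);
const shareNew = sharePct(0.106, 0.86);
const rateChangePct = ((0.106 - 0.388) / 0.388) * 100;

console.log('headline /1k:', headline.toFixed(2));
console.log('share old vs new (%):', shareOld.toFixed(2), shareNew.toFixed(2));
console.log('per-device rate change (%):', rateChangePct.toFixed(1));
```

The share roughly doubles while the per-device rate falls by about three quarters, which is why the report ranks and judges signatures by rate.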
+ + +
+
Crash attribution — new dagger.hilt EntryPoints.get on MfaAuthDialogActivity
+
+
Originator
+
Authenticator A first-party regression. The crash frame names MfaAuthDialogActivity, but the Activity is the victim — it was never edited. The culprit is a new caller in a different file (OathSecretEncryptionUseCase) that hands that Activity's context to Hilt's EntryPoints.get. Confirmed in-range release regression, not an OS/shrinker interaction.
+
+
+
Mechanism
+
dagger.hilt.EntryPoints.get:62 throws IllegalStateException — "component holder class MfaAuthDialogActivity does not implement interface dagger.hilt.internal.GeneratedComponent." The new OathSecretEncryptionUseCase resolves its ECS dependency via EntryPoints.get(applicationContext, SecureTotpEcsDependency::class.java). That applicationContext is bound from the legacy Dagger ContextModule.provideContext(), which returned the context the component was built with — and the MFA dialog fragments build it with requireContext() = the MfaAuthDialogActivity. So EntryPoints.get runs against an Activity, which is not a Hilt generated-component holder → crash. Genuinely-new this release (0 on the baseline), 66.7% Android 16 incidentally (the TOTP path is exercised more there), broad across device models.
+
+
+
Release range
+
authenticator 6.2606.3817..6.2606.4029 (34 commits), searched with find-suspect-prs.ps1 -Repos authenticator -Symbol 'EntryPoints.get' -DiffGrep 'EntryPoints|GeneratedComponent'. The exception-token pickaxe + -DiffGrep diff-text scan surface the culprit immediately — note a -Symbol MfaAuthDialogActivity search (the crashing class) finds nothing, because the Activity is only passed into the failing API by a caller in another file under an unrelated PR title.
+
+
+
Likely PRs
+
+
+
+ high +
+ authenticator PR 15896454 + [MSRC] [110950] - TOTP Secret Fix - Phase 1 +
7d3da30b13 · ADO · Cesar Acosta
+
Added OathSecretEncryptionUseCase, which calls EntryPoints.get(applicationContext, …) on a context that the legacy ContextModule resolves to the dialog Activity, not the application — the exact EntryPoints.get + GeneratedComponent frame in the stack. Its subject ("TOTP Secret Fix") never mentions Hilt/DI, which is why -DiffGrep (not --grep) was required to find it.
+
+
+
+ fix +
+ authenticator PR 16249408 + [Cherry-pick] Fix MFA auth dialog Hilt crash on saved-state restore (AB#3677526) +
023aec8abd · ADO · Melissa Ahn
+
The fix: normalizes ContextModule.provideContext() to = context.applicationContext, so EntryPoints.get always receives the process-wide application context even when callers (the MFA dialog fragments) build the component with an Activity context. Confirms the mechanism above.
+
+
+
+
+
+
+
Next step
+
Ship/verify the PR 16249408 fix (ContextModule.provideContext() = context.applicationContext, AB#3677526) in the next release and confirm the signature drops to 0. Add a regression test that builds the MFA-dialog component with an Activity context and asserts EntryPoints.get still resolves. Audit other legacy ContextModule consumers that pass requireContext() into Hilt. Low fleet rate (≈0.56/1k) and a fix in flight → watch, not a HOLD.
+
+
+ + + +

🛡️ Broker health · Authenticator-hosted — is the app release moving broker errors?

+

The Broker runs inside the Authenticator process. The Broker health section above pools every host (Link to Windows, Company Portal, Authenticator) and compares broker_version. This section isolates active_broker_package_name == "com.azure.authenticator" and compares by Authenticator app version (AppInfo_Version) — so a broker regression caused by this app rollout shows up here even when the fleet-wide broker numbers are dominated by other hosts. The new app 6.2606.3817 bundles broker 16.2.0; the previous 6.2605.3042 bundled 16.1.0.

+
+
+
Device error rate (Auth-hosted)
+
47.19%
+
−16.12 pp vs 63.31% on 6.2605
+
+
+
Silent reliability (req)
+
47.92%
+
+3.16 pp vs 44.76%
+
+
+
Silent reliability (dev)
+
87.20%
+
−6.81 pp vs 94.00%
+
+
+
Interactive reliability (dev)
+
65.90%
+
−3.57 pp vs 69.47%
+
+
+
Bundled broker
+
16.2.0
+
vs 16.1.0 on the 6.2605 app
+
+
+
Auth-hosted devices
+
12.40 M
+
18.4% of the 6.2605 cohort (early rollout)
+
+
+
Mixed, but not regressing: the broad device-error rate fell sharply and request-level silent reliability rose, while device-level reliability dipped a few points — the signature of an early-rollout cohort (upgrade + network churn). The device-share movers below look clean, but device-share dedups each device across all spans and hides per-span request spikes — the span drill-down further down catches two codes climbing on the silent path.
+
+ + + + + + + + + + + + + + + + + + +
error_code | 6.2605 dev-share | 6.2606 dev-share | Δ pp | 6.2606 devices
io_error | 82.48% | 80.05% | −2.43 | 4.68 M
invalid_grant | 32.47% | 22.35% | −10.12 | 1.31 M
no_account_found | 21.07% | 17.82% | −3.25 | 1.04 M
timed_out_execution | 26.99% | 12.78% | −14.21 | 0.75 M
no_tokens_found | 5.74% | 4.64% | −1.11 | 0.27 M
device_network_not_available_doze_mode | 4.16% | 2.80% | −1.36 | 0.16 M
interaction_required | 2.62% | 1.72% | −0.90 | 0.10 M
multiple_apps_listening_url_scheme (only device-share riser) | 0.00% | 0.04% | +0.04 | 2.2 K
+
+

Span drill-down — where the silent errors actually moved

+

The device-share table above dedups each device across all spans, so a code can climb on one hot span while its fleet-wide device-share falls. Slicing the same active_broker_package_name == "com.azure.authenticator" error stream by span_name at the request level (errored requests ÷ total requests in that span) tells a different story for two codes that looked flat-to-down above:

+
+ + + + + + + + + + + + + + + + + +
error_code | span | 6.2605 req-rate | 6.2606 req-rate | Δ pp | 6.2606 err req
invalid_grant | ATIInteractively | 27.51% | 36.60% | +9.09 | 40 K
invalid_grant | ATISilently | 70.47% | 72.47% | +2.01 | 204 K
invalid_grant | AcquireTokenSilent | 9.35% | 10.53% | +1.19 | 21.3 M
interaction_required | MSAL_PerformIpcStrategy | 11.69% | 15.07% | +3.38 | 2.6 K
interaction_required | ATISilently | 6.03% | 6.77% | +0.74 | 19 K
interaction_required | AcquireTokenSilent | 0.39% | 0.48% | +0.09 | 969 K
+
+
Highest-volume mover: invalid_grant on the AcquireTokenSilent path rose 9.35% → 10.53% (+1.19 pp) across 21.3 M errored silent requests on the new app. The device-share table hid this because the dominant silent path is mostly io_error / no_account_found and the early cohort dedups "clean." ATIInteractively (+9.09 pp) and the IPC-strategy span (+3.38 pp) move hardest in rate but on small volumes (40 K / 2.6 K req).
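How a device-deduped share can stay flat while a per-span request rate climbs is easiest to see on toy data. The sketch below uses invented request records (device, span, error) and hypothetical function names; it is not the query the skill runs, only the two aggregation shapes side by side.

```javascript
// Device share: distinct devices that ever hit the code / distinct devices seen.
function deviceSharePct(requests, code) {
  const all = new Set(requests.map((r) => r.device));
  const hit = new Set(requests.filter((r) => r.error === code).map((r) => r.device));
  return (hit.size / all.size) * 100;
}

// Request rate per span: errored requests / total requests within each span.
function spanRequestRatePct(requests, code) {
  const bySpan = new Map();
  for (const r of requests) {
    const s = bySpan.get(r.span) ?? { total: 0, errored: 0 };
    s.total += 1;
    if (r.error === code) s.errored += 1;
    bySpan.set(r.span, s);
  }
  return new Map([...bySpan].map(([span, s]) => [span, (s.errored / s.total) * 100]));
}

// Invented data: build `errors` failing and `oks` clean requests for one device.
const mk = (device, span, errors, oks) => [
  ...Array.from({ length: errors }, () => ({ device, span, error: 'invalid_grant' })),
  ...Array.from({ length: oks }, () => ({ device, span, error: null })),
];
const baseline = [...mk('d1', 'AcquireTokenSilent', 1, 3), ...mk('d2', 'AcquireTokenSilent', 0, 4)];
const rollout = [...mk('d1', 'AcquireTokenSilent', 3, 1), ...mk('d2', 'AcquireTokenSilent', 0, 4)];

for (const [name, data] of [['baseline', baseline], ['rollout', rollout]]) {
  const rate = spanRequestRatePct(data, 'invalid_grant').get('AcquireTokenSilent');
  console.log(name, 'device share %:', deviceSharePct(data, 'invalid_grant'), 'span req-rate %:', rate);
}
```

In both versions one of two devices is affected, so the device share is identical, yet the affected device fails three times as often in the rollout data; only the request-level slice exposes that.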
+
Mostly early-rollout churn — but not entirely. Daily, invalid_grant in AcquireTokenSilent on 6.2606 started at 24.0% (Jun 09, first updaters) and decayed — 19.3% → 16.1% → 12.1% → … → 9.4% (Jun 21) — as the cohort re-bootstrapped credentials, while 6.2605 held steady at ~9.0%. So the 14-day +1.19 pp is inflated by the upgrade spike, yet 6.2606 still sits ~0.4 pp above baseline every day and has not fully converged. interaction_required tracks the same curve (0.98% → 0.45% vs ~0.39% baseline).
+
Verdict — device-share flat, but watch the silent path. Aggregate device-error rate fell (−16.12 pp) and no dominant code grew at the device level, so there is no broad broker regression from this app rollout. However, per-request invalid_grant / interaction_required on the silent path (AcquireTokenSilent / ATISilently) run consistently above baseline — mostly a decaying re-auth spike, with a ~0.4 pp residual that has not yet converged. Treat as watch, not clear: re-check after the cohort matures, and see the attribution below for the most likely trigger. Cross-check the fleet view: the all-hosts io_error +40 pp spike is driven by other hosts (Link to Windows ≈122 M devices), not Authenticator — Auth-hosted io_error actually fell (−2.43 pp).
+
+
Code attribution — silent invalid_grant / interaction_required
+
+
Originator
+
eSTS Both are server-returned OAuth errors from the token service — a broker/common change is the trigger (it makes the device present a credential eSTS then rejects), not the source of the error string.
+
+
+
Mechanism
+
The silent flow reads the device-bound RT / PRT from the in-memory token cache. The new filter-then-clone cache path (flight ENABLE_FILTER_THEN_CLONE_IN_MEMORY_CACHE) changes load() / getIdTokensForAccountRecord() to skip the clone-all preload and clone only the credentials matching a filter — a filtering miss returns a stale or empty credential set, so the silent refresh presents the wrong (or no) RT and eSTS rejects it with invalid_grant; the broker then surfaces interaction_required to force re-auth. Flight rollout + cache warm-up explains the observed partial, decaying signal, and the in-release #3110 fix (crash when the flight is on without an in-memory cache) confirms the path was actively buggy this release.
+
+
+
Release range
+
broker v16.1.0..v16.2.0 (15 commits) + the common range it pins via submodule pointer 57b9503..e8011745 (25 commits), read in full and pickaxed (find-suspect-prs.ps1 -Range) on the token-cache / PRT path. Note: common's own v16.x tags are an unrelated 2023 namespace — the range must be taken from the broker submodule pointer, not the common tag.
+
+
+
Likely PRs
+
+
+
+ med-high +
+ common#3100 + filter-then-clone cache series (with #3110, #3114) +
dabefab61 / 184019b87 / 3e6df21e5 · @siddhijain · AB#3588022, AB#3601828, AB#3590267
+
#3100 rewrites load() + getIdTokensForAccountRecord() in MsalOAuth2TokenCache.java (+284 lines) behind the ENABLE_FILTER_THEN_CLONE_IN_MEMORY_CACHE flight; #3110 fixes a ClassCastException when the flight is on without an in-memory cache; #3114 extends it to deleteAccessTokensWithIntersectingScopes. Directly rewrites the silent-path cache read/delete and is flight-gated — best causal fit for a partial, decaying invalid_grant rise on AcquireTokenSilent.
+
+
+
+ low-med +
+ broker#176 + supportsBoundService default → true +
68126018c · Pedro Romero Vargas · AB#3571860
+
Flips supportsBoundService from false to true in DeviceRegistrationClientApplication, changing IPC-strategy selection toward the bound-service path — a plausible driver of the interaction_required rise on the MSAL_PerformIpcStrategy span (+3.38 pp). Needs eSTS-side confirmation.
+
+
+
+ excluded +
+ broker#174 + null object get cache record + 2023 common#2222 false-positive +
9c9906686 · @siddhijain · AB#3607178
+
broker#174 touches the silent cache path but addresses a different code (null_object), reduces errors, and is gated to SdkType.MSAL_CPP (OneAuth) — not the Authenticator MSAL path. Note: the 2023 common#2222 (PoP re-key) is not in this release — it surfaced earlier only because the range tool resolved common's own v16.x tag namespace instead of the submodule pointer.
+
+
+
+
+
+
+
Next step
+
Pull eSTS correlation IDs for 6.2606 devices hitting invalid_grant in AcquireTokenSilent and split by the ENABLE_FILTER_THEN_CLONE_IN_MEMORY_CACHE flight cohort (flighted vs not). If the rise concentrates in the flighted cohort, #3100/#3110 is confirmed — hold or roll back the flight and fast-track the #3110 fix fleet-wide. Separately bisect broker#176 for the MSAL_PerformIpcStrategy interaction_required rise.
+
+
+ + + +

Appendix

+
+ Data sources +
+

Broker — cluster https://idsharedeus2.kusto.windows.net, db ad-accounts-android-otel. Version dimension broker_version. MVs: SilentAuthStats*Metrics, InteractiveAuthStats*Metrics, ErrorStatsMetrics, BrokerAdoptionStatsUpdated, PerfStatsUpdated.

+

Authenticator — cluster https://idsharedeus2.eastus2.kusto.windows.net, db d496be22d62a46b0a3cf67ea2e736fd8. Version dimension AppVersion. Scenario MVs per assets/queries/README.md.

+

Broker hosted by Authenticator — same Broker cluster/db as above, but the Broker MVs (ErrorStatsMetrics, *AuthStats*Metrics, BrokerAdoptionStatsUpdated) also carry active_broker_package_name (the host app — com.azure.authenticator) and AppInfo_Version (which, for that package, is the Authenticator app version). Filtering on both isolates the broker as it runs inside a specific Authenticator release. Queries: broker-by-host-app.kql, broker-top-errors-by-host-app.kql, broker-errors-by-host-app-span.kql (span drill-down). PR correlation across the broker version range via assets/scripts/find-suspect-prs.ps1.

+

Authenticator crashes — App Center Diagnostics (errors/errorGroups), app authapp-t7qc / Microsoft-Authenticator-Android-Prod-App-Center, pulled by assets/scripts/fetch-appcenter-crashes.js. Rate denominator (active devices per version) from Kusto via authenticator-crash-denominator.kql. App Center Analytics (native crash-free %) is retired, so the rate is computed as App Center numerator ÷ Kusto denominator. Setup & caveats: assets/docs/crash-sources.md. Play Console: not yet wired (Phase 2; needs a gated GCP service account).

+
+
+
+ Queries used +
+

All under assets/queries/: broker-error-rate-by-version.kql, broker-reliability-by-version.kql, broker-top-errors-by-version.kql, broker-latency-by-version.kql, broker-adoption.kql, broker-by-host-app.kql, broker-top-errors-by-host-app.kql, broker-errors-by-host-app-span.kql, auth-scenario-success-rate.kql, auth-pn-checkforauth-completion.kql, auth-reacted-notification-split.kql, auth-scenario-initiates.kql, auth-version-resolve.kql, auth-stats.kql, authenticator-crash-denominator.kql. Crash clusters via assets/scripts/fetch-appcenter-crashes.js (App Center).

+
+
+
+ Methodology & caveats +
+
    +
  • Distinct devices use dcount_hll(hll_merge(countDevicesHll)) — sketches are merged, never summed.
  • +
  • Device error rate counts a device as errored on any non-success error_code (a deliberately broad definition); the error-movers table attributes the rate to specific codes.
  • +
  • Host-app attribution: to tell whether the Authenticator rollout itself moved broker errors, scope Broker MVs to active_broker_package_name == "com.azure.authenticator" and compare by AppInfo_Version (the app version for that host). Fleet-wide broker_version deltas can be dominated by other hosts (Link to Windows ≈122 M devices) and must not be attributed to Authenticator.
  • +
  • Device-share masks per-span spikes: the host-app error table is a device-share (devices hitting code X anywhere ÷ devices on that version), which dedups a device across all spans. A code can look flat-to-down there while its per-request rate climbs inside one span. When a code is suspected of a span-local rise, re-slice by span_name at the request level with broker-errors-by-host-app-span.kql.
  • +
  • Release PR correlation: for server-returned codes (invalid_grant, interaction_required — Originator = eSTS) the trigger is a broker/common change in the bundled broker version range. Read git log v<PREV>..v<NEW> in broker/ + common/ in full (ranges are small), then pickaxe with find-suspect-prs.ps1; weight device-PoP / PRT / cache-path changes for silent-auth credential rejections.
  • +
  • Early-rollout bias: a version with far fewer devices is skewed toward upgrade/network churn — treat single-window regressions as provisional until adoption grows.
  • +
  • Volume guard: percentage moves on fewer than ~1K initiates are treated as noise (flagged "low-volume").
  • +
  • Latency percentiles are merged via tdigest_merge, never summed.
  • +
  • Crash rate = App Center crash count ÷ Kusto active devices, per version (per 1k). Lead with this rate, not crash-share: a signature can grow its share of a shrinking crash pool while its per-device rate falls. App Center's own crash-free metric is retired; numerator coverage on early-rollout builds can lag — see assets/docs/crash-sources.md.
  • +
+
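The crash-rate caveat above can be checked with a small worked example. This is a minimal sketch, not the logic in fetch-appcenter-crashes.js or compare-versions.js: the helper names crashRatePer1k and crashShare are hypothetical, and the device and crash counts are made-up illustration values, not report data. It shows a signature whose share of the crash pool doubles while its per-device rate falls.

```javascript
// Hypothetical helper: crashes per 1k active devices for one version.
// crashCount would come from App Center, activeDevices from the Kusto denominator.
function crashRatePer1k(crashCount, activeDevices) {
  if (!Number.isFinite(crashCount) || !Number.isFinite(activeDevices) || activeDevices <= 0) {
    return null; // no usable denominator, so no rate
  }
  return (crashCount * 1000) / activeDevices;
}

// Hypothetical helper: one signature's share of a version's total crash pool.
function crashShare(signatureCount, totalCrashes) {
  return totalCrashes > 0 ? signatureCount / totalCrashes : null;
}

// Made-up numbers: the overall crash pool shrinks between versions.
const prev = { devices: 100000, totalCrashes: 500, signature: 100 };
const next = { devices: 100000, totalCrashes: 200, signature: 80 };

const summary = {
  prevShare: crashShare(prev.signature, prev.totalCrashes),
  nextShare: crashShare(next.signature, next.totalCrashes),
  prevRate: crashRatePer1k(prev.signature, prev.devices),
  nextRate: crashRatePer1k(next.signature, next.devices),
};

// Share rises while the per-1k rate falls, which is why the rate leads.
console.log(summary);
```

Under these assumed inputs the signature looks worse by share and better by rate, so a report that led with share would flag a regression that the per-device rate does not support.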
+
+ +
+ + + + +