Recover stuck SCEP managed-cert state via matcher extension#44691
Recover stuck SCEP managed-cert state via matcher extension#44691mostlikelee wants to merge 13 commits intomainfrom
Conversation
This reverts commit 96569a9.
Reapplies the three independent improvements from #44250 (reverted via #44535) and adds an ingest-side backfill that catches the actual silent-fail mechanism (missed toInsert matcher) without breaking the natural in-flight synchronization between reconcile and the renewal cron. - Bump OneTimeChallengeTTL 1h → 7d so renewals don't fail with "challenge not found" for offline devices that pick up the InstallProfile push days later. - Restrict the renewal cron to settled delivery states ('verified', 'failed') to avoid re-firing renewal while a previous delivery is still in flight. - Gate the new 'failed' branch on a 24h backoff so permanent render-time failures (CA deleted, missing IDP variables) don't loop hourly. - Add backfillHostMDMManagedCertsFromHostCertsDB: when the toInsert matcher in UpdateHostCertificates misses a renewed cert (replica lag, transaction race, verified-without-actual-renewal), look up a matching cert in host_certificates by the 'fleet-<profile_uuid>' substring and populate hmmc. Gated by a 4h grace on hmmc.updated_at so it doesn't clobber the in-flight blank-out, and a monotonic-forward predicate so it's idempotent. Does NOT reintroduce the COALESCE-preserve in BulkUpsertMDMManagedCertificates or the iOS-only park-at-'verifying' carve-out from #44250 — those broke the natural cron synchronization gate (reconcile NULLs hmmc → cron's HAVING IS NOT NULL excludes the row until ingest repopulates). Resolves #44111
…orenew # Conflicts: # server/datastore/mysql/mdm.go
…/fleet into 44111-scep-autorenew-fix
Implements OpenSpec change extend-scep-cert-matcher. Drops the standalone backfillHostMDMManagedCertsFromHostCertsDB function and the call site after withRetryTxx in UpdateHostCertificates. The same recovery work now happens inside the existing toInsert matcher with no new database queries: ListHostMDMManagedCertificates is replaced with a single SELECT that joins to the per-platform profile tables to also return the delivery status, and the matcher uses two cert pools selected per hmmc row. When an hmmc row is stuck (not_valid_after IS NULL, updated_at older than hmmcBackfillGrace, AND the related profile is in a settled 'verified'/'failed' state), the matcher widens its search from toInsertBySHA1 to incomingBySHA1 — giving certs that landed in existingBySHA1 from a prior call (replica lag, race, missed earlier match) a second chance to update hmmc. Steady-state and in-flight rows still see only toInsertBySHA1, preserving the original "react to NEW certs" semantics so a pre-renewal cert still in host_certificates can't clobber the in-flight blank-out. Also switches "first match wins" to "best match wins" (latest not_valid_before among currently-valid candidates) and adds a monotonic-forward predicate so a stale cert can't regress hmmc. MDMManagedCertificate gains an UpdatedAt field and ListHostMDMManagedCertificates loads it; no callers depended on the prior column set. Resolves #44111
# Conflicts: # changes/44111-scep-autorenew-fail # server/datastore/mysql/mdm.go # server/fleet/mdm.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #44691 +/- ##
==========================================
+ Coverage 66.68% 66.79% +0.11%
==========================================
Files 2651 2641 -10
Lines 213567 212959 -608
Branches 9767 9401 -366
==========================================
- Hits 142411 142242 -169
+ Misses 58186 57756 -430
+ Partials 12970 12961 -9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughRecover stuck host_mdm_managed_certificates rows after missed SCEP renewals by adding an isSettledStatus helper and hmmcBackfillGrace constant, changing UpdateHostCertificates to build incoming and toInsert maps keyed by normalized SHA1, join managed-certificate rows with profile delivery statuses, compute per-row settled/stuck state, choose candidate cert pools accordingly, filter candidates by renewal identifier and validity window, select the best candidate by latest NotValidBefore, enforce monotonic-forward updates, and only persist changes when fields differ. Added UpdatedAt to MDMManagedCertificate and included updated_at in the ListHostMDMManagedCertificates projection. Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/datastore/mysql/host_certificates.go`:
- Around line 138-139: The stuck-row recovery logic must not be gated on
toInsert; move the widened-matcher/recovery code out of the if len(toInsert) > 0
branch so it always runs, and make it operate solely using existingBySHA1 (and
the same widened matching logic) to detect renewed certs already present and
append them to hostMDMManagedCertsToUpdate as before; ensure the subsequent
update/DB calls that consume hostMDMManagedCertsToUpdate still run when that
slice is non-empty, and remove any dependence on toInsert length when deciding
to execute the widened matcher.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8a3db9b0-059b-4d4a-9a56-f9d23218ef93
⛔ Files ignored due to path filters (4)
openspec/changes/extend-scep-cert-matcher/design.mdis excluded by!**/*.mdopenspec/changes/extend-scep-cert-matcher/proposal.mdis excluded by!**/*.mdopenspec/changes/extend-scep-cert-matcher/specs/mdm-cert-state-sync/spec.mdis excluded by!**/*.mdopenspec/changes/extend-scep-cert-matcher/tasks.mdis excluded by!**/*.md
📒 Files selected for processing (6)
changes/44111-scep-autorenew-failopenspec/changes/extend-scep-cert-matcher/.openspec.yamlserver/datastore/mysql/host_certificates.goserver/datastore/mysql/host_certificates_test.goserver/datastore/mysql/mdm.goserver/fleet/apple_mdm.go
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
server/datastore/mysql/host_certificates_test.go (1)
500-508: 💤 Low valueMinor: the "matcher's gate" comment is now stale.
This PR removes the
len(toInsert) > 0gate, so the unrelated cert is no longer needed to make the matcher fire —StableCertListRecoversproves recovery runs with an emptytoInsertBySHA1. The helper is still useful for keeping the non-stuck subtests on the narrow-pool branch (pool = toInsertBySHA1), but the parenthetical "(the matcher's gate)" describes pre-PR behavior.📝 Suggested wording tweak
- // Adds an unrelated cert to populate toInsert (the matcher's gate). + // Adds an unrelated cert so toInsertBySHA1 is non-empty, exercising the + // narrow-pool branch for non-stuck rows (distinct from StableCertListRecovers, + // which exercises the empty-toInsert recovery path).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/datastore/mysql/host_certificates_test.go` around lines 500 - 508, Update the inline comment in triggerMatcher to remove the now-stale parenthetical "(the matcher's gate)" and reword it to reflect current behavior: note that the unrelated cert is no longer required to make the matcher run (the len(toInsert) > 0 gate was removed) but the helper still serves to keep non-stuck subtests on the narrow-pool branch (pool = toInsertBySHA1); reference triggerMatcher, StableCertListRecovers, toInsertBySHA1 and toInsert in the comment so future readers understand why the unrelated cert is still being added.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@server/datastore/mysql/host_certificates_test.go`:
- Around line 500-508: Update the inline comment in triggerMatcher to remove the
now-stale parenthetical "(the matcher's gate)" and reword it to reflect current
behavior: note that the unrelated cert is no longer required to make the matcher
run (the len(toInsert) > 0 gate was removed) but the helper still serves to keep
non-stuck subtests on the narrow-pool branch (pool = toInsertBySHA1); reference
triggerMatcher, StableCertListRecovers, toInsertBySHA1 and toInsert in the
comment so future readers understand why the unrelated cert is still being
added.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5298cd37-83ff-4664-aeb1-5d3fac126629
📒 Files selected for processing (2)
server/datastore/mysql/host_certificates.goserver/datastore/mysql/host_certificates_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- server/datastore/mysql/host_certificates.go
There was a problem hiding this comment.
Pull request overview
Extends the SCEP managed-certificate matcher in UpdateHostCertificates to recover host_mdm_managed_certificates rows that can get stuck with NULL validity/serial after a renewal cert was ingested but never linked, preventing the renewal cron from ever retrying.
Changes:
- Always run the managed-cert matcher on
UpdateHostCertificates, using a “stuck vs in-flight” heuristic to decide whether to match against the full incoming inventory vs only newly-inserted certs. - Add
updated_attofleet.MDMManagedCertificateand to the managed-certs listing query to support the grace-window check. - Add datastore tests covering recovery, grace-window protection, monotonic-forward behavior, and DigiCert/pending-profile exclusions.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| server/fleet/apple_mdm.go | Adds UpdatedAt to MDMManagedCertificate to support stuck-row detection. |
| server/datastore/mysql/mdm.go | Includes updated_at in ListHostMDMManagedCertificates SELECT. |
| server/datastore/mysql/host_certificates.go | Implements the widened matcher logic (stuck vs in-flight pools) and monotonic-forward update rule. |
| server/datastore/mysql/host_certificates_test.go | Adds focused coverage for stuck-row recovery and safety gates. |
| openspec/changes/extend-scep-cert-matcher/tasks.md | Adds task checklist (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/specs/mdm-cert-state-sync/spec.md | Adds requirements spec (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/proposal.md | Adds proposal rationale (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/design.md | Adds design notes (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/.openspec.yaml | OpenSpec metadata file. |
| changes/44111-scep-autorenew-fail | Release note entry for the renewal recovery fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
Resolves #44111.
The matcher inside
UpdateHostCertificates(server/datastore/mysql/host_certificates.go) only fires for newly-inserted certs. When a renewed SCEP cert is reported but the matcher misses linking it tohost_mdm_managed_certificates(replica lag, race, or the cert landing inexistingBySHA1instead oftoInsert), the row stays NULL. The renewal cron'sHAVING validity_period IS NOT NULLlock then permanently excludes it; only an admin re-push recovers it.Change
Run the matcher on every
UpdateHostCertificatescall (no longer gated onlen(toInsert) > 0). Perhmmcrow, pick the cert pool:not_valid_*NULL ANDupdated_atolder thanhmmcBackfillGrace = 4hAND profile in a settled'verified'/'failed'state) → search the full incoming inventory; recovers rows that earlier calls missed.toInsert; matches today's "react to new certs" semantics so an in-flight renewal can't be clobbered by the pre-renewal cert still present inhost_certificates.Per pool, picks the freshest currently-valid match and applies a monotonic-forward predicate so a stale cert can't regress fresh
hmmcdata.Cost
One additional SELECT per
UpdateHostCertificatescall:ListHostMDMManagedCertificatesis replaced by a query that alsoLEFT JOINs tohost_mdm_apple_profilesandhost_mdm_windows_profilesfor delivery status. The query is host-uuid-keyed against indexed PKs.I'll load test this before merging.
OpenSpec
OpenSpec files are here purely for PR reference. I plan on removing them after review.
Summary by CodeRabbit