Recover stuck SCEP managed-cert state via matcher extension #44691

Open
mostlikelee wants to merge 13 commits into main from 44111-scep-autorenew-fix

@mostlikelee mostlikelee commented May 4, 2026

Summary

Resolves #44111.

The matcher inside UpdateHostCertificates (server/datastore/mysql/host_certificates.go) only fires for newly inserted certs. When a renewed SCEP cert is reported but the matcher fails to link it to host_mdm_managed_certificates (replica lag, a race, or the cert landing in existingBySHA1 instead of toInsert), the row's validity fields stay NULL. The renewal cron's HAVING validity_period IS NOT NULL gate then permanently excludes the row; only an admin re-push recovers it.

Change

Run the matcher on every UpdateHostCertificates call (no longer gated on len(toInsert) > 0). Per hmmc row, pick the cert pool:

  • Stuck (not_valid_* NULL AND updated_at older than hmmcBackfillGrace = 4h AND profile in a settled 'verified'/'failed' state) → search the full incoming inventory; recovers rows that earlier calls missed.
  • Otherwise → search only toInsert; matches today's "react to new certs" semantics so an in-flight renewal can't be clobbered by the pre-renewal cert still present in host_certificates.

Per pool, picks the freshest currently-valid match and applies a monotonic-forward predicate so a stale cert can't regress fresh hmmc data.
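
The per-row pool choice described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `managedCertRow`, `certPool`, `isStuck`, `choosePool`, and `demoNow` are hypothetical names, only `NotValidAfter` is modeled for the NULL check, and "older than" is assumed to be a strict comparison. `hmmcBackfillGrace` and `isSettledStatus` reuse names that appear in this PR, but their bodies here are guesses.

```go
package main

import (
	"fmt"
	"time"
)

// hmmcBackfillGrace mirrors the 4h grace named in the PR description.
const hmmcBackfillGrace = 4 * time.Hour

// managedCertRow is a hypothetical, trimmed-down view of one hmmc row joined
// with its profile delivery status.
type managedCertRow struct {
	NotValidAfter *time.Time // nil models a NULL validity column
	UpdatedAt     time.Time
	ProfileStatus string
}

// certPool maps a SHA1 key to a placeholder cert description (assumption).
type certPool map[string]string

// isSettledStatus reports whether delivery reached a settled state.
func isSettledStatus(status string) bool {
	return status == "verified" || status == "failed"
}

// isStuck applies the three conditions from the description: NULL validity,
// updated_at older than the grace, and a settled profile.
func isStuck(row managedCertRow, now time.Time) bool {
	return row.NotValidAfter == nil &&
		now.Sub(row.UpdatedAt) > hmmcBackfillGrace &&
		isSettledStatus(row.ProfileStatus)
}

// choosePool returns the full incoming inventory for stuck rows and only the
// newly inserted certs otherwise.
func choosePool(row managedCertRow, now time.Time, incoming, toInsert certPool) certPool {
	if isStuck(row, now) {
		return incoming
	}
	return toInsert
}

// demoNow is a fixed clock so the example is deterministic.
var demoNow = time.Date(2026, 5, 5, 12, 0, 0, 0, time.UTC)

func main() {
	incoming := certPool{"AA": "renewed cert", "BB": "unrelated cert"}
	toInsert := certPool{"BB": "unrelated cert"}

	stuck := managedCertRow{UpdatedAt: demoNow.Add(-5 * time.Hour), ProfileStatus: "verified"}
	inFlight := managedCertRow{UpdatedAt: demoNow.Add(-5 * time.Minute), ProfileStatus: "pending"}

	fmt.Println(len(choosePool(stuck, demoNow, incoming, toInsert)))    // 2 (full inventory)
	fmt.Println(len(choosePool(inFlight, demoNow, incoming, toInsert))) // 1 (toInsert only)
}
```

Keeping the narrow pool as the default means a row only sees the wider inventory once all three conditions hold, which is what protects an in-flight renewal from the pre-renewal cert.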

Cost

One additional SELECT per UpdateHostCertificates call: ListHostMDMManagedCertificates is replaced by a query that also LEFT JOINs to host_mdm_apple_profiles and host_mdm_windows_profiles for delivery status. The query is host-uuid-keyed against indexed PKs.

I'll load test this before merging.

OpenSpec

OpenSpec files are here purely for PR reference. I plan on removing them after review.

Summary by CodeRabbit

  • Bug Fixes
    • Recovered SCEP/MDM managed-certificate rows that became stuck when renewals were not linked, restoring correct cert validity and serial data.
    • Improved status tracking to distinguish active vs stuck renewals and respect grace-period boundaries.
    • Prevented regressions by ensuring certificate updates only move validity forward and by skipping backfill for in-flight/pending profiles.

Reapplies the three independent improvements from #44250 (reverted via #44535)
and adds an ingest-side backfill that catches the actual silent-fail mechanism
(missed toInsert matcher) without breaking the natural in-flight
synchronization between reconcile and the renewal cron.

- Bump OneTimeChallengeTTL 1h → 7d so renewals don't fail with "challenge not
  found" for offline devices that pick up the InstallProfile push days later.
- Restrict the renewal cron to settled delivery states ('verified', 'failed')
  to avoid re-firing renewal while a previous delivery is still in flight.
- Gate the new 'failed' branch on a 24h backoff so permanent render-time
  failures (CA deleted, missing IDP variables) don't loop hourly.
- Add backfillHostMDMManagedCertsFromHostCertsDB: when the toInsert matcher
  in UpdateHostCertificates misses a renewed cert (replica lag, transaction
  race, verified-without-actual-renewal), look up a matching cert in
  host_certificates by the 'fleet-<profile_uuid>' substring and populate
  hmmc. Gated by a 4h grace on hmmc.updated_at so it doesn't clobber the
  in-flight blank-out, and a monotonic-forward predicate so it's idempotent.
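
A rough Go rendering of the cron gating in the second and third bullets, as a sketch only. The real filter presumably lives in SQL and also checks certificate expiry, which is omitted here. `eligibleForRenewal`, `failedRenewalBackoff`, `demoNow`, and the choice of a "last attempt" timestamp as the backoff reference are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// failedRenewalBackoff mirrors the 24h backoff from the commit message.
const failedRenewalBackoff = 24 * time.Hour

// eligibleForRenewal is a hypothetical predicate: only settled delivery
// states qualify, and 'failed' additionally waits out the backoff.
// Which timestamp the backoff is measured from is an assumption.
func eligibleForRenewal(status string, lastAttempt, now time.Time) bool {
	switch status {
	case "verified":
		return true
	case "failed":
		return now.Sub(lastAttempt) >= failedRenewalBackoff
	default:
		return false // any other state: a previous delivery is still in flight
	}
}

// demoNow is a fixed clock so the example is deterministic.
var demoNow = time.Date(2026, 5, 5, 12, 0, 0, 0, time.UTC)

func main() {
	fmt.Println(eligibleForRenewal("verified", demoNow.Add(-time.Hour), demoNow))   // true
	fmt.Println(eligibleForRenewal("failed", demoNow.Add(-time.Hour), demoNow))     // false
	fmt.Println(eligibleForRenewal("failed", demoNow.Add(-25*time.Hour), demoNow))  // true
	fmt.Println(eligibleForRenewal("pending", demoNow.Add(-25*time.Hour), demoNow)) // false
}
```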

Does NOT reintroduce the COALESCE-preserve in BulkUpsertMDMManagedCertificates
or the iOS-only park-at-'verifying' carve-out from #44250 — those broke the
natural cron synchronization gate (reconcile NULLs hmmc → cron's HAVING IS
NOT NULL excludes the row until ingest repopulates).

Resolves #44111
…orenew

# Conflicts:
#	server/datastore/mysql/mdm.go
Implements OpenSpec change extend-scep-cert-matcher.

Drops the standalone backfillHostMDMManagedCertsFromHostCertsDB function
and the call site after withRetryTxx in UpdateHostCertificates. The same
recovery work now happens inside the existing toInsert matcher with no
new database queries: ListHostMDMManagedCertificates is replaced with a
single SELECT that joins to the per-platform profile tables to also
return the delivery status, and the matcher uses two cert pools
selected per hmmc row.

When an hmmc row is stuck (not_valid_after IS NULL, updated_at older
than hmmcBackfillGrace, AND the related profile is in a settled
'verified'/'failed' state), the matcher widens its search from
toInsertBySHA1 to incomingBySHA1 — giving certs that landed in
existingBySHA1 from a prior call (replica lag, race, missed earlier
match) a second chance to update hmmc. Steady-state and in-flight rows
still see only toInsertBySHA1, preserving the original "react to NEW
certs" semantics so a pre-renewal cert still in host_certificates
can't clobber the in-flight blank-out.

Also switches "first match wins" to "best match wins" (latest
not_valid_before among currently-valid candidates) and adds a
monotonic-forward predicate so a stale cert can't regress hmmc.
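
The "best match wins" selection and the monotonic-forward check might look like the sketch below. It is illustrative only: `candidate`, `bestMatch`, `movesForward`, `day`, and `demoNow` are hypothetical names, and comparing on NotValidBefore with a strict "later than" is an assumed reading of the predicate, not something the commit message spells out.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// candidate is a hypothetical stand-in for a host certificate that has
// already matched the hmmc row's renewal identifier.
type candidate struct {
	SHA1           string
	NotValidBefore time.Time
	NotValidAfter  time.Time
}

// bestMatch returns the currently-valid candidate with the latest
// NotValidBefore; ok is false when no candidate is valid at now.
func bestMatch(cands []candidate, now time.Time) (best candidate, ok bool) {
	for _, c := range cands {
		if now.Before(c.NotValidBefore) || now.After(c.NotValidAfter) {
			continue // not yet valid, or already expired
		}
		if !ok || c.NotValidBefore.After(best.NotValidBefore) {
			best, ok = c, true
		}
	}
	return best, ok
}

// movesForward is an assumed form of the monotonic-forward predicate: accept
// the candidate only when the row stores no validity yet, or the candidate
// starts strictly later than the stored NotValidBefore.
func movesForward(storedNotValidBefore *time.Time, c candidate) bool {
	return storedNotValidBefore == nil || c.NotValidBefore.After(*storedNotValidBefore)
}

// demoNow is a fixed clock so the example is deterministic.
var demoNow = time.Date(2026, 5, 5, 12, 0, 0, 0, time.UTC)

func main() {
	preRenewal := candidate{SHA1: "OLD", NotValidBefore: demoNow.Add(-300 * day), NotValidAfter: demoNow.Add(65 * day)}
	renewed := candidate{SHA1: "NEW", NotValidBefore: demoNow.Add(-1 * day), NotValidAfter: demoNow.Add(364 * day)}
	expired := candidate{SHA1: "EXP", NotValidBefore: demoNow.Add(-700 * day), NotValidAfter: demoNow.Add(-335 * day)}

	best, ok := bestMatch([]candidate{preRenewal, expired, renewed}, demoNow)
	fmt.Println(best.SHA1, ok) // NEW true

	stored := renewed.NotValidBefore
	fmt.Println(movesForward(&stored, preRenewal)) // false: would regress the row
	fmt.Println(movesForward(nil, renewed))        // true: row has no validity yet
}
```

With a strict comparison, re-reporting the cert that is already stored fails the predicate, so repeated calls leave the row unchanged.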

MDMManagedCertificate gains an UpdatedAt field and ListHostMDMManagedCertificates
loads it; no callers depended on the prior column set.

Resolves #44111
Base automatically changed from revert-44250-scep-autorenew to main May 4, 2026 19:33
# Conflicts:
#	changes/44111-scep-autorenew-fail
#	server/datastore/mysql/mdm.go
#	server/fleet/mdm.go

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 92.53731% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.79%. Comparing base (8d37ec6) to head (e7db46f).
⚠️ Report is 23 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/datastore/mysql/host_certificates.go | 93.93% | 2 Missing and 2 partials ⚠️ |
| server/datastore/mysql/mdm.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #44691      +/-   ##
==========================================
+ Coverage   66.68%   66.79%   +0.11%     
==========================================
  Files        2651     2641      -10     
  Lines      213567   212959     -608     
  Branches     9767     9401     -366     
==========================================
- Hits       142411   142242     -169     
+ Misses      58186    57756     -430     
+ Partials    12970    12961       -9     
| Flag | Coverage Δ |
| --- | --- |
| backend | 68.56% <92.53%> (+0.01%) ⬆️ |

@mostlikelee

@coderabbitai full review

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Full review triggered.

coderabbitai Bot commented May 4, 2026

Walkthrough

Recovers stuck host_mdm_managed_certificates rows after missed SCEP renewals. The change adds an isSettledStatus helper and an hmmcBackfillGrace constant. UpdateHostCertificates now builds incoming and toInsert maps keyed by normalized SHA1, joins managed-certificate rows with profile delivery statuses, computes per-row settled/stuck state, and chooses the candidate cert pool accordingly. It then filters candidates by renewal identifier and validity window, selects the best candidate by latest NotValidBefore, enforces monotonic-forward updates, and persists changes only when fields differ. The change also adds UpdatedAt to MDMManagedCertificate and includes updated_at in the ListHostMDMManagedCertificates projection.

Possibly related PRs

  • fleetdm/fleet#30578: Modifies UpdateHostCertificates with SHA1-based lookup and filtering logic similar to the toInsertBySHA1 map approach.
  • fleetdm/fleet#44339: Touches UpdateHostCertificates and host_certificates flows (ingestion-origin/deletion changes) and thus overlaps the same function and data paths.
  • fleetdm/fleet#44535: Modifies SCEP managed-certificate renewal/matching and host_mdm_managed_certificates update logic in host_certificates.go.
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description includes related issue (#44111), summarizes the problem and proposed change, documents the cost/performance impact, and notes OpenSpec files for review. However, the description does not check any of the template boxes for testing, database migrations, or other compliance items. | Complete the PR template checklist by checking relevant boxes (automated tests, manual testing, database migration checks) to verify all compliance requirements have been addressed. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title 'Recover stuck SCEP managed-cert state via matcher extension' is clear, concise, and directly summarizes the main change: recovering stuck managed certificates through matcher logic. |
| Linked Issues check | ✅ Passed | The PR addresses issue #44111 by implementing the matcher extension to recover stuck host_mdm_managed_certificates rows with NULL validity fields, matching the stated problem of certificates failing to auto-renew due to missed matching. |
| Out of Scope Changes check | ✅ Passed | All code changes focus on recovering stuck managed certificates: matcher logic updates, test coverage, and schema additions for UpdatedAt timestamp. OpenSpec files are noted as temporary for review reference only. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/datastore/mysql/host_certificates.go`:
- Around line 138-139: The stuck-row recovery logic must not be gated on
toInsert; move the widened-matcher/recovery code out of the if len(toInsert) > 0
branch so it always runs, and make it operate solely using existingBySHA1 (and
the same widened matching logic) to detect renewed certs already present and
append them to hostMDMManagedCertsToUpdate as before; ensure the subsequent
update/DB calls that consume hostMDMManagedCertsToUpdate still run when that
slice is non-empty, and remove any dependence on toInsert length when deciding
to execute the widened matcher.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8a3db9b0-059b-4d4a-9a56-f9d23218ef93

📥 Commits

Reviewing files that changed from the base of the PR and between 8d37ec6 and 366e29b.

⛔ Files ignored due to path filters (4)
  • openspec/changes/extend-scep-cert-matcher/design.md is excluded by !**/*.md
  • openspec/changes/extend-scep-cert-matcher/proposal.md is excluded by !**/*.md
  • openspec/changes/extend-scep-cert-matcher/specs/mdm-cert-state-sync/spec.md is excluded by !**/*.md
  • openspec/changes/extend-scep-cert-matcher/tasks.md is excluded by !**/*.md
📒 Files selected for processing (6)
  • changes/44111-scep-autorenew-fail
  • openspec/changes/extend-scep-cert-matcher/.openspec.yaml
  • server/datastore/mysql/host_certificates.go
  • server/datastore/mysql/host_certificates_test.go
  • server/datastore/mysql/mdm.go
  • server/fleet/apple_mdm.go

Comment thread server/datastore/mysql/host_certificates.go Outdated
@mostlikelee mostlikelee marked this pull request as ready for review May 5, 2026 12:43
@mostlikelee mostlikelee requested a review from a team as a code owner May 5, 2026 12:43
Copilot AI review requested due to automatic review settings May 5, 2026 12:43

@claude claude Bot left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
server/datastore/mysql/host_certificates_test.go (1)

500-508: 💤 Low value

Minor: the "matcher's gate" comment is now stale.

This PR removes the len(toInsert) > 0 gate, so the unrelated cert is no longer needed to make the matcher fire — StableCertListRecovers proves recovery runs with an empty toInsertBySHA1. The helper is still useful for keeping the non-stuck subtests on the narrow-pool branch (pool = toInsertBySHA1), but the parenthetical "(the matcher's gate)" describes pre-PR behavior.

📝 Suggested wording tweak
-	// Adds an unrelated cert to populate toInsert (the matcher's gate).
+	// Adds an unrelated cert so toInsertBySHA1 is non-empty, exercising the
+	// narrow-pool branch for non-stuck rows (distinct from StableCertListRecovers,
+	// which exercises the empty-toInsert recovery path).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/datastore/mysql/host_certificates_test.go` around lines 500 - 508,
Update the inline comment in triggerMatcher to remove the now-stale
parenthetical "(the matcher's gate)" and reword it to reflect current behavior:
note that the unrelated cert is no longer required to make the matcher run (the
len(toInsert) > 0 gate was removed) but the helper still serves to keep
non-stuck subtests on the narrow-pool branch (pool = toInsertBySHA1); reference
triggerMatcher, StableCertListRecovers, toInsertBySHA1 and toInsert in the
comment so future readers understand why the unrelated cert is still being
added.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5298cd37-83ff-4664-aeb1-5d3fac126629

📥 Commits

Reviewing files that changed from the base of the PR and between 366e29b and e7db46f.

📒 Files selected for processing (2)
  • server/datastore/mysql/host_certificates.go
  • server/datastore/mysql/host_certificates_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/datastore/mysql/host_certificates.go

Copilot AI left a comment

Pull request overview

Extends the SCEP managed-certificate matcher in UpdateHostCertificates to recover host_mdm_managed_certificates rows that can get stuck with NULL validity/serial after a renewal cert was ingested but never linked, preventing the renewal cron from ever retrying.

Changes:

  • Always run the managed-cert matcher on UpdateHostCertificates, using a “stuck vs in-flight” heuristic to decide whether to match against the full incoming inventory vs only newly-inserted certs.
  • Add updated_at to fleet.MDMManagedCertificate and to the managed-certs listing query to support the grace-window check.
  • Add datastore tests covering recovery, grace-window protection, monotonic-forward behavior, and DigiCert/pending-profile exclusions.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| server/fleet/apple_mdm.go | Adds UpdatedAt to MDMManagedCertificate to support stuck-row detection. |
| server/datastore/mysql/mdm.go | Includes updated_at in ListHostMDMManagedCertificates SELECT. |
| server/datastore/mysql/host_certificates.go | Implements the widened matcher logic (stuck vs in-flight pools) and monotonic-forward update rule. |
| server/datastore/mysql/host_certificates_test.go | Adds focused coverage for stuck-row recovery and safety gates. |
| openspec/changes/extend-scep-cert-matcher/tasks.md | Adds task checklist (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/specs/mdm-cert-state-sync/spec.md | Adds requirements spec (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/proposal.md | Adds proposal rationale (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/design.md | Adds design notes (currently diverges from implemented behavior). |
| openspec/changes/extend-scep-cert-matcher/.openspec.yaml | OpenSpec metadata file. |
| changes/44111-scep-autorenew-fail | Release note entry for the renewal recovery fix. |

Comment thread openspec/changes/extend-scep-cert-matcher/proposal.md
Comment thread openspec/changes/extend-scep-cert-matcher/design.md
Comment thread openspec/changes/extend-scep-cert-matcher/tasks.md

Development

Successfully merging this pull request may close these issues.

Custom SCEP proxy certificates intermittently failing to auto-renew

3 participants