Skip to content

Add drift detection and periodic resync for OpenStack resources#759

Open
eshulman2 wants to merge 10 commits into
k-orc:mainfrom
eshulman2:forge/aisos-376
Open

Add drift detection and periodic resync for OpenStack resources#759
eshulman2 wants to merge 10 commits into
k-orc:mainfrom
eshulman2:forge/aisos-376

Conversation

@eshulman2

@eshulman2 eshulman2 commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements periodic resync and external deletion handling for the OpenStack Resource Controller. When enabled, ORC periodically reconciles resources with OpenStack to detect configuration drift and external deletions, automatically recreating managed resources that were deleted outside of ORC while surfacing terminal errors for imported resources.

Changes

API Changes

  • Added resyncPeriod field to the code generator spec template for per-resource resync configuration
  • Added lastSyncTime field to the code generator status template to track successful reconciliation timestamps
  • Added GetResyncPeriod(), GetLastSyncTime(), and IsImported() methods to APIObjectAdapter interface

CLI & Manager

  • Added --default-resync-period CLI flag with default value of 0 (disabled)
  • Added ResyncConfigurable interface to propagate global default to all controllers

Core Resync Logic

  • Created internal/controllers/generic/resync/ package with:
    • DetermineResyncPeriod(): resolves effective period (per-resource → global → disabled)
    • CalculateJitteredDuration(): wraps wait.Jitter with 20% positive-only jitter
    • ShouldScheduleResync(): guards against scheduling for disabled/terminal/pending-requeue states
  • Modified shouldReconcile() to trigger reconciliation when resync period has elapsed
  • Modified reconcileNormal() to schedule jittered requeue after successful reconciliation

External Deletion Handling

  • Modified GetOrCreateOSResource() to detect 404s when fetching by status.id
  • Returns typed ExternallyDeleted signal via ReconcileStatus for managed non-imported resources
  • Restored safety guard for unexpected nil returns from actuators
  • Imported/unmanaged resources return terminal error
  • Added ClearStatusID() to status writer for clearing ID before recreation

Status Updates

  • Modified UpdateStatus() to set lastSyncTime only on successful reconciliation
  • Fixed Available condition: terminal errors now set False instead of Unknown

Code Generation

  • Updated api.template and adapter.template with new fields and methods
  • Regenerated all resources, CRDs, deep copy, and client code

Documentation

  • Created website/docs/user-guide/drift-detection.md
  • Updated enhancements/drift-detection.md with implementation status

Commit structure

  1. API fields + templates — hand-written type and template changes
  2. Regenerate — all generated code (CRDs, adapters, apply configs, OpenAPI, docs)
  3. Resync scheduling — scheduler, controller integration, CLI flag, all controller wiring
  4. E2E tests for resync — kuttl test suites + integration tests
  5. External deletion — detection, recreation, condition fix, unit + E2E tests
  6. Documentation — user guide and enhancement doc
  7. Design update — update enhancement proposal with typed ExternallyDeleted signal

Design decisions

  • Typed ExternallyDeleted signal: The original proposal used a (nil, nil) return from GetOrCreateOSResource to signal external deletion. This was replaced with a typed ExternallyDeleted flag on ReconcileStatus, which avoids ambiguity with actuator bugs that might also return nil and restores a safety guard for unexpected nil returns from GetOSResourceByID.
  • wait.Jitter reuse: CalculateJitteredDuration wraps k8s.io/apimachinery/pkg/util/wait.Jitter rather than reimplementing the same [base, base*1.2) jitter range.
  • Removed dead CommonStatus/CommonOptions types: These types were declared but never embedded — the code generator inlines the fields directly. They were removed to avoid misleading readers.

Testing

  • Unit tests for resync package, shouldReconcile, status updates, external deletion (all policy/import combinations)
  • 6 kuttl E2E test suites: resync-period, resync-disabled, resync-terminal-error, resync-jitter, external-deletion, external-deletion-import
  • go build, go vet, make lint all pass

@github-actions github-actions Bot added the semver:minor Backwards-compatible change label Apr 19, 2026
@eshulman2 eshulman2 changed the title [AISOS-376] Implementation for AISOS-376 [AISOS-376] Add drift detection and periodic resync for OpenStack resources Apr 20, 2026

@eshulman2 eshulman2 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some issues especially with docs

Comment thread internal/controllers/generic/reconciler/controller.go Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions.go Outdated
Comment thread internal/controllers/generic/reconciler/controller.go Outdated
Comment thread api/v1alpha1/controller_options.go Outdated
Comment thread internal/controllers/generic/status/status_test.go Outdated
Comment thread internal/controllers/generic/resync/scheduler_test.go Outdated
Comment thread enhancements/drift-detection.md Outdated
@mandre mandre added the hold Do not merge label Apr 23, 2026
@mandre

mandre commented Apr 23, 2026

Copy link
Copy Markdown
Collaborator

Adding the hold label because this is an experiment.

@eshulman2 eshulman2 changed the title [AISOS-376] Add drift detection and periodic resync for OpenStack resources Add drift detection and periodic resync for OpenStack resources May 25, 2026
@eshulman2 eshulman2 force-pushed the forge/aisos-376 branch 4 times, most recently from 9138a1d to da9fb9b Compare May 31, 2026 08:28
@eshulman2

Copy link
Copy Markdown
Contributor Author

@mandre I revised the PR myself removing any indication of internal references I think it is ready for review

Comment thread .gitignore Outdated
@eshulman2 eshulman2 requested a review from dlaw4608 June 9, 2026 06:42
eshulman2 added 8 commits June 9, 2026 09:53
Add resyncPeriod to the code generator spec template allowing
per-resource configuration of how frequently the controller
re-reconciles even when no changes are detected. Add lastSyncTime to the
status template to track when the last successful resync occurred.

Update the adapter template and interface to expose GetResyncPeriod(),
GetLastSyncTime(), and IsImported() methods.
Run make generate to update all generated code after API field and
template changes.
Add --default-resync-period CLI flag to the manager for setting a global
default resync period. Implement DetermineResyncPeriod to resolve the
effective period using per-resource then global default precedence.

Create resync scheduler package using wait.Jitter for jittered duration
calculation. Modify shouldReconcile to support periodic resync by
checking lastSyncTime against the effective resync period.

Integrate resync requeue scheduling into reconcileNormal and update the
status writer to set lastSyncTime on successful reconciliation. Wire
the DefaultResyncPeriod from manager options through to all controllers.

Add typed ExternallyDeleted signal on ReconcileStatus for safe
detection of externally deleted OpenStack resources.
Add E2E tests covering resync period, resync jitter, resync disabled,
and terminal error behavior during resync. Add integration tests for
handling unmanaged resources during resync to verify that imported
resources without resyncPeriod skip requeue scheduling.
Modify GetOrCreateOSResource to detect when a managed, non-imported
resource has been deleted externally from OpenStack. Signal the caller
via the typed ExternallyDeleted ReconcileStatus so it can clear
status.id and trigger recreation. Restore the safety guard for
unexpected nil returns from actuators.

Fix Available condition to correctly transition from Unknown to False
when a terminal error is present during resync.

Include unit tests for external deletion handling and E2E tests
covering both managed resource recreation and imported resource
terminal error behavior.
Add user-facing documentation covering periodic resync configuration,
drift detection behavior, and external deletion handling. Update the
enhancement proposal to reflect the final two-tier resync period
resolution design.
Replace the (nil, nil) sentinel pattern with a typed ExternallyDeleted
signal on ReconcileStatus. This is a design improvement over the
original proposal: using a dedicated signal prevents ambiguity with
actuator bugs that might also return nil, and restores the safety guard
for unexpected nil returns from GetOSResourceByID.

@dlaw4608 dlaw4608 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @eshulman2 , I ran it locally following the drift-detection.md and it works well, also ran some of the E2E tests, I left a few comments, please let me know what you think?

Comment thread website/docs/user-guide/drift-detection.md Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions.go Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions_test.go Outdated
Comment thread internal/controllers/generic/reconciler/resource_actions_test.go Outdated
Comment thread internal/controllers/generic/resync/scheduler.go Outdated
@eshulman2 eshulman2 requested a review from dlaw4608 June 10, 2026 13:24
- Simplify external-deletion condition: managed+import is rejected by
  CEL validation, so !IsImported() was redundant. Remove the dead check
  and delete the three unit tests that exercised the impossible scenario.
- Fix drift-detection.md: remove impossible managed+import rows from the
  summary table and the invalid YAML example showing managed+import.
- Strip TS-XXX internal labels from comments across all affected files.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread website/docs/user-guide/drift-detection.md
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@eshulman2 eshulman2 requested a review from dlaw4608 June 11, 2026 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hold Do not merge semver:minor Backwards-compatible change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants