@tmshort tmshort commented Nov 12, 2025

…ation

This commit fixes OCPBUGS-62943, in which two ClusterExtensionRevisions were created during Helm-to-Boxcutter migration instead of one.

Root causes:

  1. Manifest ordering inconsistency: CRDs from the Helm release manifest and the bundle manifest appeared in different orders, so PhaseSort produced different phase structures even though both contained the same objects.

  2. CollisionProtection mismatch: The migrated revision had collisionProtection=None (needed to adopt Helm-managed resources) while the bundle-generated revision had collisionProtection=Prevent (the default value).

Solution:

  1. Added deterministic sorting in PhaseSort (phase.go):

    • Sort objects within each phase by Group, Version, Kind, Namespace, Name
    • Ensures consistent phase structure regardless of input order
    • Critical for comparing revisions from different sources
  2. Added CollisionProtection preservation (boxcutter.go):

    • New preserveCollisionProtection() function copies CollisionProtection values from current revision to desired revision for matching objects
    • New objectKey() helper generates unique keys based on GVK+namespace+name
    • Called before patching to ensure CollisionProtection values match
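
For illustration, the sorting step in item 1 might look roughly like the sketch below. This is not the code from phase.go: `objRef`, `compareObjRef`, and `sortPhase` are hypothetical names, and a plain struct stands in for the real revision object type.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// objRef is a hypothetical stand-in for the identifying fields of an
// object inside a phase. The real type in the PR is different.
type objRef struct {
	Group, Version, Kind, Namespace, Name string
}

// compareObjRef orders objects by Group, Version, Kind, Namespace, Name,
// returning the first non-zero field comparison.
func compareObjRef(a, b objRef) int {
	for _, c := range []int{
		cmp.Compare(a.Group, b.Group),
		cmp.Compare(a.Version, b.Version),
		cmp.Compare(a.Kind, b.Kind),
		cmp.Compare(a.Namespace, b.Namespace),
		cmp.Compare(a.Name, b.Name),
	} {
		if c != 0 {
			return c
		}
	}
	return 0
}

// sortPhase returns a sorted copy so the caller's slice is left untouched.
func sortPhase(objs []objRef) []objRef {
	out := slices.Clone(objs)
	slices.SortStableFunc(out, compareObjRef)
	return out
}

func main() {
	widgets := objRef{Group: "apiextensions.k8s.io", Version: "v1", Kind: "CustomResourceDefinition", Name: "widgets.example.com"}
	gadgets := objRef{Group: "apiextensions.k8s.io", Version: "v1", Kind: "CustomResourceDefinition", Name: "gadgets.example.com"}

	helmOrder := []objRef{widgets, gadgets}   // order as it might appear in a Helm release manifest
	bundleOrder := []objRef{gadgets, widgets} // order as it might appear in a bundle manifest

	// After sorting, both sources yield the same sequence.
	fmt.Println(slices.Equal(sortPhase(helmOrder), sortPhase(bundleOrder))) // true
}
```

Sorting a copy rather than in place is a choice made for this sketch only; it keeps the example free of side effects.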

With these changes, only a single ClusterExtensionRevision is created during Helm-to-Boxcutter migration, as expected.
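
As a rough sketch of item 2 (not the actual boxcutter.go code), the preservation step could be written as follows. Only the names `preserveCollisionProtection` and `objectKey` come from the description above; their real signatures may differ, and the `object` and `phase` types are simplified, hypothetical stand-ins.

```go
package main

import "fmt"

// object and phase are hypothetical, simplified stand-ins for the
// revision types; they exist only for this sketch.
type object struct {
	Group, Version, Kind, Namespace, Name string
	CollisionProtection                   string
}

type phase struct {
	Name    string
	Objects []object
}

// objectKey builds a key from GVK + namespace + name.
func objectKey(o object) string {
	return fmt.Sprintf("%s/%s/%s/%s/%s", o.Group, o.Version, o.Kind, o.Namespace, o.Name)
}

// preserveCollisionProtection copies CollisionProtection from objects in the
// current revision to matching objects in the desired revision. Objects that
// exist only in the desired revision keep their own value.
func preserveCollisionProtection(current, desired []phase) {
	seen := map[string]string{}
	for _, p := range current {
		for _, o := range p.Objects {
			seen[objectKey(o)] = o.CollisionProtection
		}
	}
	for i := range desired {
		for j := range desired[i].Objects {
			o := &desired[i].Objects[j]
			if cp, ok := seen[objectKey(*o)]; ok {
				o.CollisionProtection = cp
			}
		}
	}
}

func main() {
	crd := object{Group: "apiextensions.k8s.io", Version: "v1", Kind: "CustomResourceDefinition", Name: "widgets.example.com"}

	migrated := crd
	migrated.CollisionProtection = "None" // set during migration to adopt Helm-managed resources
	generated := crd
	generated.CollisionProtection = "Prevent" // default for bundle-generated revisions

	current := []phase{{Name: "crds", Objects: []object{migrated}}}
	desired := []phase{{Name: "crds", Objects: []object{generated}}}

	preserveCollisionProtection(current, desired)
	fmt.Println(desired[0].Objects[0].CollisionProtection) // None
}
```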

🤖 Generated with Claude Code

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort tmshort requested a review from a team as a code owner November 12, 2025 20:46
@openshift-ci openshift-ci bot requested review from dtfranz and trgeiger November 12, 2025 20:46

netlify bot commented Nov 12, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 8b067ab
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/691506e5a7f2bc0008a1d6b6
😎 Deploy Preview https://deploy-preview-2329--olmv1.netlify.app

openshift-ci bot commented Nov 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thetechnick for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.31%. Comparing base (c95fc24) to head (8b067ab).

Files with missing lines                         Patch %   Lines
internal/operator-controller/applier/phase.go    68.96%    5 Missing and 4 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2329   +/-   ##
=======================================
  Coverage   74.30%   74.31%           
=======================================
  Files          91       91           
  Lines        7083     7128   +45     
=======================================
+ Hits         5263     5297   +34     
- Misses       1405     1411    +6     
- Partials      415      420    +5     
Flag               Coverage Δ
e2e                45.36% <0.00%> (-0.29%) ⬇️
experimental-e2e   48.57% <80.00%> (+0.21%) ⬆️
unit               58.59% <62.22%> (-0.01%) ⬇️

@joelanford (Member)

> CollisionProtection mismatch: The migrated revision had collisionProtection=None (needed to adopt Helm-managed resources) while the bundle-generated revision had collisionProtection=Prevent (the default value).

I think this is intentional. We need None in order to adopt the existing objects. But then we want Prevent to ensure that we stop automatically adopting those objects in the future. I'm pretty sure that @thetechnick intended the adoption mechanics of None to be a one-time thing.

@perdasilva (Contributor)

> > CollisionProtection mismatch: The migrated revision had collisionProtection=None (needed to adopt Helm-managed resources) while the bundle-generated revision had collisionProtection=Prevent (the default value).

> I think this is intentional. We need None in order to adopt the existing objects. But then we want Prevent to ensure that we stop automatically adopting those objects in the future. I'm pretty sure that @thetechnick intended the adoption mechanics of None to be a one-time thing.

I think the original behavior makes sense. We shouldn't keep None beyond the initial migration, so that we take a conservative posture by default going forward. Leaving it as None opens the cluster to possible unforeseen pain in the future. I'd also say it's fine to generate a new revision with the new collisionProtection setting to keep an audit trail.

// to ensure consistent ordering regardless of input order. This is critical for
// Helm-to-Boxcutter migration where the same resources may come from different sources
// (Helm release manifest vs bundle manifest) and need to produce identical phases.
func compareClusterExtensionRevision(a, b ocv1.ClusterExtensionRevisionObject) int {

Suggested change
- func compareClusterExtensionRevision(a, b ocv1.ClusterExtensionRevisionObject) int {
+ func compareClusterExtensionRevisionObject(a, b ocv1.ClusterExtensionRevisionObject) int {

Or even just compareRevisionObject?

tmshort commented Nov 13, 2025

> > > CollisionProtection mismatch: The migrated revision had collisionProtection=None (needed to adopt Helm-managed resources) while the bundle-generated revision had collisionProtection=Prevent (the default value).

> > I think this is intentional. We need None in order to adopt the existing objects. But then we want Prevent to ensure that we stop automatically adopting those objects in the future. I'm pretty sure that @thetechnick intended the adoption mechanics of None to be a one-time thing.

> I think the original behavior makes sense. We shouldn't keep None beyond the initial migration, so that we take a conservative posture by default going forward. Leaving it as None opens the cluster to possible unforeseen pain in the future. I'd also say it's fine to generate a new revision with the new collisionProtection setting to keep an audit trail.

So, removing either of the changes will cause the bug (two CERs created) to reappear. If we are OK with this, then we don't necessarily need this fix, although sorting the objects within each phase is probably still a good idea. @dtfranz and I found the same solution to the problem:

> On its own, the manifest sanitization does not completely fix the other issue, but it does make it quite a bit better. Following the repro steps in OCPBUGS-62943 on main, I get infinite additional CERs. When I add my fix for OPRUN-4238, I only get one additional CER. This extra CER is being created due to the diff between revisions created by GenerateRevisionFromHelmRelease and GenerateRevision; the former sets the CollisionProtection flag, and the latter does not. The objects within the phases may also need to be sorted. If you resolve those differences, the extra CER does not get generated. I'm not 100% sure of the fix method (or, as @perdasilva mentioned, if it's even an issue that needs fixing), but I'm pretty sure we need to disregard the CollisionProtection flag when comparing observed vs desired CERs. If we do that, then the controller should understand that no new revision is required.

If we're good with having two initial CERs upon transition from the Helm applier to the Boxcutter applier, then I can reduce this PR to just the sorting changes and remove the CollisionProtection change. I would then close the bug as intentional.
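
The alternative quoted above, disregarding CollisionProtection when comparing observed and desired CERs, could be sketched along these lines. Every name here (`revisionObject`, `withoutCollisionProtection`, `equalIgnoringCollisionProtection`) is hypothetical and only illustrates the idea; the controller's actual comparison logic is not shown in this thread.

```go
package main

import (
	"fmt"
	"slices"
)

// revisionObject is a hypothetical, flattened view of one object in a
// revision: an identity key, its manifest content, and the flag in question.
type revisionObject struct {
	Key                 string
	Manifest            string
	CollisionProtection string
}

// withoutCollisionProtection returns a copy with the flag cleared on every
// object, so it no longer takes part in equality checks.
func withoutCollisionProtection(objs []revisionObject) []revisionObject {
	out := make([]revisionObject, len(objs))
	for i, o := range objs {
		o.CollisionProtection = ""
		out[i] = o
	}
	return out
}

// equalIgnoringCollisionProtection reports whether two object lists match
// once the CollisionProtection flag is disregarded.
func equalIgnoringCollisionProtection(observed, desired []revisionObject) bool {
	return slices.Equal(withoutCollisionProtection(observed), withoutCollisionProtection(desired))
}

func main() {
	observed := []revisionObject{{Key: "crd/widgets.example.com", Manifest: "spec: v1", CollisionProtection: "None"}}
	desired := []revisionObject{{Key: "crd/widgets.example.com", Manifest: "spec: v1", CollisionProtection: "Prevent"}}

	// Only the flag differs, so no new revision would be needed.
	fmt.Println(equalIgnoringCollisionProtection(observed, desired)) // true

	// A real content change is still detected.
	desired[0].Manifest = "spec: v2"
	fmt.Println(equalIgnoringCollisionProtection(observed, desired)) // false
}
```

Note that this approach assumes both lists are already in the same order, which is where the deterministic sorting would still matter.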

3 participants