Skip to content

Conversation

@nicolaiort
Copy link

@nicolaiort nicolaiort commented Feb 4, 2026

What this PR does / why we need it:

This PR increases the backoff used to wait for octavia lb deletes.
With the timeout/backoff config in the current release (1.35.0) and latest (built from master) the deletion took longer than the backoff leading to an error and zombie security groups were left after LB deletion (leading to quotas being filled up).

Which issue this PR fixes(if applicable):
I can open an issue if wanted/needed but the fix was pretty simple

Special notes for reviewers:

Feel free to suggest other ways of combating this behavior.
This was the simplest and non-breaking solution i could think of.

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 4, 2026
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Feb 4, 2026

CLA Signed

@k8s-ci-robot
Copy link
Contributor

Welcome @nicolaiort!

It looks like this is your first PR to kubernetes/cloud-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 4, 2026
@k8s-ci-robot
Copy link
Contributor

Hi @nicolaiort. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Feb 4, 2026
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 4, 2026
@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

How did it happen that orphaned SGs were left? ensureLoadBalancerDeleted was interrupted somehow? It think the idea of deleting the SGs last comes from the fact that we can't have the LB exposed without the SG even during deletion - it's a security matter.

@nicolaiort
Copy link
Author

No direct interruption of ensureLoadBalancerDeleted as far as I am aware of.
With logging set to Debug (level 4) the last message was "Waiting for load balancer to be deleted" followed by a "Finished updating Service" and then

"Unhandled Error" err="error processing service default/test (retrying with exponential backoff): failed to delete load balancer: loadbalancer failed to delete within the allotted time" logger="UnhandledError"

That one might indicate an early exit but I'm new to the codebase and currently trying to learn all about it.

@nicolaiort
Copy link
Author

If this is the source of an early cancel, adjusting the backoff settings for the wait might be another solution but I only found hardcoded values for those.

@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

Ah, most likely Octavia still deletes the LB, but request times out on CPO side. Then it finishes in the background, so LB is gone, but CPO never learned about it. Then ensureLoadBalancerDeleted is called again, but finishes early here as LB is deleted.

If you can confirm this is the case, then the proper fix is to actually cleanup the SG in that place linked above. It's weird nobody hit this earlier, it's a pretty obvious bug here, thanks for raising it.

@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

/ok-to-test

So that you can work with CI, but please update the patch as discussed above.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 4, 2026
@nicolaiort
Copy link
Author

That sounds reasonable, thank you for the feedback.
I had a previous revision of this PR (never managed to get pushed) that also moved the SG delete there.
I'll patch it, test the new behavior and get back to you.

@nicolaiort
Copy link
Author

Hm interesting: I moved the original SG delete back down to where it originated and added a second one before the early nil return and we're back to the original behavior: LB get's deleted (yes I really took a while - 1min19sec to be precise) but SG still lingers around.
If i add a log statement to the early return it never logs that line.
Give me a second and i'll push my current change.

@nicolaiort nicolaiort closed this Feb 4, 2026
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 4, 2026
@nicolaiort
Copy link
Author

nicolaiort commented Feb 4, 2026

I honestly don't know how the PR managed to close itself, my bad
I'll reopen (or I can open a new PR with the changes if that is a better fit for your workflow)

@nicolaiort nicolaiort reopened this Feb 4, 2026
@nicolaiort
Copy link
Author

/retest-required

@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

Does the LoadBalancer object get deleted in K8s in this case? My understanding is that when LB fails to be deleted it should keep the finalizer and try again a bit later, it shouldn't allow it to be deleted and be gone.

@nicolaiort
Copy link
Author

The service of type LoadBalancer get's deleted (that's what irritates me)

@nicolaiort
Copy link
Author

nicolaiort commented Feb 4, 2026

Other steps that are executed before the LB delete - e.G. releasing the floating ip - also work.
The only step seemingly being skipped is the SG delete (that's why i moved it above the LB in my initial PR).

@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

Ouch, this sucks: https://github.com/kubernetes/cloud-provider/blob/master/controllers/service/controller.go#L375-L392. Obviously GetLoadBalancer will return nil here, so it doesn't even bother.

We could maybe hack this by returning non-nil when LB is gone, but SG still exists, but that feels very hacky and brittle as GetLoadBalancer is used in many places.

Can you start by trying to tweak this: https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/util/openstack/loadbalancer.go#L53 for your env? Looks like the time it's waiting for is around 45 seconds, sounds like a bit low for Octavia.

I'm still opposed to just moving the SG deletion, but I understand the point, I'm looking at alternative approaches.

@nicolaiort
Copy link
Author

I picked 24 steps based on a gut estimation of 2mins max and that did the trick.
It had to wait for 1min2sec between the "Waiting for the load balancer deleted" and "Load balancer deleted".

This would be a fix for my use-case but I'm with you about the possibility of a better solution existing (magic numbers in code are always kinda sketchy).

@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

Yeah, would probably be best to make this configurable. Please update the PR title and description and we can merge this.

@nicolaiort nicolaiort changed the title [occm] Delete security groups before lb [occm] Increase backoff for octavia loadbalancer deletion Feb 4, 2026
@nicolaiort
Copy link
Author

I updated both, thank you for your quick responses and helpful feedback.
I might look into making it configurable in the future, but I'll open another PR when that day arrives.

@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2026
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dulek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2026
@dulek
Copy link
Contributor

dulek commented Feb 4, 2026

Ah, openstack-cloud-controller-manager-e2e-test has to succeed and it's failing to download k3s. Unfortunately I don't have time to work to fix this at the moment, hopefully someone will grab this.

@nicolaiort
Copy link
Author

I believe it worked in a previous run.
Maybe a simple /retest-required does it's magic

@nicolaiort
Copy link
Author

/retest-required

@k8s-ci-robot
Copy link
Contributor

@nicolaiort: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
openstack-cloud-csi-manila-e2e-test 76ec1c2 link true /test openstack-cloud-csi-manila-e2e-test
openstack-cloud-controller-manager-e2e-test 76ec1c2 link true /test openstack-cloud-controller-manager-e2e-test
openstack-cloud-csi-cinder-e2e-test 76ec1c2 link true /test openstack-cloud-csi-cinder-e2e-test
openstack-cloud-controller-manager-ovn-e2e-test 76ec1c2 link true /test openstack-cloud-controller-manager-ovn-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants