[occm] Increase backoff for octavia loadbalancer deletion #3072

nicolaiort · 2026-02-04T10:14:33Z

What this PR does / why we need it:

This PR increases the backoff used to wait for octavia lb deletes.
With the timeout/backoff config in the current release (1.35.0) and latest (built from master) the deletion took longer than the backoff leading to an error and zombie security groups were left after LB deletion (leading to quotas being filled up).

Which issue this PR fixes(if applicable):
I can open an issue if wanted/needed but the fix was pretty simple

Special notes for reviewers:

Feel free to suggest other ways of combating this behavior.
This was the simplest and non-breaking solution i could think of.

Release note:

NONE

linux-foundation-easycla · 2026-02-04T10:14:42Z

k8s-ci-robot · 2026-02-04T10:14:43Z

Welcome @nicolaiort!

It looks like this is your first PR to kubernetes/cloud-provider-openstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-openstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-02-04T10:14:44Z

Hi @nicolaiort. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

dulek · 2026-02-04T10:19:35Z

How did it happen that orphaned SGs were left? ensureLoadBalancerDeleted was interrupted somehow? It think the idea of deleting the SGs last comes from the fact that we can't have the LB exposed without the SG even during deletion - it's a security matter.

nicolaiort · 2026-02-04T10:26:10Z

No direct interruption of ensureLoadBalancerDeleted as far as I am aware of.
With logging set to Debug (level 4) the last message was "Waiting for load balancer to be deleted" followed by a "Finished updating Service" and then

"Unhandled Error" err="error processing service default/test (retrying with exponential backoff): failed to delete load balancer: loadbalancer failed to delete within the allotted time" logger="UnhandledError"

That one might indicate an early exit but I'm new to the codebase and currently trying to learn all about it.

nicolaiort · 2026-02-04T10:34:47Z

If this is the source of an early cancel, adjusting the backoff settings for the wait might be another solution but I only found hardcoded values for those.

dulek · 2026-02-04T10:57:57Z

Ah, most likely Octavia still deletes the LB, but request times out on CPO side. Then it finishes in the background, so LB is gone, but CPO never learned about it. Then ensureLoadBalancerDeleted is called again, but finishes early here as LB is deleted.

If you can confirm this is the case, then the proper fix is to actually cleanup the SG in that place linked above. It's weird nobody hit this earlier, it's a pretty obvious bug here, thanks for raising it.

dulek · 2026-02-04T10:59:11Z

/ok-to-test

So that you can work with CI, but please update the patch as discussed above.

nicolaiort · 2026-02-04T11:21:24Z

That sounds reasonable, thank you for the feedback.
I had a previous revision of this PR (never managed to get pushed) that also moved the SG delete there.
I'll patch it, test the new behavior and get back to you.

nicolaiort · 2026-02-04T11:40:40Z

Hm interesting: I moved the original SG delete back down to where it originated and added a second one before the early nil return and we're back to the original behavior: LB get's deleted (yes I really took a while - 1min19sec to be precise) but SG still lingers around.
If i add a log statement to the early return it never logs that line.
Give me a second and i'll push my current change.

nicolaiort · 2026-02-04T11:46:53Z

I honestly don't know how the PR managed to close itself, my bad
I'll reopen (or I can open a new PR with the changes if that is a better fit for your workflow)

nicolaiort · 2026-02-04T12:05:18Z

/retest-required

dulek · 2026-02-04T12:05:42Z

Does the LoadBalancer object get deleted in K8s in this case? My understanding is that when LB fails to be deleted it should keep the finalizer and try again a bit later, it shouldn't allow it to be deleted and be gone.

nicolaiort · 2026-02-04T12:06:41Z

The service of type LoadBalancer get's deleted (that's what irritates me)

nicolaiort · 2026-02-04T12:14:35Z

Other steps that are executed before the LB delete - e.G. releasing the floating ip - also work.
The only step seemingly being skipped is the SG delete (that's why i moved it above the LB in my initial PR).

dulek · 2026-02-04T12:19:35Z

Ouch, this sucks: https://github.com/kubernetes/cloud-provider/blob/master/controllers/service/controller.go#L375-L392. Obviously GetLoadBalancer will return nil here, so it doesn't even bother.

We could maybe hack this by returning non-nil when LB is gone, but SG still exists, but that feels very hacky and brittle as GetLoadBalancer is used in many places.

Can you start by trying to tweak this: https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/util/openstack/loadbalancer.go#L53 for your env? Looks like the time it's waiting for is around 45 seconds, sounds like a bit low for Octavia.

I'm still opposed to just moving the SG deletion, but I understand the point, I'm looking at alternative approaches.

nicolaiort · 2026-02-04T12:31:45Z

I picked 24 steps based on a gut estimation of 2mins max and that did the trick.
It had to wait for 1min2sec between the "Waiting for the load balancer deleted" and "Load balancer deleted".

This would be a fix for my use-case but I'm with you about the possibility of a better solution existing (magic numbers in code are always kinda sketchy).

dulek · 2026-02-04T13:05:13Z

Yeah, would probably be best to make this configurable. Please update the PR title and description and we can merge this.

nicolaiort · 2026-02-04T13:15:04Z

I updated both, thank you for your quick responses and helpful feedback.
I might look into making it configurable in the future, but I'll open another PR when that day arrives.

dulek · 2026-02-04T13:47:11Z

/lgtm
/approve

k8s-ci-robot · 2026-02-04T13:47:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dulek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dulek]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dulek · 2026-02-04T13:53:18Z

Ah, openstack-cloud-controller-manager-e2e-test has to succeed and it's failing to download k3s. Unfortunately I don't have time to work to fix this at the moment, hopefully someone will grab this.

nicolaiort · 2026-02-04T14:05:45Z

I believe it worked in a previous run.
Maybe a simple /retest-required does it's magic

nicolaiort · 2026-02-04T14:12:34Z

/retest-required

k8s-ci-robot · 2026-02-04T14:52:55Z

@nicolaiort: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
openstack-cloud-csi-manila-e2e-test	`76ec1c2`	link	true	`/test openstack-cloud-csi-manila-e2e-test`
openstack-cloud-controller-manager-e2e-test	`76ec1c2`	link	true	`/test openstack-cloud-controller-manager-e2e-test`
openstack-cloud-csi-cinder-e2e-test	`76ec1c2`	link	true	`/test openstack-cloud-csi-cinder-e2e-test`
openstack-cloud-controller-manager-ovn-e2e-test	`76ec1c2`	link	true	`/test openstack-cloud-controller-manager-ovn-e2e-test`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 4, 2026

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 4, 2026

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Feb 4, 2026

k8s-ci-robot requested review from anguslees and dulek February 4, 2026 10:14

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 4, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 4, 2026

nicolaiort closed this Feb 4, 2026

nicolaiort force-pushed the master branch from b0b66c5 to a031201 Compare February 4, 2026 11:44

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 4, 2026

fix(loadbalancer): Added SG delete to ealry return if lb does not exist

19e7b4a

nicolaiort reopened this Feb 4, 2026

chore(loadbalancers): Remove log for early return

b2b9378

fix(loadbalancer): Increase delete timeout by tweaking steps

76ec1c2

nicolaiort changed the title ~~[occm] Delete security groups before lb~~ [occm] Increase backoff for octavia loadbalancer deletion Feb 4, 2026

k8s-ci-robot assigned dulek Feb 4, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2026

[occm] Increase backoff for octavia loadbalancer deletion #3072

Are you sure you want to change the base?

[occm] Increase backoff for octavia loadbalancer deletion #3072

Uh oh!

Conversation

nicolaiort commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

linux-foundation-easycla bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

dulek commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

nicolaiort commented Feb 4, 2026

Uh oh!

k8s-ci-robot commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nicolaiort commented Feb 4, 2026 •

edited

Loading

linux-foundation-easycla bot commented Feb 4, 2026 •

edited

Loading

dulek commented Feb 4, 2026 •

edited

Loading

nicolaiort commented Feb 4, 2026 •

edited

Loading

nicolaiort commented Feb 4, 2026 •

edited

Loading