[occm] Increase backoff for octavia loadbalancer deletion #3072
Conversation
|
Welcome @nicolaiort! |
|
Hi @nicolaiort. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
How did it happen that orphaned SGs were left? |
|
No direct interruption of [...]. That one might indicate an early exit, but I'm new to the codebase and currently trying to learn all about it. |
|
If this is the source of an early cancel, adjusting the backoff settings for the wait might be another solution, but I only found hardcoded values for those. |
|
Ah, most likely Octavia still deletes the LB, but the request times out on the CPO side. Then it finishes in the background, so the LB is gone, but CPO never learned about it. If you can confirm this is the case, then the proper fix is to actually clean up the SG in that place linked above. It's weird nobody hit this earlier; it's a pretty obvious bug, thanks for raising it. |
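For readers following along, here is a minimal Go sketch of the suspected failure mode described above. All of the function names (`ensureLoadBalancerDeleted`, `deleteLoadBalancer`, `waitLoadBalancerDeleted`, `deleteSecurityGroup`) are hypothetical stand-ins for illustration, not the actual cloud-provider-openstack code:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real Octavia/Neutron calls. The wait stub
// always returns an error here to demonstrate the timeout path.
func deleteLoadBalancer(lbID string) error      { return nil }
func waitLoadBalancerDeleted(lbID string) error { return errors.New("timed out waiting for delete") }
func deleteSecurityGroup(sgID string) error     { return nil }

// ensureLoadBalancerDeleted illustrates the suspected control flow: if the
// wait for the Octavia delete times out, the function returns early and the
// security group cleanup below never runs. Octavia still finishes the delete
// in the background, so on the next sync the LB is already gone and the
// cleanup is skipped again, leaving the SG orphaned.
func ensureLoadBalancerDeleted(lbID, sgID string) error {
	if err := deleteLoadBalancer(lbID); err != nil {
		return err
	}
	if err := waitLoadBalancerDeleted(lbID); err != nil {
		return err // early exit: deleteSecurityGroup is never reached
	}
	return deleteSecurityGroup(sgID)
}

func main() {
	if err := ensureLoadBalancerDeleted("lb-id", "sg-id"); err != nil {
		fmt.Println("deletion failed, SG left behind:", err)
	}
}
```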
|
/ok-to-test So that you can work with CI, but please update the patch as discussed above. |
|
That sounds reasonable, thank you for the feedback. |
|
Hm interesting: I moved the original SG delete back down to where it originated and added a second one before the early return. |
|
I honestly don't know how the PR managed to close itself, my bad |
|
/retest-required |
|
Does the [...]? |
|
The Service of type LoadBalancer gets deleted (that's what irritates me). |
|
Other steps that are executed before the LB delete, e.g. releasing the floating IP, also work. |
|
Ouch, this sucks: https://github.com/kubernetes/cloud-provider/blob/master/controllers/service/controller.go#L375-L392. We could maybe hack this by returning non-nil when the LB is gone but the SG still exists, but that feels very hacky and brittle. Can you start by trying to tweak this for your env: https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/util/openstack/loadbalancer.go#L53? Looks like the time it's waiting for is around 45 seconds, which sounds a bit low for Octavia. I'm still opposed to just moving the SG deletion, but I understand the point; I'm looking at alternative approaches. |
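To estimate what tweaking that line does, here is a small sketch, assuming the wait is built on `wait.ExponentialBackoff` from `k8s.io/apimachinery/pkg/util/wait`. The `Duration`/`Factor`/`Steps` values below are assumptions for illustration (check the real constants in `loadbalancer.go`), and jitter is ignored:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// totalSleep approximates the worst-case time wait.ExponentialBackoff can
// spend sleeping: it performs b.Steps condition checks with a sleep between
// them, each sleep Factor times longer than the previous one (jitter ignored).
func totalSleep(b wait.Backoff) time.Duration {
	total := time.Duration(0)
	d := b.Duration
	for i := 0; i < b.Steps-1; i++ {
		total += d
		d = time.Duration(float64(d) * b.Factor)
	}
	return total
}

func main() {
	// Assumed values for illustration only; the real constants live in loadbalancer.go.
	current := wait.Backoff{Duration: 1 * time.Second, Factor: 1.2, Steps: 12}
	bumped := current
	bumped.Steps = 24 // e.g. bump the step count to wait longer for slow deletions

	fmt.Println("current worst-case wait:", totalSleep(current).Round(time.Second))
	fmt.Println("bumped worst-case wait:", totalSleep(bumped).Round(time.Second))
}
```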
|
I picked 24 steps based on a gut estimate of 2 minutes max, and that did the trick. This would be a fix for my use case, but I'm with you on the possibility of a better solution existing (magic numbers in code are always kinda sketchy). |
|
Yeah, would probably be best to make this configurable. Please update the PR title and description and we can merge this. |
|
I updated both, thank you for your quick responses and helpful feedback. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dulek. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
|
Ah, [...] |
|
I believe it worked in a previous run. |
|
/retest-required |
|
@nicolaiort: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What this PR does / why we need it:
This PR increases the backoff used to wait for Octavia LB deletions.
With the timeout/backoff config in the current release (1.35.0) and in latest (built from master), the deletion took longer than the backoff. That led to an error, and zombie security groups were left behind after LB deletion, eventually filling up quotas.
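For illustration, the change is roughly of this shape; the constant names and the other values here are assumptions, not the exact diff:

```go
// Illustration only: assumed constant names and values.
// See pkg/util/openstack/loadbalancer.go for the real definitions.
package openstack

import "time"

const (
	waitLoadbalancerInitDelay   = 1 * time.Second
	waitLoadbalancerFactor      = 1.2
	waitLoadbalancerDeleteSteps = 24 // increased so slow Octavia deletions can finish
)
```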
Which issue this PR fixes (if applicable):
I can open an issue if wanted/needed, but the fix was pretty simple.
Special notes for reviewers:
Feel free to suggest other ways of combating this behavior.
This was the simplest non-breaking solution I could think of.
Release note: