OCPBUGS-65634: UPSTREAM: <carry>: add service account to curl job #653
ehearne-redhat wants to merge 1 commit into openshift:main from
Conversation
@ehearne-redhat: This pull request references Jira Issue OCPBUGS-65634, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
10893ab to 44e5264 (force-pushed)
// Create the Job
job := buildCurlJob(jobNamePrefix, "default", serviceURL, serviceAccount.Name)
err = k8sClient.Create(ctx, job)
Expect(err).NotTo(HaveOccurred(), "failed to create Job")
As it stands, you might have the service account hanging around if there is an issue creating the job. If we have something deleting the namespace, then that will likely clean up the service account. We may want to either add a comment calling this out, or, if there is an error creating the job, delete the service account before raising the error.
I don't think it is possible to delete the default ns.
ehearne-mac:~ ehearne$ kubectl delete ns default
Error from server (Forbidden): namespaces "default" is forbidden: this namespace may not be deleted
If we were to move the job to a separate namespace used just for running the jobs, and delete that namespace when they are all done, we could clean up completely.
Otherwise we could delete the service account when there is an error creating it.
Oh, my bad. I forgot this was in the default ns. Though the point may still stand: if there's an error creating the job, the service account might not get cleaned up. So you may want to either register a cleanup fn (if it's LIFO) or check for job creation errors and clean up the service account before raising the error.
That's a good point. Thanks for that!
The conformance test failed on things besides this issue, so I take that as a good sign! I have moved the defer cleanup to just after job instantiation. I'll wait to see what the tests say. If they are good, I'll re-ping for review and if good, I will run another aggregate job just in case.
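For context on the cleanup ordering discussed above: Ginkgo's DeferCleanup runs registered functions in LIFO order, so registering the service-account cleanup immediately after the account is created (before attempting the job) guarantees it fires even when job creation fails. A minimal plain-Go sketch of that LIFO pattern; the cleanupStack type and the printed resource names are hypothetical, for illustration only:

```go
package main

import "fmt"

// cleanupStack mimics DeferCleanup semantics: functions are registered
// as resources are created and executed in LIFO order, so dependents
// are torn down before the things they depend on.
type cleanupStack struct {
	fns []func()
}

func (c *cleanupStack) Register(fn func()) { c.fns = append(c.fns, fn) }

func (c *cleanupStack) RunAll() {
	for i := len(c.fns) - 1; i >= 0; i-- { // LIFO: last registered runs first
		c.fns[i]()
	}
}

func main() {
	var cleanups cleanupStack

	// Create the ServiceAccount and register its cleanup immediately,
	// BEFORE attempting to create the Job. If job creation then fails,
	// the service account still gets removed.
	fmt.Println("create ServiceAccount")
	cleanups.Register(func() { fmt.Println("delete ServiceAccount") })

	fmt.Println("create Job")
	cleanups.Register(func() { fmt.Println("delete Job") })

	cleanups.RunAll()
}
```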
/payload-aggregate 4.22 10

@ehearne-redhat: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

/test ?

/payload-aggregate openshift-e2e-aws 10

@ehearne-redhat: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

/payload-aggregate pull-ci-openshift-operator-framework-operator-controller-main-openshift-e2e-aws 10

@ehearne-redhat: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

/payload-aggregate ?

@ehearne-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aa6f9100-16eb-11f1-8196-e62b76003c3a-0

Since the CI Image issue is fixed: /payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aeec5390-17bc-11f1-94d3-08f7c2305c14-0

We'll try again since there was a successful run with this test today. /payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6e092170-17ce-11f1-8136-2a053cae4c37-0

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c6d112a0-17eb-11f1-8031-0715f06dd96d-0
44e5264 to 63fa9fb (force-pushed)
/retest-required
63fa9fb to a599c04 (force-pushed)
/test verify-commits

/jira refresh
@ehearne-redhat: This pull request references Jira Issue OCPBUGS-65634, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ehearne-redhat, perdasilva. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I'll go ahead and run the aggregate tests one more time - thanks so much @perdasilva for the approve! /payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e9720420-1969-11f1-8099-5aea677a4869-0

/lgtm

@grokspawn thanks so much for the LGTM! I'll await the aggregate results before verifying. :)
Fail(fmt.Sprintf("Job failed: %s", c.Message))
StopTrying(fmt.Sprintf("Job failed: %s", c.Message)).Now()
Nit: Is this change necessary? What does StopTrying().Now() get you over Fail?
As far as I am aware, StopTrying() is better practice than Fail() in Eventually() blocks.
https://onsi.github.io/gomega/#bailing-out-early---polling-functions
})
if err != nil && !apierrors.IsNotFound(err) {
	Expect(err).NotTo(HaveOccurred(), "failed to delete Job")
}
Are all errors other than NotFound that can be returned from a forceful delete attempt something we consider terminal and should not retry the delete?
This makes sense. We could make this less strict, perhaps utilising Eventually() or something similar to account for this scenario. That way, if race conditions occur, we can poll for a successful deletion.
I'll wait for the aggregate tests to come back and gauge from there.
Looks like the aggregate tests failed after 10 minutes due to infrastructure issues. :D
I'll implement this suggestion. Good catch - another way to ensure less flakiness. :)
a599c04 to 6b9cb16 (force-pushed)
New changes are detected. LGTM label has been removed.
/retest
@tmshort it looks like my excessive cleanups aren't working. :) I'll look into it and fix.
6b9cb16 to 608f5f0 (force-pushed)
608f5f0 to c8a069c (force-pushed)
@ehearne-redhat: The following tests failed, say
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
// guaranteeing the ServiceAccount isn't deleted while Pods are still using it
deletePolicy := metav1.DeletePropagationForeground
gracePeriod := int64(0)
// Poll for service account deletion - in case we have race condtions
I guess this is waiting for job deletion, not SA deletion. And there is a typo. If so, the comment should read: Poll for job deletion - in case we have race conditions.
This PR addresses the revert that occurred in #638.
We believe the changes should ensure the Pending state of verify pods does not happen, because:
- A DeferCleanup() block is being used to handle job and service account cleanup.
- deletePolicy := metav1.DeletePropagationForeground ensures all dependent pods (of which there should be only one) are deleted, and gracePeriod := int64(0) ensures an instant deletion. The job can be forcefully deleted, because DeferCleanup() would not occur unless the job successfully ran.
- An Eventually() block between the job deletion and the service account deletion will prevent service account deletion from occurring if there was an issue with deleting the job. This means that any pods still running that are associated with that job should not get stuck in a Pending state for service-account-related reasons.
We are hopeful that these changes will not cause a revert, and the PR now uses proper commit message formatting.