OCPBUGS-65634: UPSTREAM: <carry>: add service account to curl job #653

Open
ehearne-redhat wants to merge 1 commit into openshift:main from ehearne-redhat:add-service-account-curl-job-2

Conversation

@ehearne-redhat
Contributor

@ehearne-redhat ehearne-redhat commented Mar 2, 2026

This PR addresses the revert that occurred in #638.

We believe the changes should ensure the Pending state of verify pods does not happen, because:

  1. There is one DeferCleanup() block being used to handle job and service account cleanup.
  2. The job is forcefully deleted: deletePolicy := metav1.DeletePropagationForeground ensures that all dependent pods (there should be only one) are deleted, and gracePeriod := int64(0) ensures an instant deletion. The job can be forcefully deleted because DeferCleanup() would not occur unless the job successfully ran.
  3. The Eventually() block between the job deletion and service account deletion will prevent service account deletion from occurring if there was an issue with deleting the job. This means that any pods still running associated with that job should not get stuck in a Pending state for service account related reasons.
  4. Each job gets a unique ID so that jobs that run in parallel won't collide in terms of resource usage.

We are hopeful that these changes will not cause another revert, and the commit should now use proper commit message formatting.
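
The ordering argued for in points 2 and 3 can be sketched with a small stdlib-only model. This is not the PR's code: `fakeCluster`, `job`, `waitFor`, and `cleanup` are hypothetical names, and the in-memory maps stand in for the API server. The sketch only illustrates the idea that the ServiceAccount is removed after the job is confirmed gone, and is kept if the wait times out.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// job is a hypothetical stand-in for a Job that lingers while its pod is torn down.
type job struct {
	deleting  bool
	pollsLeft int // existence checks that still see the job after deletion is requested
}

// fakeCluster is an in-memory model, not a real Kubernetes client.
type fakeCluster struct {
	jobs            map[string]*job
	serviceAccounts map[string]bool
}

// deleteJob only requests deletion; the job disappears later.
func (c *fakeCluster) deleteJob(name string) {
	if j, ok := c.jobs[name]; ok {
		j.deleting = true
	}
}

// jobGone reports whether the job no longer exists.
func (c *fakeCluster) jobGone(name string) bool {
	j, ok := c.jobs[name]
	if !ok {
		return true
	}
	if j.deleting {
		if j.pollsLeft == 0 {
			delete(c.jobs, name)
			return true
		}
		j.pollsLeft--
	}
	return false
}

// waitFor polls cond up to attempts times, playing the role of the Eventually() block.
func waitFor(cond func() bool, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		if cond() {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for condition")
}

// cleanup deletes the job, waits for it to vanish, and only then deletes the ServiceAccount.
func cleanup(c *fakeCluster, jobName, saName string) error {
	c.deleteJob(jobName)
	if err := waitFor(func() bool { return c.jobGone(jobName) }, 10, time.Millisecond); err != nil {
		return fmt.Errorf("job %q still present, keeping ServiceAccount %q: %w", jobName, saName, err)
	}
	delete(c.serviceAccounts, saName)
	return nil
}

func main() {
	c := &fakeCluster{
		jobs:            map[string]*job{"curl-job-abc12": {pollsLeft: 2}},
		serviceAccounts: map[string]bool{"curl-sa": true},
	}
	fmt.Println("cleanup error:", cleanup(c, "curl-job-abc12", "curl-sa"))
	fmt.Println("ServiceAccount still present:", c.serviceAccounts["curl-sa"])
}
```

In the real test the same shape would presumably be expressed with DeferCleanup() and Eventually(); the model just makes the dependency between the two deletions explicit.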

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 2, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 2, 2026
@openshift-ci-robot

@ehearne-redhat: This pull request references Jira Issue OCPBUGS-65634, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from grokspawn and jianzhangbjz March 2, 2026 14:27
@ehearne-redhat ehearne-redhat force-pushed the add-service-account-curl-job-2 branch from 10893ab to 44e5264 Compare March 2, 2026 14:46
@ehearne-redhat ehearne-redhat changed the title from "[WIP] OCPBUGS-65634: UPSTREAM: <carry>: add service account to curl job" to "OCPBUGS-65634: UPSTREAM: <carry>: add service account to curl job" Mar 3, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
// Create the Job
job := buildCurlJob(jobNamePrefix, "default", serviceURL, serviceAccount.Name)
err = k8sClient.Create(ctx, job)
Expect(err).NotTo(HaveOccurred(), "failed to create Job")
Contributor

As it stands, the service account might be left hanging around if there is an issue creating the job. If something deletes the namespace, that will likely clean up the service account. We may want to either add a comment calling this out or, if there is an error creating the job, delete the service account before raising the error.

Contributor Author

I don't think it is possible to delete the default ns.

ehearne-mac:~ ehearne$ kubectl delete ns default
Error from server (Forbidden): namespaces "default" is forbidden: this namespace may not be deleted

If we were to move the job to a separate namespace just for running the jobs, and then when they are all done, delete the ns, then we can clean up completely.

Otherwise we could delete the service account when there is an error creating.

Contributor

Oh, my bad. I forgot this was in the default ns. Still, the point may stand: if there's an error creating the job, the service account might not get cleaned up. So you may want to either register a cleanup fn (if it's LIFO) or check for job creation errors and clean up the service account before raising the error.

Contributor Author

That's a good point. Thanks for that!

The conformance test failed on things besides this issue, so I take that as a good sign! I have moved the defer cleanup to just after job instantiation. I'll wait to see what the tests say. If they are good, I'll re-ping for review and if good, I will run another aggregate job just in case.
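
The reviewer's point about registering cleanup early can be illustrated with a stdlib-only model of a LIFO cleanup registry. `cleanupStack` and `runSpec` are hypothetical names used for illustration and are not Ginkgo's API; the sketch only shows why a cleanup registered before job creation still runs when that creation fails.

```go
package main

import (
	"errors"
	"fmt"
)

// errCreateFailed is a hypothetical error simulating a failed Job creation.
var errCreateFailed = errors.New("failed to create Job")

// cleanupStack is a hypothetical LIFO registry, modelling how deferred cleanups unwind.
type cleanupStack struct{ fns []func() }

func (s *cleanupStack) register(fn func()) { s.fns = append(s.fns, fn) }

// run executes the registered callbacks in last-in-first-out order.
func (s *cleanupStack) run() {
	for i := len(s.fns) - 1; i >= 0; i-- {
		s.fns[i]()
	}
}

// runSpec records the order of operations for one simulated test run.
func runSpec(createJob func() error) (events []string, err error) {
	stack := &cleanupStack{}
	defer stack.run()

	events = append(events, "create ServiceAccount")
	// Registered before the Job exists, so it runs even if Job creation fails.
	stack.register(func() { events = append(events, "delete ServiceAccount") })

	if err = createJob(); err != nil {
		return events, err
	}
	events = append(events, "create Job")
	stack.register(func() { events = append(events, "delete Job") })
	return events, nil
}

func main() {
	ok, _ := runSpec(func() error { return nil })
	fmt.Println(ok)
	failed, err := runSpec(func() error { return errCreateFailed })
	fmt.Println(failed, err)
}
```

On the success path the job is deleted before the service account because its cleanup was registered last; on the failure path only the service account cleanup exists, and it still runs.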

@ehearne-redhat
Contributor Author

/payload-aggregate 4.22 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2026

@ehearne-redhat: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@ehearne-redhat
Contributor Author

/test ?

@ehearne-redhat
Contributor Author

/payload-aggregate openshift-e2e-aws 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2026

@ehearne-redhat: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@ehearne-redhat
Contributor Author

/payload-aggregate pull-ci-openshift-operator-framework-operator-controller-main-openshift-e2e-aws 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2026

@ehearne-redhat: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@ehearne-redhat
Contributor Author

/payload-aggregate ?

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2026

@ehearne-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@ehearne-redhat
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2026

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aa6f9100-16eb-11f1-8196-e62b76003c3a-0

@ehearne-redhat
Contributor Author

Since the CI Image issue is fixed:

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2026

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aeec5390-17bc-11f1-94d3-08f7c2305c14-0

@ehearne-redhat
Contributor Author

We'll try again since there was a successful run with this test today.

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2026

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6e092170-17ce-11f1-8136-2a053cae4c37-0

@ehearne-redhat
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2026

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c6d112a0-17eb-11f1-8031-0715f06dd96d-0

@ehearne-redhat ehearne-redhat force-pushed the add-service-account-curl-job-2 branch from 44e5264 to 63fa9fb Compare March 5, 2026 11:30
@grokspawn
Contributor

/retest-required

@ehearne-redhat ehearne-redhat force-pushed the add-service-account-curl-job-2 branch from 63fa9fb to a599c04 Compare March 6, 2026 10:11
@ehearne-redhat
Contributor Author

/test verify-commits

@ehearne-redhat
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 6, 2026
@openshift-ci-robot

@ehearne-redhat: This pull request references Jira Issue OCPBUGS-65634, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @kuiwang02

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from kuiwang02 March 6, 2026 14:15
@perdasilva
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehearne-redhat, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2026
@ehearne-redhat
Contributor Author

I'll go ahead and run the aggregate tests one more time - thanks so much @perdasilva for the approve!

/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10

@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

@ehearne-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e9720420-1969-11f1-8099-5aea677a4869-0

@grokspawn
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@ehearne-redhat
Contributor Author

@grokspawn thanks so much for the LGTM! I'll await the aggregate results before verifying. :)

Comment on lines -97 to +127
- Fail(fmt.Sprintf("Job failed: %s", c.Message))
+ StopTrying(fmt.Sprintf("Job failed: %s", c.Message)).Now()
Contributor

@everettraven everettraven Mar 6, 2026

Nit: Is this change necessary? What does StopTrying().Now() get you over Fail?

Contributor Author

As far as I am aware, StopTrying() is better practice than Fail() in Eventually() blocks.

https://onsi.github.io/gomega/#bailing-out-early---polling-functions
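
To make the bail-out idea concrete, here is a stdlib-only model of a polling loop that stops as soon as the polled function reports a terminal condition. The `stopTrying` type, `poll`, and `jobCheck` are hypothetical and are not Gomega's implementation, and the failure message is made up for illustration. The sketch only shows why a terminal "Job failed" signal is useful inside a polling block: polling ends on the attempt that sees the failure instead of running until the timeout.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stopTrying is a hypothetical sentinel error meaning "polling cannot succeed, give up now".
type stopTrying struct{ msg string }

func (e stopTrying) Error() string { return e.msg }

// poll calls check until it reports done, returns a stopTrying error, or attempts run out.
// It returns how many attempts were made.
func poll(attempts int, interval time.Duration, check func() (bool, error)) (int, error) {
	for i := 1; i <= attempts; i++ {
		done, err := check()
		var stop stopTrying
		if errors.As(err, &stop) {
			return i, stop // terminal: do not wait for the timeout
		}
		if done {
			return i, nil
		}
		time.Sleep(interval) // anything else is treated as "not yet"
	}
	return attempts, errors.New("timed out")
}

// jobCheck builds a check that walks through a scripted sequence of job states,
// repeating the last state once the script is exhausted.
func jobCheck(states []string) func() (bool, error) {
	i := 0
	return func() (bool, error) {
		state := states[i]
		if i < len(states)-1 {
			i++
		}
		switch state {
		case "Complete":
			return true, nil
		case "Failed":
			return false, stopTrying{msg: "Job failed: illustrative message"}
		default:
			return false, nil
		}
	}
}

func main() {
	n, err := poll(50, time.Millisecond, jobCheck([]string{"Running", "Failed"}))
	fmt.Printf("stopped after %d attempts: %v\n", n, err)
}
```

With the scripted states above, the loop ends on the second attempt rather than exhausting all fifty.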

})
if err != nil && !apierrors.IsNotFound(err) {
	Expect(err).NotTo(HaveOccurred(), "failed to delete Job")
}
Contributor

@everettraven everettraven Mar 6, 2026

Are all errors other than NotFound that can be returned from a forceful delete attempt something we consider terminal and should not retry the delete?

Contributor Author

This makes sense. We can definitely make this less rigid, maybe utilising Eventually() or something similar to account for this scenario. That way, if race conditions occur, we can poll for a successful deletion.

I'll wait for the aggregate tests to come back and gauge from there.

Contributor Author

Looks like the aggregate tests failed after 10 minutes due to infrastructure issues. :D

I'll implement this suggestion. Good catch - another way to ensure less flakiness. :)
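
One way to picture the suggestion is a delete that is retried until it either succeeds or reports that the object is already gone. The sketch below is a stdlib-only model under assumptions: `errNotFound` and `errConflict` are hypothetical stand-ins for API server responses, and `deleteWithRetry` and `scripted` are illustrative helpers, not the code that was pushed to this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical error values standing in for API server responses.
var (
	errNotFound = errors.New("not found")
	errConflict = errors.New("conflict")
)

// deleteWithRetry retries del until it succeeds or reports not-found.
// Any other error is treated as possibly transient and retried.
func deleteWithRetry(attempts int, interval time.Duration, del func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		lastErr = del()
		if lastErr == nil || errors.Is(lastErr, errNotFound) {
			return nil // already gone counts as success
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("delete failed after %d attempts: %w", attempts, lastErr)
}

// scripted returns a delete function that replays the given results in order,
// repeating the last one once the script is exhausted.
func scripted(results ...error) func() error {
	i := 0
	return func() error {
		err := results[i]
		if i < len(results)-1 {
			i++
		}
		return err
	}
}

func main() {
	fmt.Println(deleteWithRetry(5, time.Millisecond, scripted(errConflict, errConflict, nil)))
	fmt.Println(deleteWithRetry(5, time.Millisecond, scripted(errConflict, errNotFound)))
	fmt.Println(deleteWithRetry(3, time.Millisecond, scripted(errConflict)))
}
```

Whether every non-NotFound error deserves a retry is exactly the judgment call raised in the review; a stricter version could return immediately on errors known to be permanent.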

@ehearne-redhat ehearne-redhat force-pushed the add-service-account-curl-job-2 branch from a599c04 to 6b9cb16 Compare March 6, 2026 16:30
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

New changes are detected. LGTM label has been removed.

@tmshort
Contributor

tmshort commented Mar 9, 2026

/retest

@ehearne-redhat
Contributor Author

@tmshort it looks like my excessive cleanups aren't working. :) I'll look into it and fix.

@ehearne-redhat ehearne-redhat force-pushed the add-service-account-curl-job-2 branch from 6b9cb16 to 608f5f0 Compare March 10, 2026 10:43
@ehearne-redhat ehearne-redhat force-pushed the add-service-account-curl-job-2 branch from 608f5f0 to c8a069c Compare March 10, 2026 13:00
@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

@ehearne-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/openshift-e2e-aws | c8a069c | link | true | /test openshift-e2e-aws |
| ci/prow/e2e-aws-upgrade-ovn-single-node | c8a069c | link | false | /test e2e-aws-upgrade-ovn-single-node |
| ci/prow/e2e-aws-olmv1-ext | c8a069c | link | true | /test e2e-aws-olmv1-ext |
| ci/prow/e2e-aws-techpreview-olmv1-ext | c8a069c | link | true | /test e2e-aws-techpreview-olmv1-ext |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

// guaranteeing the ServiceAccount isn't deleted while Pods are still using it
deletePolicy := metav1.DeletePropagationForeground
gracePeriod := int64(0)
// Poll for service account deletion - in case we have race condtions
Member

I guess this is waiting for job deletion, not the SA. There is also a typo. If so, it should be: Poll for job deletion - in case we have race conditions.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
