Fix KubernetesJobTrigger hang for parallelism > completions case (#64867) #65058
Open
holmuk wants to merge 1 commit into apache:main
Conversation
Contributor
Pull request overview
This PR fixes a deferrable KubernetesJobOperator / KubernetesJobTrigger hang when Kubernetes parallelism > completions by making trigger completion primarily driven by the Job’s terminal state (Complete/Failed) rather than waiting for every pod from an initial “snapshot” to reach a terminal state. It also adds regression tests to cover the reported scenario (#64867).
Changes:
- Reworks `KubernetesJobTrigger.run()` to wait for Job completion concurrently and collect XCom from pods on a best-effort basis (skipping missing/deleted pods).
- Adds regression tests to ensure the trigger doesn't hang when some snapshot pods never complete or are deleted.
- Adds an operator regression test verifying `execute_complete` tolerates partial XCom results.
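To make the described control flow concrete, here is a minimal sketch of the job-first, best-effort approach. Every name below (`job_is_done`, `read_pod_xcom`, `run_trigger`) is a hypothetical stand-in invented for illustration, not the provider's actual API:

```python
import asyncio

# Hypothetical stand-in: pretend the Job API reports a terminal state.
async def job_is_done(job_name):
    return {"status": "Complete"}

# Hypothetical stand-in: one snapshot pod was already deleted, so reading
# it fails, much like a 404 from the Kubernetes API would.
async def read_pod_xcom(pod_name):
    if pod_name == "worker-2":
        raise LookupError(f"pod {pod_name} not found")
    return {"pod": pod_name, "value": 42}

async def run_trigger(job_name, pod_names):
    # 1. Completion is driven by the Job's terminal state, not by waiting
    #    for every snapshot pod to finish.
    status = (await job_is_done(job_name))["status"]
    # 2. XCom is collected best-effort: missing/deleted pods are skipped
    #    instead of blocking task finalization.
    xcom_result = []
    for pod in pod_names:
        try:
            xcom_result.append(await read_pod_xcom(pod))
        except LookupError:
            continue  # stale snapshot pod; skip it
    return {"status": status, "xcom_result": xcom_result}

result = asyncio.run(run_trigger("demo-job", ["worker-1", "worker-2"]))
print(result)  # → {'status': 'Complete', 'xcom_result': [{'pod': 'worker-1', 'value': 42}]}
```

Note the result is deliberately partial (one XCom entry for two snapshot pods), which mirrors the parallelism > completions scenario this PR targets.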
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/triggers/job.py | Changes trigger control flow to be job-first and makes XCom extraction best-effort without blocking task finalization. |
| providers/cncf/kubernetes/tests/unit/cncf/kubernetes/triggers/test_job.py | Adds async regression tests for parallelism > completions pod snapshot edge cases and updates job polling assertion. |
| providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py | Adds regression test ensuring execute_complete handles partial XCom payload lists. |
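The last table row's claim, that `execute_complete` handles partial XCom payload lists, can be illustrated with a small hypothetical sketch (this is not the operator's actual implementation, just the contract it needs to satisfy):

```python
# Hypothetical sketch of an execute_complete-style handler that tolerates a
# partial xcom_result (fewer entries than snapshot pods); not the real code.
def execute_complete(event, snapshot_pods):
    xcom_result = event.get("xcom_result") or []
    # parallelism > completions: missing entries are expected, not an error.
    by_pod = {entry["pod"]: entry["value"] for entry in xcom_result}
    return {pod: by_pod.get(pod) for pod in snapshot_pods}

event = {"status": "success",
         "xcom_result": [{"pod": "worker-1", "value": 42}]}
print(execute_complete(event, ["worker-1", "worker-2"]))
# → {'worker-1': 42, 'worker-2': None}
```

The key point is indexing by whatever entries arrived rather than assuming one entry per snapshot pod.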
Force-pushed from 475be15 to 55ab20b
jscheffl approved these changes on Apr 11, 2026

jscheffl (Contributor) left a comment:
Looks good to me but I am not really an expert with K8s Jobs, so I have a hard time judging details of the fix. Looking for a second maintainer review.
Closes #64867
Was generative AI tooling used to co-author this PR?
Cursor
This PR resolves the hanging `Running` state issue in `KubernetesJobOperator` / `KubernetesJobTrigger` for `deferrable=True` / `do_xcom_push=True`.

Problem description
The trigger waits for container completion for every pod name from a precomputed snapshot (`pod_names`) before checking the final Job status. That snapshot is built from pod discovery tied to `parallelism`, not to actual successful completions.
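Sketched as a toy asyncio model (with made-up helper names; this is not the provider's code), the pre-fix flow hangs because Job-status evaluation sits behind the per-pod waits:

```python
import asyncio

# Hypothetical stand-in: worker-2 never reaches a terminal container state,
# e.g. because the Job already hit completions=1 and this pod lingers.
async def wait_pod_terminal(pod_name):
    if pod_name == "worker-2":
        await asyncio.sleep(3600)
    return pod_name

async def old_trigger(pod_names):
    for pod in pod_names:        # blocks forever on worker-2 ...
        await wait_pod_terminal(pod)
    return "job status checked"  # ... so the Job status is never evaluated

async def main():
    try:
        # A short timeout stands in for "forever" so the sketch terminates.
        await asyncio.wait_for(old_trigger(["worker-1", "worker-2"]), timeout=0.2)
        return "completed"
    except asyncio.TimeoutError:
        return "hung"

print(asyncio.run(main()))  # → hung
```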
Example (`parallelism=2`, `completions=1`): one pod succeeds and the Job becomes `Complete` (`completions=1` reached), but `KubernetesJobTrigger` keeps waiting on the second pod and does not reach Job-status evaluation, so the task can remain Running/Deferred forever.

Proposed fix: task completion should be driven by Job terminal status (`Complete` / `Failed`), which already reflects `completions`.

What does this PR do?
Updates the logic of `KubernetesJobTrigger`.

Regression tests for #64867
Trigger regression tests (`triggers/test_job.py`):

- `test_run_completes_when_job_is_done_even_if_some_snapshot_pods_never_complete`: verifies the trigger does not hang when a pod from the initial snapshot never reaches a terminal state after the Job is already complete.
- `test_run_skips_deleted_snapshot_pod_and_completes_when_job_is_done`: verifies the trigger handles stale snapshot pods gracefully by skipping `404 Not Found` pods and still finishing successfully with the available XCom results.
- `test_run_collects_later_pod_xcom_best_effort_after_job_done`: verifies post-completion best-effort behavior: once the Job is already complete, the trigger continues processing remaining snapshot pods, skips per-pod extraction failures, and still returns XCom from pods that can be read.

Operator regression test (`operators/test_job.py`):

- `test_execute_complete_supports_partial_xcom_results`: verifies `execute_complete` correctly handles partial `xcom_result` payloads (fewer XCom entries than the initial pod snapshot), which is expected in `parallelism > completions` scenarios.

Additional tests for new code:

- `test_wait_until_container_state_or_job_done_does_not_restart_wait_task`: Copilot pointed out that a naive implementation of the waiting loop may not work as expected on slow clusters because of constant coroutine retrying. The test validates that we don't recreate `wait_method` on every tick on a slow cluster.

Behavior change
- `xcom_result` may be partial (fewer entries than the initial `pod_names`); this is expected.
- Missing/deleted pods (`404`) do not fail task completion.

Risks
- With small `poll_interval` values, the new bounded wait loop may generate extra timeout/cancel/retry iterations while waiting for pod container states. This does not fail the task by itself (it is expected retry behavior), but it can increase polling overhead and log noise until the Job reaches a terminal state.
- Per-pod outcomes are surfaced as (`succeeded`, `skipped_missing`, `timed_out`, `failed_other`).

A newsfragment, named `{pr_number}.significant.rst`, in airflow-core/newsfragments, can be added in a follow-up commit after the PR is created so you know the PR number.
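The bounded wait loop behind the first risk can be sketched roughly as follows (illustrative names and shapes, assumed for this sketch rather than taken from the provider). The per-pod wait task is created once and merely re-polled every `poll_interval`, so a slow cluster costs extra ticks, not restarted coroutines:

```python
import asyncio

# Illustrative sketch (assumed shape, not the provider's function): race one
# per-pod wait against a Job-done signal, re-checking every poll_interval.
async def wait_container_or_job_done(pod_wait, job_done, poll_interval=0.05):
    task = asyncio.ensure_future(pod_wait)  # created once, never recreated
    try:
        while not task.done():
            if job_done.is_set():
                return "job_done_first"
            # Bounded wait: wake up every poll_interval to re-check the Job.
            # These periodic wakeups are the extra polling overhead noted in
            # the risks above.
            await asyncio.wait({task}, timeout=poll_interval)
        return "pod_done_first"
    finally:
        if not task.done():
            task.cancel()

async def main():
    job_done = asyncio.Event()
    job_done.set()  # Job already terminal; a stuck pod must not block us
    return await wait_container_or_job_done(asyncio.sleep(3600), job_done)

print(asyncio.run(main()))  # → job_done_first
```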