Skip to content

[GOBBLIN-2259] Coerce $UNKNOWN ExecutionStatus to PENDING in FlowExec…#4187

Open
DaisyModi wants to merge 1 commit intoapache:masterfrom
DaisyModi:GOBBLIN-2259-unknown-executionstatus-500
Open

[GOBBLIN-2259] Coerce $UNKNOWN ExecutionStatus to PENDING in FlowExec…#4187
DaisyModi wants to merge 1 commit intoapache:masterfrom
DaisyModi:GOBBLIN-2259-unknown-executionstatus-500

Conversation

@DaisyModi
Copy link
Copy Markdown
Contributor

…utionResource

ExecutionStatus.$UNKNOWN is a Pegasus in-memory sentinel, not a symbol declared in ExecutionStatus.pdl. When it reaches the response, Rest.li rejects it during serialization and returns HTTP 500 for the entire collection - a single poisoned record takes down every execution in the same latestFlowExecution batch.

Guard at the REST boundary: coerce $UNKNOWN to PENDING before building the FlowExecution response, and log a WARN so the underlying data-quality issue (missing flow-level status event or schema skew) stays observable.

Added tests:

  • testConvertFlowStatusCoercesUnknownToPending
  • testConvertFlowStatusPreservesValidStatus

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@DaisyModi DaisyModi force-pushed the GOBBLIN-2259-unknown-executionstatus-500 branch 3 times, most recently from d250e65 to 1ffcb20 Compare April 20, 2026 14:09
if (flowExecutionStatus == ExecutionStatus.$UNKNOWN) {
log.warn("FlowExecution {}/{}/{} has $UNKNOWN flow status; coercing to PENDING. Check state store for data quality issue.",
flowId.getFlowGroup(), flowId.getFlowName(), monitoringFlowStatus.getFlowExecutionId());
flowExecutionStatus = ExecutionStatus.PENDING;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why PENDING over RUNNING as the coercion target? RUNNING might be a closer semantic match than PENDING, since the flow is clearly active enough for executions to appear in the collection response.

}

@Test
public void testConvertFlowStatusPreservesValidStatus() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a terminal-status test case to testConvertFlowStatusPreservesValidStatus?

…utionResource

ExecutionStatus.$UNKNOWN is a Pegasus in-memory sentinel, not a symbol declared
in ExecutionStatus.pdl. When it reaches the response, Rest.li rejects it during
serialization and returns HTTP 500 for the entire collection — a single poisoned
record takes down every execution in the same latestFlowExecution batch.

This can happen when no flow-level (NA/NA) status event was persisted for an
execution, e.g., because ReevaluateDagProc early-returned without emitting the
flow-level event while a concurrent DagProc cleaned up the Dag. The result is a
"zombie" execution: job-level terminal rows exist in the state store, but no NA
row, so JobStatusRetriever.getFlowStatusFromJobStatuses returns its $UNKNOWN
default.

Guard at the REST boundary: coerce $UNKNOWN to PENDING before building the
FlowExecution response, and log a WARN so the underlying data-quality issue
stays observable. PENDING is the least-misleading valid enum value for "flow
exists but terminal state unknown" — polling clients continue polling, so we
never falsely report a terminal status we cannot verify.

Added tests:
- testConvertFlowStatusCoercesUnknownToPending
- testConvertFlowStatusPreservesValidStatus
@DaisyModi DaisyModi force-pushed the GOBBLIN-2259-unknown-executionstatus-500 branch from 1ffcb20 to 4a55a67 Compare April 21, 2026 06:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants