Correct Helix Job Monitor final-summary work-item accounting #16989
Open
Copilot wants to merge 3 commits into
Copilot AI changed the title from "[WIP] Fix work item count mismatch in Helix Job Monitor final summary" to "Correct Helix Job Monitor final-summary work-item accounting" on Jun 8, 2026
Pull request overview
This PR updates Helix Job Monitor’s end-of-run reporting so the final summary work-item counts align with the monitor’s processed-job reconciliation (rather than deduplicated logical outcome keys), and so the final aggregated failed-work-item block retains failures for same-named work items across different Helix jobs.
Changes:
- Final summary work-item totals now use new processed counters (`ProcessedWorkItemCount`, `FailedProcessedWorkItemCount`) instead of the deduplicated `WorkItemOutcomes.Count`.
- Failed-work-item console aggregation is now keyed per Helix job + work item to avoid collapsing same-named failures across jobs.
- Adds a regression test covering two same-queue Helix jobs failing the same work item name and asserting both failures appear in the final failure block and counts.
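
To illustrate why the two totals can differ, here is a minimal sketch rather than the PR's actual code. The `Observation` record and the `CountingSketch` class are hypothetical names introduced for illustration, and keying the deduplicated outcomes by work-item name alone is an assumption; only the distinction between counting deduplicated outcome keys and counting every processed work item comes from the overview above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of one work item seen while reconciling a completed Helix job.
public sealed record Observation(string JobName, string WorkItemName, bool Failed);

public static class CountingSketch
{
    // Deduplicated view: one entry per logical key (assumed here to be the
    // work-item name), so same-named work items from different jobs collapse.
    public static int CountDedupedOutcomes(IEnumerable<Observation> observations)
    {
        var outcomes = new Dictionary<string, bool>(StringComparer.Ordinal);
        foreach (var o in observations)
        {
            outcomes[o.WorkItemName] = o.Failed; // last observation wins
        }
        return outcomes.Count;
    }

    // Processed view: every reconciled work item increments the counters,
    // regardless of whether another job used the same work-item name.
    public static (int Processed, int FailedProcessed) CountProcessed(IEnumerable<Observation> observations)
    {
        int processed = 0, failedProcessed = 0;
        foreach (var o in observations)
        {
            processed++;
            if (o.Failed)
            {
                failedProcessed++;
            }
        }
        return (processed, failedProcessed);
    }
}
```

For two jobs that each fail a work item with the same name, the deduplicated view in this sketch reports one outcome, while the processed view reports two processed and two failed work items. That is the kind of mismatch the regression test targets.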
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs | Adds regression coverage for two same-queue jobs failing the same work-item name and validates final reporting. |
| src/Microsoft.DotNet.Helix/JobMonitor/StatusReporter.cs | Switches final summary work-item counts to use processed reconciliation counters. |
| src/Microsoft.DotNet.Helix/JobMonitor/MonitorState.cs | Adds processed work-item counters and changes failed-console info keying to per-job granularity. |
| src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs | Increments new processed work-item counters during completed-job reconciliation and updates failure tracking call-site. |
Comments suppressed due to low confidence (1)
src/Microsoft.DotNet.Helix/JobMonitor/MonitorState.cs:220
- Keying failed-console info by `(HelixJobName, WorkItemName)` prevents a later retry (new Helix job name) from clearing a failure recorded for the same logical work item in the same submitter chain. That means a resubmitted work item that eventually passes can still show up in the final `Failed work item console logs` block (and be logged as an AzDO error), contradicting the method/doc intent (“Removal happens when a later incarnation passes”). The failure aggregation needs to preserve per-job visibility and still prune superseded failures when a newer attempt in the same chain passes.
    /// <summary>
    /// Tracks (or removes) the per-failure console-info record for a single observed
    /// work item. Removal happens when a later incarnation passes.
    /// </summary>
    public void TrackFailedWorkItemConsoleInfo(HelixJobInfo helixJob, WorkItemSummary workItem)
    {
        var key = (helixJob.JobName, workItem.Name);
        if (workItem.IsFailed)
        {
            FailedWorkItemConsoleInfo[key] = new FailedWorkItemConsoleInfo(
                helixJob.DisplayName,
                workItem.Name,
                workItem.FormattedState,
                GetConsoleOutputText(workItem.ConsoleOutputUri));
        }
        else
        {
            FailedWorkItemConsoleInfo.Remove(key);
        }
    }
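
One possible way to satisfy both requirements in the comment above (per-job visibility and pruning of superseded failures) is sketched below. This is an illustration under stated assumptions, not the monitor's implementation: `FailureTrackerSketch` and `FailureInfo` are hypothetical types, and the sketch assumes every job can be mapped to a stable chain key shared by its resubmissions. Failures are stored per (chain, job, work item), and a pass removes every failure recorded for that work item within the same chain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the console-info record.
public sealed record FailureInfo(string JobName, string WorkItemName, string State);

public sealed class FailureTrackerSketch
{
    // Keyed per chain + job + work item, so same-named failures in different
    // jobs stay visible as separate entries.
    private readonly Dictionary<(string ChainKey, string JobName, string WorkItemName), FailureInfo> _failures = new();

    public IReadOnlyCollection<FailureInfo> Failures => _failures.Values;

    public void Track(string chainKey, string jobName, string workItemName, bool isFailed, string state)
    {
        if (isFailed)
        {
            _failures[(chainKey, jobName, workItemName)] = new FailureInfo(jobName, workItemName, state);
            return;
        }

        // A pass supersedes earlier failures of the same work item in the same
        // chain (assumption: the pass is observed after those failures).
        var superseded = _failures.Keys
            .Where(k => k.ChainKey == chainKey && k.WorkItemName == workItemName)
            .ToList();
        foreach (var key in superseded)
        {
            _failures.Remove(key);
        }
    }
}
```

In this sketch, failures of the same work-item name in two unrelated chains remain as two entries, while a passing retry clears only the failures from its own chain. The sketch depends on the order in which results are observed. A real implementation would also need to decide how to treat a pass that arrives before a late failure report.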
Comment on lines +59 to 65
     /// Latest known console-link information for every failed work item attempt, keyed by
     /// (JobName, WorkItemName) where <c>JobName</c> is the Helix job name. Used to build
     /// the final aggregated failure report
     /// without collapsing failures across different jobs.
     /// </summary>
    -public Dictionary<(string ChainKey, string WorkItemName), FailedWorkItemConsoleInfo> FailedWorkItemConsoleInfo { get; }
    +public Dictionary<(string JobName, string WorkItemName), FailedWorkItemConsoleInfo> FailedWorkItemConsoleInfo { get; }
         = new(WorkItemOutcomeKeyComparer.Instance);
Comment on lines 147 to +155
             + " Work items: {TotalWorkItems} submitted / {ResubmittedWorkItems} resubmitted / {FailedWorkItems} failed",
         Environment.NewLine,
         totalAssociatedJobCount,
         _state.ResubmittedJobCount,
         _state.ProcessedJobCount,
         Environment.NewLine,
    -    _state.WorkItemOutcomes.Count,
    +    _state.ProcessedWorkItemCount,
         _state.ResubmittedWorkItemCount,
    -    _state.FailedWorkItemCount);
    +    _state.FailedProcessedWorkItemCount);
The Helix Job Monitor final summary could diverge from the periodic status line by reporting deduplicated logical outcomes instead of actual processed work items, and it could underreport failures when same-named work items failed in different jobs. This change aligns final summary counts with processed work and preserves per-job failure visibility in final failure reporting.
Summary counting source
Failure aggregation behavior
Regression coverage
Fixes #16988