
Correct Helix Job Monitor final-summary work-item accounting #16989

Open
Copilot wants to merge 3 commits into main from copilot/fix-final-summary-report-count

Conversation

Copilot AI commented Jun 8, 2026


The Helix Job Monitor final summary could diverge from the periodic status line by reporting deduplicated logical outcomes instead of actual processed work items, and it could underreport failures when same-named work items failed in different jobs. This change aligns final summary counts with processed work and preserves per-job failure visibility in final failure reporting.

  • Summary counting source

    • Final summary work-item totals now come from processed-job reconciliation counters, not deduplicated outcome keys.
    • Added explicit state for processed and failed processed work-item counts.
  • Failure aggregation behavior

    • Final failed-work-item console aggregation is keyed per Helix job + work item, so same-named failures across different jobs are both retained and reported.
  • Regression coverage

    • Added a focused test that reproduces two jobs failing the same work-item name and verifies:
      • status/final summary counts stay consistent,
      • both failures appear in the final failed-work-item block.
// StatusReporter.LogFinalSummary
"   Work items: {TotalWorkItems} submitted / {ResubmittedWorkItems} resubmitted / {FailedWorkItems} failed",
_state.ProcessedWorkItemCount,
_state.ResubmittedWorkItemCount,
_state.FailedProcessedWorkItemCount
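
To illustrate why the two counting sources can disagree, here is a minimal, self-contained sketch. It does not use the monitor's real types: `Observation`, `Counting`, `CountDeduplicated`, and `CountProcessed` are hypothetical names, and the job and work-item names are made up. It also assumes the deduplicated view is keyed by work-item name alone, which is a simplification of whatever key the real `WorkItemOutcomes` uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Two jobs each run (and fail) a work item with the same name.
var observations = new List<Observation>
{
    new("job-a", "Tests.dll", IsFailed: true),
    new("job-b", "Tests.dll", IsFailed: true),
    new("job-b", "Other.dll", IsFailed: false),
};

Console.WriteLine($"deduplicated: {Counting.CountDeduplicated(observations)}");
Console.WriteLine($"processed:    {Counting.CountProcessed(observations)}");

// Hypothetical record of one work item observed in one completed job.
public record Observation(string JobName, string WorkItemName, bool IsFailed);

public static class Counting
{
    // Deduplicated view: a later observation overwrites an earlier one with the same name.
    public static (int Total, int Failed) CountDeduplicated(IEnumerable<Observation> items)
    {
        var outcomes = new Dictionary<string, bool>(StringComparer.Ordinal);
        foreach (var item in items)
        {
            outcomes[item.WorkItemName] = item.IsFailed;
        }
        return (outcomes.Count, outcomes.Values.Count(isFailed => isFailed));
    }

    // Processed view: every observed work item is counted once per job.
    public static (int Total, int Failed) CountProcessed(IEnumerable<Observation> items)
    {
        int total = 0, failedCount = 0;
        foreach (var item in items)
        {
            total++;
            if (item.IsFailed) failedCount++;
        }
        return (total, failedCount);
    }
}
```

With this input the deduplicated view reports 2 work items with 1 failure, while the processed view reports 3 work items with 2 failures, which is the kind of mismatch between the final summary and the periodic status line described above.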

Fixes #16988

Copilot AI requested review from Copilot and removed request for Copilot June 8, 2026 14:04
Copilot AI linked an issue Jun 8, 2026 that may be closed by this pull request
Copilot AI requested review from Copilot and removed request for Copilot June 8, 2026 14:10
Copilot AI requested review from Copilot and removed request for Copilot June 8, 2026 14:14
Copilot AI changed the title from "[WIP] Fix work item count mismatch in Helix Job Monitor final summary" to "Correct Helix Job Monitor final-summary work-item accounting" Jun 8, 2026
Copilot AI requested a review from premun June 8, 2026 14:15
Copilot finished work on behalf of premun June 8, 2026 14:15
@premun premun marked this pull request as ready for review June 8, 2026 14:16
Copilot AI review requested due to automatic review settings June 8, 2026 14:16

Copilot AI left a comment


Pull request overview

This PR updates Helix Job Monitor’s end-of-run reporting so the final summary work-item counts align with the monitor’s processed-job reconciliation (rather than deduplicated logical outcome keys), and so the final aggregated failed-work-item block retains failures for same-named work items across different Helix jobs.

Changes:

  • Final summary work-item totals now use new processed counters (ProcessedWorkItemCount, FailedProcessedWorkItemCount) instead of deduped WorkItemOutcomes.Count.
  • Failed-work-item console aggregation is now keyed per Helix job + work item to avoid collapsing same-named failures across jobs.
  • Adds a regression test covering two same-queue Helix jobs failing the same work item name and asserting both failures appear in the final failure block and counts.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs | Adds regression coverage for two same-queue jobs failing the same work-item name and validates final reporting. |
| src/Microsoft.DotNet.Helix/JobMonitor/StatusReporter.cs | Switches final summary work-item counts to use processed reconciliation counters. |
| src/Microsoft.DotNet.Helix/JobMonitor/MonitorState.cs | Adds processed work-item counters and changes failed-console info keying to per-job granularity. |
| src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs | Increments new processed work-item counters during completed-job reconciliation and updates failure tracking call-site. |

Comments suppressed due to low confidence (1)

src/Microsoft.DotNet.Helix/JobMonitor/MonitorState.cs:220

  • Keying failed-console info by (HelixJobName, WorkItemName) prevents a later retry (new Helix job name) from clearing a failure recorded for the same logical work item in the same submitter chain. That means a resubmitted work item that eventually passes can still show up in the final Failed work item console logs block (and be logged as an AzDO error), contradicting the method/doc intent (“Removal happens when a later incarnation passes”). The failure aggregation needs to preserve per-job visibility and still prune superseded failures when a newer attempt in the same chain passes.
        /// <summary>
        /// Tracks (or removes) the per-failure console-info record for a single observed
        /// work item. Removal happens when a later incarnation passes.
        /// </summary>
        public void TrackFailedWorkItemConsoleInfo(HelixJobInfo helixJob, WorkItemSummary workItem)
        {
            var key = (helixJob.JobName, workItem.Name);
            if (workItem.IsFailed)
            {
                FailedWorkItemConsoleInfo[key] = new FailedWorkItemConsoleInfo(
                    helixJob.DisplayName,
                    workItem.Name,
                    workItem.FormattedState,
                    GetConsoleOutputText(workItem.ConsoleOutputUri));
            }
            else
            {
                FailedWorkItemConsoleInfo.Remove(key);
            }
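
One possible way to satisfy both requirements in the comment above (keep per-job visibility, yet prune failures superseded by a passing retry) is sketched below. This is an assumption-laden illustration, not the PR's code: `FailureTracker`, its `Track` signature, and the idea of storing the chain key as the dictionary value are all hypothetical, and the sketch uses the default tuple comparer rather than `WorkItemOutcomeKeyComparer`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tracker = new FailureTracker();
tracker.Track(chainKey: "chain-1", jobName: "job-a", workItemName: "Tests.dll", isFailed: true);
tracker.Track(chainKey: "chain-2", jobName: "job-b", workItemName: "Tests.dll", isFailed: true);
// A retry in chain-1 runs under a new Helix job name and passes.
tracker.Track(chainKey: "chain-1", jobName: "job-a-retry", workItemName: "Tests.dll", isFailed: false);

foreach (var (jobName, workItemName) in tracker.Failures.Keys)
{
    Console.WriteLine($"{jobName}: {workItemName}");
}

// Hypothetical tracker; names and shape are illustrative only.
public sealed class FailureTracker
{
    // Keyed per job so same-named failures in different jobs are both kept.
    // The value holds the chain key so a later pass can prune that chain's failures.
    public Dictionary<(string JobName, string WorkItemName), string> Failures { get; } = new();

    public void Track(string chainKey, string jobName, string workItemName, bool isFailed)
    {
        if (isFailed)
        {
            Failures[(jobName, workItemName)] = chainKey;
            return;
        }

        // A pass supersedes earlier failures of this work item in the same chain only.
        var superseded = Failures
            .Where(entry => entry.Value == chainKey && entry.Key.WorkItemName == workItemName)
            .Select(entry => entry.Key)
            .ToList();

        foreach (var key in superseded)
        {
            Failures.Remove(key);
        }
    }
}
```

In this run only the `job-b` failure remains: the passing retry clears the `chain-1` failure even though it ran under a different job name, while the unrelated `chain-2` failure of the same work-item name is preserved.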

Comment on lines +59 to 65
/// Latest known console-link information for every failed work item attempt, keyed by
/// (JobName, WorkItemName) where <c>JobName</c> is the Helix job name. Used to build
/// the final aggregated failure report
/// without collapsing failures across different jobs.
/// </summary>
- public Dictionary<(string ChainKey, string WorkItemName), FailedWorkItemConsoleInfo> FailedWorkItemConsoleInfo { get; }
+ public Dictionary<(string JobName, string WorkItemName), FailedWorkItemConsoleInfo> FailedWorkItemConsoleInfo { get; }
= new(WorkItemOutcomeKeyComparer.Instance);
Comment on lines 147 to +155
+ " Work items: {TotalWorkItems} submitted / {ResubmittedWorkItems} resubmitted / {FailedWorkItems} failed",
Environment.NewLine,
totalAssociatedJobCount,
_state.ResubmittedJobCount,
_state.ProcessedJobCount,
Environment.NewLine,
- _state.WorkItemOutcomes.Count,
+ _state.ProcessedWorkItemCount,
_state.ResubmittedWorkItemCount,
- _state.FailedWorkItemCount);
+ _state.FailedProcessedWorkItemCount);

Development

Successfully merging this pull request may close these issues.

Helix Job Monitor final summary reports incorrect work item counts
