[aw-failures] [aw] Failure Investigator (6h) - Issue Group #26930

@github-actions

[aw] Failure Investigator (6h)

Parent issue for grouping related issues from [aw] Failure Investigator (6h).

Sub-issues are automatically linked below (max 64 per parent).

Workflow: [aw] Failure Investigator (6h)

  • expires on Apr 25, 2026, 7:18 PM UTC


[aw-fi] 6h Analysis: 2026-04-18 01:10 UTC

Executive Summary

6 failures across 29 runs (79% success) in the window Apr 17 19:10 – Apr 18 01:10 UTC. Three distinct failure clusters: codex 401 auth (2 runs), Copilot CLI 15-min timeout on closed PR (2 runs), and Copilot shell permission denied for safeoutputs (2 runs). One tracking issue closed (Auto-Triage Issues, now fixed). One new sub-issue created for the previously undiagnosed shell permission pattern.
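The headline numbers can be cross-checked with a few lines of arithmetic. This is only a sketch: the dictionary keys are illustrative labels, not identifiers used by the investigator workflow.

```python
# Counts taken from the summary above; the key names are hypothetical labels.
total_runs = 29
cluster_failures = {
    "codex_401_auth": 2,
    "copilot_cli_timeout_closed_pr": 2,
    "copilot_shell_permission_denied": 2,
}

failed_runs = sum(cluster_failures.values())
success_rate = (total_runs - failed_runs) / total_runs

print(f"{failed_runs} failures across {total_runs} runs ({success_rate:.0%} success)")
```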

Failure Clusters

| Cluster | Runs | Engine | Existing Issue | Priority |
| --- | --- | --- | --- | --- |
| Codex 401 auth (OPENAI_API_KEY invalid) | §24590101486, §24591527202 | codex | #26929, #26911, #26958 | P0 |
| Copilot CLI 15-min timeout (closed PR branch) | §24590162684, §24591525620 | copilot | #26909, #26931 | P1 |
| Copilot shell permission denied for safeoutputs noop | §24590461053, §24591062937 | copilot | #26955, #26964 | P1 |

Evidence Highlights

Cluster 1: Codex 401 (AI Moderator + Daily Observability Report)
ERROR: Reconnecting... 5/5
ERROR: unexpected status 401 Unauthorized: Missing bearer or basic authentication in header,
  url: (api.openai.com/redacted),
  cf-ray: 9edf09437f8dced7-SJC

Root cause: OPENAI_API_KEY secret is invalid or expired. Both workflows hit api.openai.com/v1/responses, exhaust 5 reconnect retries, and abort. Firewall allows api.openai.com:443 — the key itself is rejected.
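A preflight check along these lines could tell a rejected key apart from a network problem before the agent spends its five reconnect attempts. It is a minimal sketch, not part of the workflow: `check_openai_key` and its return labels are hypothetical, and it assumes the public `GET /v1/models` endpoint is an acceptable probe. Because the logged error mentions a missing bearer header, the sketch also reports an empty secret as its own case; the report does not establish whether that is what happened here.

```python
import urllib.error
import urllib.request

MODELS_URL = "https://api.openai.com/v1/models"


def check_openai_key(api_key, opener=urllib.request.urlopen):
    """Return 'ok', 'missing', 'rejected', 'unreachable', or 'unexpected:<code>'.

    `opener` is injectable so the function can be exercised without network access.
    """
    if not api_key:
        # Secret is unset or empty, so no usable bearer token would be sent.
        return "missing"
    request = urllib.request.Request(
        MODELS_URL, headers={"Authorization": f"Bearer {api_key}"}
    )
    try:
        with opener(request, timeout=10) as response:
            return "ok" if response.status == 200 else f"unexpected:{response.status}"
    except urllib.error.HTTPError as err:
        # HTTPError subclasses URLError, so it has to be handled first.
        return "rejected" if err.code == 401 else f"unexpected:{err.code}"
    except urllib.error.URLError:
        # DNS, TLS, or firewall problems rather than a credential problem.
        return "unreachable"
```

Calling `check_openai_key(os.environ.get("OPENAI_API_KEY", ""))` in an early step would let the workflow fail fast with a clear reason instead of retrying.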

Cluster 2: Copilot CLI Timeout (Test Quality Sentinel × 2)

Both failures were PR-triggered. PR #26945 (copilot/resolve-mcpserverconfig-naming-conflicts) was merged/closed before the workflow ran. The checkout step correctly detected the closed PR (ℹ️ PR #26945 is now closed — treating checkout failure as expected), but the Copilot CLI then ran for the full 15-minute limit and timed out:

##[error]The action 'Execute GitHub Copilot CLI' has timed out after 15 minutes.
Set output 'agentic_engine_timeout'

Note: Test Quality Sentinel (TQS) succeeded on 6 other runs in the same window, so this failure is specific to closed-PR branch checkout scenarios.
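One way to avoid burning the full timeout is to decide, right after checkout, whether the agent step should run at all. The sketch below shows that decision as a pure function. `Decision` and `decide_agent_step` are hypothetical names, and the actual gh-aw checkout step may expose this state differently.

```python
from enum import Enum


class Decision(Enum):
    RUN = "run"    # checkout worked, proceed to the agent
    SKIP = "skip"  # PR is closed, exit cleanly without invoking the agent
    FAIL = "fail"  # checkout failed for an open PR, surface a real error


def decide_agent_step(checkout_ok: bool, pr_state: str) -> Decision:
    """Map the checkout result and PR state to what the workflow should do next."""
    if checkout_ok:
        return Decision.RUN
    if pr_state.lower() == "closed":
        # The expected-failure case described above: nothing left to analyze.
        return Decision.SKIP
    return Decision.FAIL
```

With a guard like this, the two runs for PR #26945 would map to `Decision.SKIP` rather than reaching the 15-minute limit.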

Cluster 3: Copilot Shell Permission Denied (Daily Safe Output Integrator + Daily Project Performance)

The agent completed its analysis correctly (all 41 safe-output types confirmed covered) and tried to call safeoutputs noop via bash — but the Copilot CLI blocked every shell invocation:

✗ safeoutputs noop --message "..."
  └ Permission denied and could not request permission from user

✗ /home/runner/work/_temp/gh-aw/mcp-cli/bin/safeoutputs noop --message "..."
  └ Permission denied and could not request permission from user

After 8+ failed attempts (shell, node bridge, MCP HTTP, Python HTTP), the Copilot CLI timed out (20–32 min). This is distinct from #26931 (MCP server connections blocked by org policy) — MCP connections were healthy, only the shell tool invocation was blocked.
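The three clusters can be separated by the log signatures quoted above. The following is a rough sketch of such a classifier, not the investigator's actual logic. The cluster labels are hypothetical, and the ordering is the important design choice: Cluster 3 runs also end in a timeout, so the permission-denied signature has to be checked before the timeout signature.

```python
# (label, substring) pairs; checked in order, first match wins.
SIGNATURES = [
    ("codex_401_auth", "401 Unauthorized"),
    (
        "copilot_shell_permission_denied",
        "Permission denied and could not request permission from user",
    ),
    ("copilot_cli_timeout", "has timed out after"),
]


def classify_failure(log_text: str) -> str:
    """Return the first matching cluster label, or 'unclassified'."""
    for cluster, needle in SIGNATURES:
        if needle in log_text:
            return cluster
    return "unclassified"
```

A run whose log contains both a permission-denied line and a timeout line is reported as `copilot_shell_permission_denied`, which matches how the report separates Cluster 3 from Cluster 2.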

Existing Issue Correlation

Sub-Issues Created

Proposed Fix Roadmap

| Priority | Action | Owner |
| --- | --- | --- |
| P0 | Rotate OPENAI_API_KEY (codex engine fails 100% until fixed) | Admin |
| P1 | Instruct Copilot agent to use MCP noop tool, not bash CLI | Workflow author |
| P1 | Handle Copilot CLI graceful exit when PR branch deleted | AWF team |

References:

Generated by [aw] Failure Investigator (6h) · ● 590.5K
