feat: surface MCPG backend failures in pipeline logs #254

@jamesadevine

Problem

MCPG backend failures are silent in the pipeline. When a stdio MCP backend fails to start (e.g., an npx timeout caused by a blocked network), the pipeline reports success even though the agent cannot see that backend's MCP tools. The only evidence is in the MCPG stderr logs, which until recently were not published as artifacts.

What happened

  1. MCPG starts successfully — /health returns 200
  2. MCPG registers routes for all configured backends (azure-devops, safeoutputs)
  3. Agent runs, tries to call an azure-devops tool
  4. MCPG lazily launches docker run ... node:20-slim npx -y @azure-devops/mcp ...
  5. npx can't reach npm registry (AWF iptables blocks the container)
  6. MCPG times out after 30s, returns error to agent
  7. Agent silently falls back to working without the tool
  8. Pipeline step "succeeds" — no one knows the MCP was broken

Root cause

MCPG backends are lazily launched on first tool call. The /health endpoint at startup shows all servers as "stopped" (not yet started), which is normal. The actual failure only occurs when the agent calls a tool, and it's not surfaced in the pipeline log.

Proposed solutions

1. Post-startup backend warm-up step (ado-aw — recommended)

After MCPG health check passes, add a pipeline step that eagerly probes each configured MCP backend via MCPG to force lazy initialization:

# Probe each backend to force lazy launch and detect failures early
for server in $(jq -r '.mcpServers | keys[]' "$GATEWAY_OUTPUT"); do
  echo "Probing MCP backend: $server"
  # -S keeps curl's error message (otherwise suppressed by -s) so it lands in RESPONSE
  RESPONSE=$(curl -sSf -X POST \
    -H "Authorization: Bearer $MCP_GATEWAY_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
    "http://localhost:80/mcp/$server" 2>&1) || {
    echo "##vso[task.logissue type=warning]MCP backend '$server' failed to initialize"
    echo "Response: $RESPONSE"
  }
done

This catches failures before the agent runs, making them visible in pipeline logs with an ADO warning annotation.
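
The probe loop above prints one warning per failing backend but still leaves the step green. A minimal sketch of one way to make the outcome harder to miss is to collect the failed backend names and mark the step as partially succeeded. The helper name `report_probe_failures` is hypothetical and not part of ado-aw; `task.logissue` and `task.complete` are standard Azure Pipelines logging commands.

```shell
# Hypothetical helper: summarize warm-up probe results for the pipeline log.
# Arguments: names of the backends whose probe failed (none means all healthy).
report_probe_failures() {
  if [ "$#" -eq 0 ]; then
    echo "All MCP backends initialized"
    return 0
  fi
  local server
  for server in "$@"; do
    echo "##vso[task.logissue type=warning]MCP backend '$server' failed to initialize"
  done
  # Mark the step as partially succeeded so the run is flagged in the pipeline summary.
  echo "##vso[task.complete result=SucceededWithIssues;]$# MCP backend(s) unavailable"
}
```

In the probe loop, the failure branch could append `$server` to a list that is passed to this helper after the loop finishes. Whether a broken backend should warn or fail the step outright is a policy choice this sketch leaves open.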

2. Post-agent MCPG health check (ado-aw)

After the agent completes, query MCPG /health and check per-server status. MCPG's health endpoint reports per-server state including "error" with lastError messages:

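# Query per-server state after the agent has run, once lazy launches have been attempted
HEALTH=$(curl -sSf "http://localhost:80/health") || {
  echo "##vso[task.logissue type=warning]MCPG /health request failed"
  HEALTH='{}'
}
echo "$HEALTH" | jq .
# Emit one warning per failed backend; a multi-line ##vso command would be cut at the first newline
echo "$HEALTH" | jq -r '.servers // {} | to_entries[] | select(.value.status == "error") | "\(.key): \(.value.lastError // "no lastError reported")"' |
while IFS= read -r failure; do
  echo "##vso[task.logissue type=warning]MCPG backend failed: $failure"
done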

3. MCPG eager launch mode (upstream — gh-aw-mcpg)

Request an upstream feature: a config flag (e.g., gateway.eagerLaunch: true) that makes MCPG launch all stdio backends at startup time instead of lazily. This would surface failures immediately in the health check, before the agent runs. Currently MCPG's Launcher.GetOrLaunch() is purely lazy — backends are only started when the first tool call arrives.

4. Compile-time ecosystem validation (ado-aw)

At compile time, warn when a containerized MCP uses npx but the corresponding ecosystem (node) isn't in the network allowlist. This is a static analysis check that catches the misconfiguration before it reaches the pipeline:

Warning: MCP server 'azure-devops' uses container 'node:20-slim' with entrypoint 'npx',
but 'node' ecosystem is not in network.allowed. npx requires npm registry access.
Consider adding 'node' to network.allowed or using a pre-built container image.
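
The rule behind this warning can be sketched in a few lines of shell. This only illustrates the check's logic and is not how the ado-aw compiler implements it. The function name `check_npx_ecosystem` and its argument layout (server name, image, entrypoint, space-separated `network.allowed` entries) are assumptions made for the example.

```shell
# Hypothetical helper: warn when an npx-based MCP container lacks npm registry access.
# Arguments: server name, container image, entrypoint, space-separated network.allowed entries.
check_npx_ecosystem() {
  local server=$1 image=$2 entrypoint=$3 allowed=$4 eco
  # Only npx entrypoints need to reach the npm registry at launch time.
  [ "$entrypoint" = "npx" ] || return 0
  for eco in $allowed; do
    # The 'node' ecosystem is allowed, so npx can fetch packages.
    [ "$eco" = "node" ] && return 0
  done
  echo "Warning: MCP server '$server' uses container '$image' with entrypoint 'npx',"
  echo "but 'node' ecosystem is not in network.allowed. npx requires npm registry access."
  echo "Consider adding 'node' to network.allowed or using a pre-built container image."
}
```

For example, `check_npx_ecosystem azure-devops node:20-slim npx "python dotnet"` prints the warning, while the same call with `"python node"` prints nothing.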

Recommendation

Implement solution 1 (warm-up probe) as the primary fix: it's the most reliable option and doesn't require upstream changes. Solution 4 (compile-time check) is a useful defense-in-depth addition. Solution 3 would be ideal long-term but requires coordination with gh-aw-mcpg.
