feat: surface MCPG backend failures in pipeline logs

## Problem

MCPG backend failures are **completely silent** in the pipeline. When a stdio MCP backend fails to start (e.g., `npx` timeout due to blocked network), the pipeline reports success but the agent simply can't see its MCP tools. The only evidence is buried in MCPG stderr logs which (until recently) weren't even published as artifacts.

### What happened

1. MCPG starts successfully — `/health` returns 200
2. MCPG registers routes for all configured backends (azure-devops, safeoutputs)
3. Agent runs, tries to call an azure-devops tool
4. MCPG lazily launches `docker run ... node:20-slim npx -y @azure-devops/mcp ...`
5. `npx` can't reach npm registry (AWF iptables blocks the container)
6. MCPG times out after 30s, returns error to agent
7. Agent silently falls back to working without the tool
8. Pipeline step "succeeds" — no one knows the MCP was broken

### Root cause

MCPG backends are **lazily launched** on first tool call. The `/health` endpoint at startup shows all servers as "stopped" (not yet started), which is normal. The actual failure only occurs when the agent calls a tool, and it's not surfaced in the pipeline log.

## Proposed solutions

### 1. Post-startup backend warm-up step (ado-aw — recommended)

After MCPG health check passes, add a pipeline step that eagerly probes each configured MCP backend via MCPG to force lazy initialization:

```bash
# Probe each backend to force lazy launch and detect failures early
for server in $(jq -r '.mcpServers | keys[]' "$GATEWAY_OUTPUT"); do
  echo "Probing MCP backend: $server"
  RESPONSE=$(curl -sf -X POST \
    -H "Authorization: Bearer $MCP_GATEWAY_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
    "http://localhost:80/mcp/$server" 2>&1) || {
    echo "##vso[task.logissue type=warning]MCP backend '$server' failed to initialize"
    echo "Response: $RESPONSE"
  }
done
```

This catches failures **before** the agent runs, making them visible in pipeline logs with an ADO warning annotation.

### 2. Post-agent MCPG health check (ado-aw)

After the agent completes, query MCPG `/health` and check per-server status. MCPG's health endpoint reports per-server state including `"error"` with `lastError` messages:

```bash
HEALTH=$(curl -sf "http://localhost:80/health")
echo "$HEALTH" | jq .
UNHEALTHY=$(echo "$HEALTH" | jq -r '.servers | to_entries[] | select(.value.status == "error") | .key')
if [ -n "$UNHEALTHY" ]; then
  echo "##vso[task.logissue type=warning]MCPG backends failed: $UNHEALTHY"
fi
```

### 3. MCPG eager launch mode (upstream — gh-aw-mcpg)

Request an upstream feature: a config flag (e.g., `gateway.eagerLaunch: true`) that makes MCPG launch all stdio backends at startup time instead of lazily. This would surface failures immediately in the health check, before the agent runs. Currently MCPG's `Launcher.GetOrLaunch()` is purely lazy — backends are only started when the first tool call arrives.

### 4. Compile-time ecosystem validation (ado-aw)

At compile time, warn when a containerized MCP uses `npx` but the corresponding ecosystem (node) isn't in the network allowlist. This is a static analysis check that catches the misconfiguration before it reaches the pipeline:

```
Warning: MCP server 'azure-devops' uses container 'node:20-slim' with entrypoint 'npx',
but 'node' ecosystem is not in network.allowed. npx requires npm registry access.
Consider adding 'node' to network.allowed or using a pre-built container image.
```

## Recommendation

Implement **#1** (warm-up probe) as the primary fix — it's the most reliable and doesn't require upstream changes. **#4** (compile-time check) is a nice defense-in-depth addition. **#3** would be ideal long-term but requires coordination with gh-aw-mcpg.

## Context

- The immediate trigger (missing node ecosystem domains) was fixed in PR #252 by adding `node` ecosystem to ADO MCP extension's `required_hosts()`
- MCPG logs are now published as artifacts (also in PR #252)
- MCPG stderr is now streamed live via `tee` (also in PR #252)
- MCPG health endpoint (`/health`) already supports per-server status: `running`, `stopped`, `error` with `lastError` field


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: surface MCPG backend failures in pipeline logs #254

Problem

What happened

Root cause

Proposed solutions

1. Post-startup backend warm-up step (ado-aw — recommended)

2. Post-agent MCPG health check (ado-aw)

3. MCPG eager launch mode (upstream — gh-aw-mcpg)

4. Compile-time ecosystem validation (ado-aw)

Recommendation

Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat: surface MCPG backend failures in pipeline logs #254

Description

Problem

What happened

Root cause

Proposed solutions

1. Post-startup backend warm-up step (ado-aw — recommended)

2. Post-agent MCPG health check (ado-aw)

3. MCPG eager launch mode (upstream — gh-aw-mcpg)

4. Compile-time ecosystem validation (ado-aw)

Recommendation

Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions