feat(sre-agent): add FinOps recipe content #2168

Closed
MSBrett wants to merge 23 commits into dev from features/sre-agent-recipe

Conversation

@MSBrett MSBrett commented Jun 3, 2026

Contributor

Replaces part of #2111.

Scope:

  • Adds the FinOps Hub SRE Agent recipe content, expected config, scheduled tasks, tools, skills, knowledge, subagents, and azcapman submodule reference.
  • Includes recipe verification/build support and the SRE Agent deploy unit test.

Review notes:

  • Base: features/sre-kpi-query-catalog.
  • Depends on the KPI query catalog PR because the recipe consumes that catalog.
  • Split plan: memory://projects/finops-toolkit/pr-2111-split-plan.

Verification:

@microsoft-github-policy-service microsoft-github-policy-service Bot added the Needs: Review 👀 PR that is ready to be reviewed label Jun 3, 2026
@MSBrett MSBrett marked this pull request as ready for review June 3, 2026 15:54
@MSBrett MSBrett requested a review from flanakin as a code owner June 3, 2026 15:54

Copilot AI left a comment

Contributor

Pull request overview

Adds the FinOps Hub Azure SRE Agent “recipe” content under src/templates/sre-agent/, including subagent definitions, tool manifests (Kusto + Python), skills/reference material, knowledge files, scheduled tasks, and verification/config artifacts intended to support deployment and ongoing ops workflows.

Changes:

  • Introduces the recipes/finops-hub/ recipe (agents, tools, skills, scheduled tasks, connectors, knowledge) for the FinOps Toolkit SRE Agent.
  • Adds verification/config artifacts (expected-config.json, roles.yaml, agent.json, built-in tool overrides) to validate deployments and expected inventory.
  • Adds template guardrails and repository wiring (AGENTS guidance, upstream pin, git attributes/ignore, submodule reference).

Reviewed changes

Copilot reviewed 91 out of 93 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/templates/sre-agent/recipes/finops-hub/roles.yaml | Declares managed identity RBAC requirements for subscription/target RGs and ADX. |
| src/templates/sre-agent/recipes/finops-hub/README.md | Recipe overview and deploy usage. |
| src/templates/sre-agent/recipes/finops-hub/knowledge/teams-notification-guide.md | Delivery guidance for Teams/Outlook connector tools. |
| src/templates/sre-agent/recipes/finops-hub/knowledge/onboarding-recommendations.md | Onboarding and connector setup recommendations. |
| src/templates/sre-agent/recipes/finops-hub/knowledge/document-index.md | Knowledge source inventory and verification sentinel. |
| src/templates/sre-agent/recipes/finops-hub/knowledge/chart-artifact-verification.md | Non-visual chart artifact verification guidance. |
| src/templates/sre-agent/recipes/finops-hub/expected-config.json | Expected inventory/config used for verification comparisons. |
| src/templates/sre-agent/recipes/finops-hub/connectors.json | Defines the FinOps Hub Kusto connector for the recipe. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/vm-quota-usage.yaml | Python tool: VM quota usage reporting via ARM REST. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/suppress-advisor-recommendations.yaml | Python tool: bulk Advisor suppression creation via Resource Graph + ARM PUT. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/resource-graph-query.yaml | Python tool: run Resource Graph queries across subscriptions. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/deploy-bulk-anomaly-alerts.yaml | Python tool: deploy Cost Mgmt anomaly alerts across MG subscriptions. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/deploy-budget.yaml | Python tool: create/update subscription budgets via ARM REST. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/deploy-anomaly-alert.yaml | Python tool: create/update anomaly scheduled actions per subscription. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/capacity-reservation-groups.yaml | Python tool: CRG inventory and utilization reporting. |
| src/templates/sre-agent/recipes/finops-hub/config/tools/benefit-recommendations.yaml | Python tool: Cost Mgmt benefit recommendations at billing scope. |
| src/templates/sre-agent/recipes/finops-hub/config/subagents/ftk-hubs-agent.yaml | Subagent definition for hub deployment/ops workflows plus tool/hook config. |
| src/templates/sre-agent/recipes/finops-hub/config/subagents/ftk-database-query.yaml | Subagent definition for Kusto/catalog-based analytics plus tool/hook config. |
| src/templates/sre-agent/recipes/finops-hub/config/subagents/finops-practitioner.yaml | Orchestrator subagent definition and delegation rules. |
| src/templates/sre-agent/recipes/finops-hub/config/subagents/chief-financial-officer.yaml | CFO subagent definition (KB-only) for exec framing. |
| src/templates/sre-agent/recipes/finops-hub/config/subagents/azure-capacity-manager.yaml | Capacity subagent definition plus tool/hook config. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/workflows/ftk-hubs-healthCheck.md | Skill reference workflow: hub health checks. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/workflows/ftk-hubs-connect.md | Skill reference workflow: connect/discover hub and persist env. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/understand-finops-hub-context.md | Foundational grounding step for hub context before analysis. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/top-cost-drivers.md | Skill reference for ranking/driver analysis patterns. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/settings-format.md | Skill reference for .ftk/environments.local.md schema. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/finops-hubs.md | Skill knowledge: query catalog usage, constraints, best practices. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/custom-dimension-analysis.md | Skill reference for allocation/tag dimension analysis. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/cost-trend-analysis.md | Skill reference for trend analysis patterns. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/cost-spike-investigation.md | Skill reference for spike root-cause patterns. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/references/cost-anomaly-detection.md | Skill reference for anomaly detection decomposition. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/finops-toolkit/README.md | Skill README describing activation and query catalog usage. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/azure-cost-management/references/Get-BenefitRecommendations.ps1 | Reference script for benefit recommendations via Az REST. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/azure-cost-management/references/azure-macc.md | Cost Mgmt knowledge: MACC tracking and workflows. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/azure-cost-management/references/azure-credits.md | Cost Mgmt knowledge: Azure credits/prepayment workflows. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/azure-cost-management/references/azure-cost-exports.md | Cost Mgmt knowledge: exports config/backfill guidance. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/azure-cost-management/references/azure-budgets.md | Cost Mgmt knowledge: budget creation/notifications/action groups. |
| src/templates/sre-agent/recipes/finops-hub/config/skills/azure-cost-management/README.md | Azure cost management skill README. |
| src/templates/sre-agent/recipes/finops-hub/config/built-in-tools.json | Enables/overrides built-in visualization and log query tools. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/storage-paas-growth-forecast.yaml | Scheduled task: storage/PaaS growth forecast. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/sku-availability-audit.yaml | Scheduled task: SKU availability audit. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/non-compute-quota-audit.yaml | Scheduled task: non-compute quota audit. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/monitoring-scope-validation.yaml | Scheduled task: monitoring scope validation. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/db-quota-audit.yaml | Scheduled task: DB quota audit. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/compute-utilization-trend.yaml | Scheduled task: compute utilization trend. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/capacity-daily-monitor.yaml | Scheduled task: daily capacity monitor. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/budget-coverage-audit.yaml | Scheduled task: budget coverage audit. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/benefit-recommendation-review.yaml | Scheduled task: benefit recommendation executive review. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/alert-coverage-audit.yaml | Scheduled task: anomaly alert coverage audit. |
| src/templates/sre-agent/recipes/finops-hub/automations/scheduled-tasks/advisor-suppression-review.yaml | Scheduled task: Advisor suppression review. |
| src/templates/sre-agent/recipes/finops-hub/agent.json | Default agent settings (access/action mode, provider, toggles). |
| src/templates/sre-agent/AGENTS.md | Guardrails and template inventory for contributors/agents. |
| src/templates/sre-agent/.upstream-pin | Upstream pin metadata for the starter-lab template. |
| src/templates/sre-agent/.gitignore | Ignores build gate artifacts and un-ignores bin/. |
| src/templates/sre-agent/.gitattributes | Forces LF EOL for shell scripts. |
| .gitmodules | Adds azcapman submodule reference under the SRE agent template. |

Comment thread src/templates/sre-agent/recipes/finops-hub/roles.yaml
Comment thread src/templates/sre-agent/recipes/finops-hub/expected-config.json
Comment thread src/templates/sre-agent/recipes/finops-hub/config/subagents/ftk-hubs-agent.yaml Outdated
Comment thread src/templates/sre-agent/AGENTS.md
@MSBrett MSBrett enabled auto-merge (squash) June 4, 2026 13:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@RolandKrummenacher RolandKrummenacher left a comment

Collaborator

Reviewed the SRE Agent recipe across three areas — deployable IaC/RBAC, shell/Python/build tooling, and repo structure/packaging/content. The recipe is well-engineered overall (read-only Kusto AllDatabasesViewer scoped to one cluster, system-assigned identity, secure App Insights via reference(), agent-scoped admin role, hermetic Pester tests, yaml.safe_load, parameterized JMESPath/jq, accurate 37-query count). But there are a few merge-blockers, mostly around insecure-by-default deployment and a fragile submodule. Details inline.

High / blocking

  1. Portal one-click deploy is insecure by default: accessLevel='High' (Contributor on the agent RG + every targetResourceGroups) + actionMode='autonomous' + all 19 shipped scheduled tasks agent_mode: autonomous and enabled → an unattended, write-capable agent with a fleet of self-triggering cron jobs on first click. The CLI path correctly defaults to Low/review; the portal path deliberately escalates both.
  2. The deployment script downloads the recipe package and executes its contents (connectors, Python tools, skills, subagents, autonomous scheduled tasks) with no integrity check, from an overridable recipePackageUri.
  3. Git submodule azcapman is uninitialized, unpinned (no branch), and the capacity skill's files are symlinks into it — the build fails / silently drops the capacity skill without git submodule update --init, and every toolkit consumer/clone/source-zip now inherits the submodule.

Medium

  4. telemetry.sh hardcodes an App Insights ikey but is dead code (never sourced), while the README's documented --no-telemetry / SRE_AGENT_NO_TELEMETRY opt-out is a silent no-op.
  5. yoy-report.yaml hardcodes a July–June fiscal year (same class as the sibling plugin PR #2167).

Also (no inline anchor — binary): docs/deploy/sre-agent/{14.0,latest}/sre-agent-recipe.zip commits a 346 KB build artifact twice (byte-identical). No other docs/deploy path on dev commits a binary zip; binaries don't diff/review and the two copies will drift. Prefer generating it at release, or add a CI check that both copies match the source recipe + mark it binary in .gitattributes.
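A minimal sketch of the "both copies match" half of such a CI check, assuming it is written in Python. The function name `copies_match` is hypothetical and not part of the toolkit. It only confirms that the committed copies are byte-identical to each other; comparing them against a freshly built recipe would need the build step itself.

```python
import filecmp
from pathlib import Path


def copies_match(paths: list[Path]) -> bool:
    """True when every file in `paths` is byte-identical to the first one."""
    first, *rest = paths
    # shallow=False forces a content comparison instead of trusting os.stat() metadata.
    return all(filecmp.cmp(first, other, shallow=False) for other in rest)
```

A CI job could call this with the 14.0 and latest zip paths and fail the build when it returns False.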

(For reference, the sibling PR #2167 carries the plugin versions of much of this skill/agent content; the fiscal-year item there and here should be resolved consistently.)

Comment thread src/templates/sre-agent/main.bicep Outdated

@description('Agent access level.')
@allowed(['Low', 'High'])
param accessLevel string = 'High'

Collaborator

Insecure-by-default portal deployment. The portal entry point defaults accessLevel='High' (line 30) and actionMode='autonomous' (line 34). High grants the agent's identity Contributor (b24988ac) on the agent RG and every targetResourceGroups (infra/modules/resource-group-rbac.bicep), and all 19 shipped scheduled tasks are agent_mode: autonomous and enabled. So a one-click portal deploy yields an unattended, write-capable agent running a fleet of self-triggering cron jobs immediately.

The CLI path (infra/main.bicep:24,28) correctly defaults to Low + review; only this portal wrapper escalates. Recommend defaulting the portal to Low/review too, and/or gating High+autonomous behind an explicit acknowledgement in createUiDefinition.json, and shipping the scheduled tasks disabled (or in review mode) so the operator opts into autonomy.
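The combination flagged above can be expressed as a small lint over the shipped defaults. This is a sketch under assumptions, not the template's own verification logic: `agent_mode` is the key quoted in this review, while `name` and `enabled` are assumed key names and `risky_defaults` is a hypothetical helper.

```python
def risky_defaults(access_level: str, tasks: list[dict]) -> list[str]:
    """List reasons a default configuration would be unattended and write-capable."""
    findings = []
    if access_level == "High":
        findings.append("accessLevel defaults to High")
    for task in tasks:
        # `name` and `enabled` are assumed key names; `agent_mode` appears in the review.
        if task.get("enabled", True) and task.get("agent_mode") == "autonomous":
            findings.append(f"task {task.get('name', '?')} is enabled and autonomous")
    return findings
```

An empty result means neither condition is present in the defaults; a test could assert that for the portal entry point.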


Write-Output "Downloading SRE Agent recipe package: $recipePackageUri"
Invoke-WithRetry -Label 'download recipe package' -Action {
    Invoke-WebRequest -Uri $recipePackageUri -OutFile $zipPath

Collaborator

Recipe package is downloaded and its contents executed with no integrity verification. recipePackageUri (line 122) comes from an env var, is fetched via Invoke-WebRequest (line 139) and expanded with Expand-Archive (line 141), and its extras.json is pushed to the agent (connectors, Python tools, skills, subagents, and autonomous scheduled tasks), which then run with the agent's identity (Contributor, when High). The default is templateLink-relative (trustworthy from the official template), but the parameter is fully overridable and there's no SHA pin/signature. Anyone who can set recipePackageUri (or MITM a non-pinned host) can inject tools/subagents/cron tasks. Recommend pinning an expected SHA-256 and verifying after download, and restricting the URI to https:// on a known-host allowlist. (recipePackageUri is correctly not exposed in createUiDefinition.json, which is good because it limits portal tampering, but ARM/CLI callers can still override it.)
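The two recommended checks can be sketched as follows. This is illustrative Python rather than the PowerShell deployment script, and everything named here is an assumption: `ALLOWED_HOSTS`, `check_uri`, and `verify_sha256` are hypothetical, and the host shown is a placeholder because the thread does not list the trusted hosts.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical allowlist; the real set of trusted hosts is not given in this thread.
ALLOWED_HOSTS = {"example.blob.core.windows.net"}


def check_uri(uri: str) -> None:
    """Reject anything that is not https on an allowlisted host."""
    parsed = urlparse(uri)
    if parsed.scheme != "https":
        raise ValueError(f"refusing non-https URI: {uri}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not on allowlist: {parsed.hostname}")


def verify_sha256(data: bytes, expected_hex: str) -> None:
    """Fail closed: raise unless the digest matches the pinned value."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(f"SHA-256 mismatch: expected {expected_hex}, got {actual}")
```

Both functions raise instead of returning a flag, so a caller that forgets to inspect a result cannot proceed to extraction by accident.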

Comment thread .gitmodules Outdated
@@ -0,0 +1,3 @@
[submodule "src/templates/sre-agent/submodules/azcapman"]

Collaborator

Submodule makes the recipe build non-hermetic and is currently broken. azcapman is pinned only by gitlink SHA (no branch), is uninitialized, and three files under recipes/finops-hub/config/skills/azure-capacity-management/ (SKILL.md, references/docs, references/scripts) are symlinks into it. build-extras.py raises "Expected skill directory missing SKILL.md or has a broken symlink: azure-capacity-management", so Build-SreAgentTemplate.ps1 fails unless git submodule update --init was run first, and no build doc/README states that prerequisite. Adding a submodule also burdens every toolkit clone and any git archive/source zip (empty submodule → broken symlinks). Recommend vendoring the azcapman skill/docs/scripts directly into the template and dropping the submodule + symlinks; if the submodule must stay, pin a branch and add an enforced, documented submodule update --init step in the build.
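A pre-build guard for the dangling-symlink failure described above might look like this. It is a sketch only: `find_broken_symlinks` is a hypothetical helper, not code from build-extras.py, and it assumes a platform where symlinks are supported.

```python
import os
from pathlib import Path


def find_broken_symlinks(root: Path) -> list[Path]:
    """Return every symlink under `root` whose target does not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so it is False for a dangling target.
            if path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)
```

Running it over the skills directory before packaging would surface an uninitialized submodule as an explicit list of paths.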

# source "$(dirname "$0")/telemetry.sh"
# send_telemetry "deploy" "finops-hub" "westus3" "true" "false" "true" "false" "deploy"

_TELEMETRY_IKEY="f10eff7f-b995-4c41-8347-90f0f55d5969"

Collaborator

Two issues. (1) This sender is dead code: telemetry.sh is never sourced/executed by any script, yet it's shipped with a hardcoded App Insights ikey and a raw POST to the ingestion endpoint. Meanwhile the README (line 219) tells users to export SRE_AGENT_NO_TELEMETRY=1 and deploy.sh accepts --no-telemetry, but both are no-ops (deploy.sh just shifts the flag "for compatibility"), so the documented opt-out controls nothing. (2) It diverges from the toolkit's telemetry convention: every other template uses the Bicep enableDefaultTelemetry deployment (ARM-visible, parameter opt-out). Recommend either deleting telemetry.sh (and the misleading README/flag), or wiring it through the standard Bicep telemetry mechanism and actually honoring the opt-out. (Hardcoding an App Insights ingestion key in a public repo is itself acceptable because such keys are write-only; the problems are the dead code, the misleading docs, and the divergence.)

Load your finops-toolkit and azure-cost-management skills. Lead this as the FinOps practitioner. Delegate all FinOps Hub Kusto evidence collection to `ftk-database-query`, delegate capacity-risk evidence to `azure-capacity-manager`, and consult `chief-financial-officer` for fiscal planning, executive framing, and commitment-risk recommendations.


Our fiscal year runs July through June. This task runs on January 5 and July 5. On January 5, compare the completed July-December first-half period against the same period in the prior fiscal year and forecast through June 30. On July 5, compare the just-completed July-June fiscal year against the previous fiscal year and prepare the next fiscal year planning view.

Collaborator

Hardcoded July–June fiscal year baked into the task prompt ("Our fiscal year runs July through June... forecast through June 30"; also lines 87, 92, 145, 165). That's Microsoft's FY, not the customer's — for a calendar-FY org this produces the wrong comparison windows and forecast dates. Same class of issue flagged on the sibling plugin PR #2167's ftk-ytd-report; recommend resolving both consistently (parameterize fiscal-year start/end or clearly mark it as a required customization point).
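To illustrate what parameterizing the fiscal year could mean, here is a small sketch that derives the fiscal-year window from a configurable start month. The function name and the `start_month` parameter are hypothetical; the recipe's actual task format is a prompt, not Python.

```python
from datetime import date


def fiscal_year_bounds(today: date, start_month: int) -> tuple[date, date]:
    """Return (first_day, last_day) of the fiscal year containing `today`."""
    start_year = today.year if today.month >= start_month else today.year - 1
    first = date(start_year, start_month, 1)
    # The last day is the day before the next fiscal year starts.
    next_start = date(start_year + 1, start_month, 1)
    last = date.fromordinal(next_start.toordinal() - 1)
    return first, last
```

For a run on January 5, 2026, a July start (`start_month=7`) gives July 1, 2025 through June 30, 2026, while a calendar fiscal year (`start_month=1`) gives January 1 through December 31, 2026, which is why the hardcoded windows mislead a calendar-FY organization.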

@microsoft-github-policy-service microsoft-github-policy-service Bot added Needs: Attention 👋 Issue or PR needs to be reviewed by the author or it will be closed due to no activity and removed Needs: Review 👀 PR that is ready to be reviewed labels Jun 17, 2026
msbrett and others added 2 commits June 17, 2026 09:28
# Conflicts:
#	docs-mslearn/toolkit/changelog.md
#	src/queries/INDEX.md
#	src/queries/KPI.md
#	src/queries/catalog/ai-cost-by-application.kql
#	src/queries/catalog/ai-daily-trend.kql
#	src/queries/catalog/ai-model-cost-comparison.kql
#	src/queries/catalog/ai-token-usage-breakdown.kql
#	src/queries/catalog/allocation-accuracy-index.kql
#	src/queries/catalog/anomaly-detection-rate.kql
#	src/queries/catalog/anomaly-variance-total.kql
#	src/queries/catalog/commitment-discount-waste.kql
#	src/queries/catalog/commitment-utilization-score.kql
#	src/queries/catalog/compute-cost-per-core.kql
#	src/queries/catalog/compute-spend-commitment-coverage.kql
#	src/queries/catalog/cost-optimization-index.kql
#	src/queries/catalog/cost-per-gb-stored.kql
#	src/queries/catalog/cost-visibility-delay.kql
#	src/queries/catalog/data-update-frequency.kql
#	src/queries/catalog/macc-consumption-vs-commitment.kql
#	src/queries/catalog/percentage-unallocated-costs.kql
#	src/queries/catalog/percentage-untagged-costs.kql
#	src/queries/catalog/storage-tier-distribution.kql
#	src/queries/catalog/tagging-policy-compliance.kql
#	src/queries/finops-hub-database-guide.md
…omizable fiscal year

- Vendor azure-capacity-management skill as real files and drop the azcapman
  git submodule + symlinks so a clean clone builds all 3 skills (Roland H3).
- Default the recipe agent to Low (read-only) access and autonomous reporting,
  and stop granting subscription-wide Reader by default (Roland H1); reports run
  unattended on a least-privilege identity.
- Mark the yoy-report July-June fiscal year as a documented, customizable example
  instead of a silent assumption (Roland M5).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@microsoft-github-policy-service microsoft-github-policy-service Bot added Needs: Review 👀 PR that is ready to be reviewed and removed Needs: Attention 👋 Issue or PR needs to be reviewed by the author or it will be closed due to no activity labels Jun 17, 2026
@MSBrett MSBrett changed the base branch from features/sre-kpi-query-catalog to dev June 17, 2026 17:26
@MSBrett MSBrett marked this pull request as draft June 17, 2026 17:41
auto-merge was automatically disabled June 17, 2026 17:41

Pull request was converted to draft

…artifacts

Collapse the recipe and deploy slices into one coherent template PR (the recipe
is a subdirectory of the template and is packaged into the deploy artifacts, so
they cannot be split into independently-buildable PRs). Brings the deploy-side
security fixes onto the unified template and regenerates the committed artifacts:

- Secure-by-default: portal/CLI accessLevel=Low (read-only) + autonomous reporting;
  subscription Reader off by default (Roland H1).
- Recipe package integrity: SHA-256 (fail-closed) + https host allowlist before
  Expand-Archive; hash now injected into azuredeploy.json at package time (Roland H2).
- Removed dead bin/telemetry.sh + hardcoded App Insights key + no-op opt-out (Roland M4).
- Parameterized deployer principalType for CI/CD OIDC service-principal deploys.
- Regenerated docs/deploy/sre-agent/{14.0,latest} azuredeploy.json + createUiDefinition
  + recipe zip from scrubbed queries and vendored capacity skill: zips are leak-free
  and the integrity hash matches. All 39 SRE deploy tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@MSBrett MSBrett marked this pull request as ready for review June 17, 2026 18:21
MSBrett pushed a commit that referenced this pull request Jun 17, 2026
- Remove broken yaml-to-deck GitHub links (lines 82, 166)
  - Removed PowerPoint template reference (non-existent path)
  - Removed lint.py reference (non-existent path)
  - Kept Power BI theme reference (actual file exists)

- Update ms.date to 06/17/2026 in 31 docs-mslearn files
  (was 06/05/2026, stale by 12 days)

- Genericize brand.md: replace SRE Agent examples with product-neutral patterns
  - Line 113: Changed 'FinOps toolkit SRE Agent' to generic placeholder
  - Line 135: Replaced SRE-specific examples with <Product> placeholders
  - Line 159: Changed anaphora example from 'Azure SRE Agent' to generic 'Azure Data Explorer'
  - Line 184-192: Replaced all SRE Agent examples in page title table with Azure Data Explorer
  - Line 200-204: Updated anaphora short form table (removed SRE Agent, added Power BI)
  - Line 206: Generic 'product vs component' instead of 'agent vs subagent'
  - Line 210: Replaced SRE Agent in pattern list with generic products
  - Line 216: Changed 'subagents' to 'agents' (generic term)
  - Line 124: Changed URL example from SRE Agent to Azure Data Explorer
  - Line 240: Generic placeholder in external contributor guidance

Brand guidance now provides product-neutral patterns reusable across
all FinOps toolkit integrations, not coupled to SRE Agent work (PR #2168).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MSBrett and others added 2 commits June 17, 2026 13:09
…enerated zips

Set ms.date to today for changed docs, change portal default actionMode to review for safer defaults, and remove build-generated recipe zips with a gitignore entry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Align Pester assertions with the safer portal default (actionMode = review) while keeping CLI, recipe, and expected-config defaults at autonomous.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@RolandKrummenacher RolandKrummenacher left a comment

Collaborator

Approving on content — all review findings are resolved:

  • ✅ Portal deploy defaults now accessLevel=Low + actionMode=review (secure by default; test added)
  • ✅ azcapman capacity skill vendored as real files; submodule + .gitmodules removed
  • ✅ Recipe package integrity: SHA256 verification added and properly wired (Package-Toolkit.ps1 injects the real hash at packaging time, fail-closed)
  • ✅ telemetry.sh removed
  • ✅ Generated zips dropped from docs/deploy
  • ✅ Fiscal year documented as a configurable worked-example

Two merge-logistics items to handle before merging (not code issues):

  1. Base is features/sre-kpi-query-catalog, which was squash-merged via #2166 — retarget the base to dev and rebase (use rebase --onto origin/dev origin/features/sre-kpi-query-catalog … to avoid src/queries conflicts from the squash).
  2. This branch still contains the full deploy slice that #2169 also carries; decide whether #2169 is consolidated here or rebased on top of this once landed (it's currently stale/broken vs this branch). Approving the content; please sort the base/stack before merge.

@flanakin flanakin left a comment

Collaborator

I gave up reviewing this. There's too much. There are a few big blockers, more questions, etc.

Comment thread .gitignore
release/scloud-occurrence-report.md

# Generated SRE Agent recipe packages
docs/deploy/sre-agent/*/sre-agent-recipe.zip

Collaborator

nit: Why are these in the deploy folder if they're not getting committed? Shouldn't we clean them up so they don't even land there?

Comment on lines +133 to +137
{
  "source_path_from_root": "/finops/finops/toolkit/hubs/configure-sre.md",
  "redirect_url": "/cloud-computing/finops/toolkit/sre-agent/overview",
  "redirect_document_id": false
},

Collaborator

We don't need a redirect for something that never existed before

Suggested change
{
  "source_path_from_root": "/finops/finops/toolkit/hubs/configure-sre.md",
  "redirect_url": "/cloud-computing/finops/toolkit/sre-agent/overview",
  "redirect_document_id": false
},

Comment thread docs-mslearn/TOC.yml
href: toolkit/hubs/upgrade.md
- name: Compatibility guide
href: toolkit/hubs/compatibility.md
- name: FinOps toolkit SRE Agent

Collaborator

nit: I would put this after Workbooks, based on popularity. I know it's new, but that's the general practice we've always had. I could see it higher than Alerts and AOE, but probably not Workbooks. We can always move it based on usage.

Comment thread docs-mslearn/TOC.yml
href: toolkit/hubs/upgrade.md
- name: Compatibility guide
href: toolkit/hubs/compatibility.md
- name: FinOps toolkit SRE Agent

Collaborator

Following the naming convention: We don't put "FinOps toolkit" as a prefix on everything. Adding "Azure" to hopefully add context to what "SRE" means. Lowercasing "agent" per Microsoft style.

Suggested change
- name: FinOps toolkit SRE Agent
- name: Azure SRE agent

Comment thread docs-mslearn/TOC.yml
href: toolkit/sre-agent/security.md
- name: Troubleshooting
href: toolkit/sre-agent/troubleshooting.md
- name: Template reference

Collaborator

Is this a deployment template? If so, I'd be consistent with hubs:

Suggested change
- name: Template reference
- name: Deployment template

Comment on lines +205 to +207
- [Deploy Azure SRE Agent with the FinOps toolkit](deploy.md)
- [Azure SRE Agent in the FinOps toolkit](overview.md)
- [FinOps hubs](../hubs/finops-hubs-overview.md)

Collaborator

Suggested change
- [Deploy Azure SRE Agent with the FinOps toolkit](deploy.md)
- [Azure SRE Agent in the FinOps toolkit](overview.md)
- [FinOps hubs](../hubs/finops-hubs-overview.md)
- [FinOps hubs](../hubs/finops-hubs-overview.md)

Collaborator

Remove everything added in the docs/deploy folder. Those shouldn't be added now. They aren't part of v14.

Comment on lines -205 to -206
Get-ChildItem $destDir -Force -Recurse -Filter ".DS_Store" | Remove-Item -Force

Collaborator

Why is this removed? We don't want to package it.

Fwiw, we could add this to the -notin list on line 201.

Comment on lines +212 to +228

# Inject the recipe package SHA-256 into azuredeploy.json so the deployment
# script can verify package integrity before extraction. Runs only when a
# template ships both the compiled template and a recipe zip; no-op otherwise.
$deployJson = "$targetDir/azuredeploy.json"
$recipeZip = "$targetDir/sre-agent-recipe.zip"
if ((Test-Path $deployJson) -and (Test-Path $recipeZip))
{
    $deployContent = Get-Content $deployJson -Raw
    if ($deployContent -match 'PLACEHOLDER_RECIPE_PACKAGE_SHA256')
    {
        $recipeSha = (Get-FileHash -Algorithm SHA256 -Path $recipeZip).Hash.ToLower()
        $deployContent = $deployContent -replace 'PLACEHOLDER_RECIPE_PACKAGE_SHA256', $recipeSha
        Set-Content -Path $deployJson -Value $deployContent -NoNewline
        Write-Verbose "  Injected recipe package SHA-256 ($recipeSha) into $deployJson"
    }
}

Collaborator

Do not add SRE-specific code here. This is tech debt. Let's make sure this script can stay generic and we have conventions that help accomplish what's needed. Happy to discuss and brainstorm ideas. This PR is too big for me to see the forest thru the trees right now 😕

- `bin/deploy.sh` — Canonical deployment entry point copied from the Microsoft starter-lab setup flow and updated for no-azd FinOps deployment
- `infra/` — Copied-and-updated Microsoft starter-lab Bicep baseline
- `recipes/finops-hub/` — Recipe content
- `../claude-plugin/output-styles/ftk-output-style.md` — Uploaded as SRE Agent knowledge and referenced by every scheduled task for report formatting

Collaborator

How is this file used? I'm assuming we can't link to other folders.

@microsoft-github-policy-service microsoft-github-policy-service Bot added Needs: Attention 👋 Issue or PR needs to be reviewed by the author or it will be closed due to no activity and removed Needs: Review 👀 PR that is ready to be reviewed labels Jun 18, 2026

@flanakin flanakin left a comment

Collaborator

🤖 [AI][Claude Code] PR Review

Summary: Strong, well-tested PR. All of @RolandKrummenacher's prior blockers are verified resolved in the tree (secure-by-default Low/review, SHA256 fail-closed integrity verification with https + host allowlist, vendored capacity skill with no submodule/symlinks, telemetry.sh removed, no committed zips, fiscal year documented as a worked example, RBAC includes Monitoring Reader). The 39-test Pester suite locks the secure defaults and integrity checks in place, and the "50 tools" claim (37 Kusto + 13 Python) checks out. Only 2 minor suggestions below — no blockers or should-fix items.

💡 Suggestions (2)

  1. Changelog "Changed" entry states the action mode was set to Autonomous, which contradicts the now-default review mode in the deployment templates.
  2. createUiDefinition.json lets a user escalate to High + autonomous with only a tooltip — an InfoBox warning would reinforce the opt-in.

- Added 6 knowledge documents for agent onboarding, artifact verification, Teams notification patterns, known issues, document index guidance, and FinOps Toolkit output style.
- Added FinOps toolkit SRE Agent documentation pages for Microsoft Learn.
- **Changed**
- Set agent action mode to Autonomous so scheduled tasks can deliver reports without human approval.

Collaborator

🤖 [AI][Claude Code] 💡 Suggestion

This line says the agent action mode was "set to Autonomous," but the secure-by-default resolution changed the shipped defaults: main.bicep (portal) defaults actionMode to review, and createUiDefinition.json defaults to review as well. A reader scanning the changelog would conclude autonomous is the out-of-box behavior, which it no longer is for the portal path.

Consider rewording to reflect that the agent defaults to review mode, with autonomous available as an opt-in for scheduled task delivery — e.g.:

Suggested change
- Set agent action mode to Autonomous so scheduled tasks can deliver reports without human approval.
- Defaulted the deployment to review mode (read-only `Low` access); autonomous mode is an opt-in for unattended scheduled task delivery.

"visible": true
},
{
"name": "actionMode",

Collaborator

🤖 [AI][Claude Code] 💡 Suggestion

The defaults here are now safe (accessLevel=Low, actionMode=review), which resolves the earlier blocker. As a follow-on hardening, selecting High (Contributor on target resource groups) and/or autonomous is currently conveyed only via the tooltip. Adding a Microsoft.Common.InfoBox (style Warning) that appears when the operator escalates would make the security trade-off explicit at the point of choice and reinforce the opt-in, matching @RolandKrummenacher's original suggestion to gate High+autonomous behind an explicit acknowledgement. Optional — current defaults already protect the one-click path.
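
For illustration, a minimal sketch of what such a warning element might look like. The step name `agentSettings`, the element name `escalationWarning`, the `accessLevel` element name, the option values `High` and `autonomous`, and the message text are all assumptions for the example, not taken from the PR's createUiDefinition.json. The sketch uses the `icon` option of Microsoft.Common.InfoBox; confirm the option name against the current element reference before relying on it.

```json
{
  "name": "escalationWarning",
  "type": "Microsoft.Common.InfoBox",
  "visible": "[or(equals(steps('agentSettings').accessLevel, 'High'), equals(steps('agentSettings').actionMode, 'autonomous'))]",
  "options": {
    "icon": "Warning",
    "text": "High access grants the agent Contributor on the target resource groups, and autonomous mode lets it act without human approval. Keep the defaults (Low access, review mode) unless you need unattended scheduled task delivery."
  }
}
```

Because `visible` is driven by the two selections, the box would stay hidden on the default Low/review path and appear only when the operator escalates either setting.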

@MSBrett

MSBrett commented Jun 21, 2026

Contributor Author

Closing this PR. The SRE agent content here grew too large and is chasing a moving target. We'll revisit with a smaller, content-only approach aligned to the Azure SRE Agent plugin model.

@MSBrett MSBrett closed this Jun 21, 2026
@microsoft-github-policy-service microsoft-github-policy-service Bot added Needs: Review 👀 PR that is ready to be reviewed and removed Needs: Attention 👋 Issue or PR needs to be reviewed by the author or it will be closed due to no activity labels Jun 21, 2026
Labels

Needs: Review 👀 PR that is ready to be reviewed

4 participants