feat(mcp): rewrite dashboard authoring prompts; expose filters on save tool#2264
feat(mcp): rewrite dashboard authoring prompts; expose filters on save tool#2264alex-fedotyev wants to merge 9 commits into
Conversation
🦋 Changeset detectedLatest commit: e409a29 The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
🟡 Tier 3 — StandardIntroduces new logic, modifies core functionality, or touches areas with non-trivial risk. Why this tier:
Review process: Full human review — logic, architecture, edge cases. Stats
|
PR Review✅ No critical issues found. Nice surface coverage: the create/update filter-id normalization ( Minor (non-blocking) observations:
|
E2E Test Results✅ All tests passed • 178 passed • 3 skipped • 1268s
Tests ran across 4 shards in parallel. |
Deep Review🟡 P2 -- recommended
🔵 P3 nitpicks (8)
Reviewers (5): correctness, testing, maintainability, kieran-typescript, api-contract. Testing gaps:
|
…ilters on save tool
The `create_dashboard` prompt now leads with a ten-rule design checklist
(RED columns with aliases, per-series numberFormat for durations,
groupByColumnsOnLeft for inventory tables, dashboard-level filters
instead of per-tile where literals, one-metric-per-tile for metric
sources, containers and tabs for grouping). The wall-of-JSON canonical
example is gone; the four dashboard_examples patterns carry the
concrete shapes.
The dashboard_examples set is replaced with four verified patterns
(service_inventory, service_detail, log_analytics,
backend_dependencies) plus the existing infrastructure_sql. Each
non-SQL example leads with a "When to use" header and a "Why this
shape" note so the model picks by intent, not by surface keyword
match. Examples were built and rendered on a live dev stack before
landing, which surfaced two design issues now captured as rules in
the prompt:
- chart-level numberFormat on a table that mixes counts and durations
formats every value as a duration; per-series numberFormat is the fix.
- metric tiles take exactly one select item (renderer destructures
select[0] and ignores the rest).
The query_guide prompt gains a DASHBOARD FILTERS section that
documents the filters: [{ type, name, expression, sourceId, where?,
whereLanguage? }] shape, a NUMBER FORMAT section that explains the
per-series vs. chart-level distinction, and a PER-TILE TYPE
CONSTRAINTS note about the metric one-select rule.
hyperdx_save_dashboard now accepts `filters` on its input schema,
reusing externalDashboardFilterSchemaWithId so the MCP and REST
surfaces stay in lockstep and the existing
convertExternalFiltersToInternal helper handles the conversion
without translation. Filters round-trip through create, get, and
update.
Voice pass: every prompt string is now em-dash-free, with a snapshot
test guarding regressions.
Drill-down behavior (onClick from a service inventory row into the
service detail dashboard) is a separate follow-up PR; landing the
prompt rewrite first.
The buildSourceSummary helper is the source-list block that buildCreateDashboardPrompt prepends to the prompt body. It still carried em-dashes from before the voice pass, so the assembled `create_dashboard` prompt that the MCP transport returns to a client contained eight em-dashes even though every builder in content.ts is clean. Caught by an end-to-end check that fetched the prompt through the live MCP transport and counted em-dashes in the response payload.
…rompts
claude-review flagged:
- `service_detail` example uses `having: "StatusMessage != ''"` on a
table tile, but mcpTableTileSchema.config does not include `having`.
Zod's `.strip()` silently drops the field, so an LLM following the
example through MCP would save a table that includes empty-message
rows. Added `having: z.string().max(10000).optional()` mirroring
externalDashboardTableChartConfigSchema and a round-trip test.
- Heatmap tile's numberFormat describe still referenced
`output: "time"` while the rest of the file uses `output: "duration"`.
Aligned to "duration" for consistency with the prompt guidance.
deep-review additionally flagged:
- query_guide told the model that metric tiles take "1 select item with
metricName + metricType", but mcpTileSelectItemSchema does not expose
those fields, so the keys are silently stripped. Replaced the metric
authoring guidance with an explicit "not currently exposed via the MCP
schema; use raw SQL for infrastructure metrics" note. Same edit on
the design-checklist rule 9 (replaced with a "replace, not merge"
update-semantic note that addresses a separate review concern about
partial-payload data loss on update).
- mcpFiltersParam advertised id as optional via .partial({ id: true }),
but createDashboardBodySchema rejects id (strict, no id) and
updateDashboardBodySchema requires id. So an LLM copy-pasting a
filter from get-dashboard into create, or adding a brand-new filter
on update without an id, would hit a confusing strict-validation
rejection. saveDashboard.ts now normalizes per flow: stripFilterIds()
on create, assignFilterIds() on update.
- Em-dash regression check did not cover buildQueryGuidePrompt (largest
prompt, most likely to regress) or buildSourceSummary (helpers.ts
carried em-dashes through the initial PR until fix `3539649e`). Added
assertions for both.
- Bad-sourceId test asserted only `text.toContain('source')`, which
matches almost any error string. Tightened to assert
"Could not find source IDs" plus the literal bad id.
- Numbered-rule check used `prompt.toContain('${i}.')` which would
match substrings like "1.2s" or "0.000000001". Anchored the check
to line start (`/^${i}\\. /m`) and scoped it to the DESIGN CHECKLIST
section.
- DASHBOARD FILTERS and NUMBER FORMAT body content was not asserted
beyond the heading. Now asserts substantive content (QUERY_EXPRESSION
/ expression / sourceId for filters; factor: 0.000000001 / duration /
per-series for number format).
- Filter round-trip did not cover the freshly-generated-id branch of
convertExternalFiltersToInternal at the MCP layer. Added a test that
ships a new no-id filter on update and asserts saveDashboard assigns
a fresh id.
…oard The service_inventory pattern's Services table now carries an onClick that drills into the partner "Service Detail" dashboard. Using mode: "template" with the constant template "Service Detail" resolves the target by name at click time, so the canonical flow (save both dashboards in one shot) works without needing to thread the detail dashboard's ID back into the inventory tile. The onClick filter's expression "ServiceName" matches the service_detail filter declaration, so the destination dashboard's "Service" dropdown auto-populates with the clicked row's value rather than asking the user to re-select it. Snapshot test asserts the template target, filter expression, and template field shape so a future refactor cannot silently break the inventory -> detail drill-down link.
cd37010 to
be223fb
Compare
Worked Claude through five starter dashboards on a fresh stack; this pass
folds the patterns it got wrong back into the prompts.
DESIGN CHECKLIST changes:
- Rule 2 (alias) now reads ALIAS EVERY SELECT ITEM, not ALIAS EVERY
AGGREGATION. Claude landed three number tiles with no alias on the
quantile() select item; rule 2 as written felt table-specific.
- Rule 10 (containers) is now REQUIRED at five-plus tiles, not a soft
hint. Claude built five dashboards averaging ten tiles each with
zero containers when the rule was a SHOULD.
- New rule 11 VALIDATE EVERY TILE AFTER SAVE, separated from the
workflow step so it has its own anchor in the checklist.
- New rule 12 NO TITLE-RECAP MARKDOWN TILE. Claude added a 24x2
"About this dashboard" markdown tile to every starter dashboard; the
renderer styles markdown headings at title-bar scale so the tile
ate a row of vertical space before the first KPI showed.
TILE TYPE GUIDE: markdown line rewritten to actively discourage the
title-recap habit (use containers/tabs for sectioning, skip the
markdown tile entirely on starters, no headings inside markdown bodies
if you must add one).
QUERY GUIDE additions:
- LUCENE FILTER SYNTAX gains a GOTCHA block: Lucene comparison
operators and wildcards on dotted attribute paths
(http.status_code:>=500, http.route:*) parse and save without error
but fail silently at query time. Workaround: switch to SQL with
bracket access. This was the cause of every "Error loading chart"
Claude hit during its run.
- PER-TILE TYPE CONSTRAINTS expanded for metric sources: builder tile
authoring on metric sources currently renders as "Both table name
and UUID are empty" at query time. Use a raw SQL tile with explicit
table reference (otel_metrics_gauge / otel_metrics_sum /
otel_metrics_histogram) until the builder catches up. Discovery note
added because metric sources don't publish mapAttributeKeys the way
log and trace sources do.
- TABLE TILE LINKING gains a GROUPBY ALIASES AND ROW-CLICK TEMPLATES
subsection. Builder tiles don't expose a groupBy alias today, so
groupBy: SpanAttributes['http.route'] produces a column name no
Handlebars template can reference as {{Route}}. Workaround:
configType: "sql" with explicit AS Route.
- COMMON MISTAKES rule 8 (validate-after-save) tightened to "every
tile, not just one". Rule 13 (missing alias) and rule 14 (Lucene
comparison on map paths) added with concrete wrong/correct examples.
Rules 11/12 onClick entries were duplicate-numbered in the
cherry-pick; renumbered to 15-18.
- New REFERENCES section at the end of the query guide pins three
ClickStack / ClickHouse doc URLs: search syntax for Lucene, the SQL
reference for the where field with whereLanguage:"sql", and the
sql-visualizations page for the $__timeFilter / $__timeInterval /
$__filters macros used in raw SQL tiles.
DEFAULT TIME WINDOW: short note in the create prompt that dashboards
open at 15 minutes and the user-side time picker is currently the only
way to widen the window. Tells the agent to surface this to the user
rather than padding individual tile time ranges to compensate.
Snapshot tests extended to lock the new rules in place: design
checklist now iterates 1 to 12, asserts on the strengthened phrasings
("ALIAS EVERY SELECT ITEM", "REQUIRED at five or more tiles",
"VALIDATE EVERY TILE AFTER SAVE", "NO TITLE-RECAP MARKDOWN TILE"),
plus per-section coverage tests for the GROUPBY ALIASES workaround,
the Lucene gotcha (in both LUCENE FILTER SYNTAX and COMMON MISTAKES),
the metric-source workaround text, and the three reference URLs.
…ond pass Watched Claude run on a fresh stack with the previous prompt revision; this commit closes the remaining gaps surfaced by that run. DESIGN CHECKLIST changes: - Rule 2 (alias) now ships a copy-pasteable number-tile example next to the rule body. The prior phrasing got better tables but Claude still dropped aliases on every builder number tile across three dashboards (P50/P95/P99 latency, Server Requests, Failed Spans, Log Errors). Concrete shape next to the text is what makes it stick. - Rule 10 (containers) now ships a copy-pasteable containers/tabs block. The prior REQUIRED phrasing did not bite at all; Claude built five dashboards with 9-10 tiles each and zero containers. The model reads "REQUIRED" as a soft hint when the next concrete shape is two files away in the dashboard_examples prompt. Inline the shape and the tile -> containerId wiring directly in the rule. WORKFLOW changes: - Now six steps. Step 2 says "list existing dashboards with hyperdx_get_dashboard, fetch one or two, learn local idioms before designing", which Claude's reflection asked for. Step 4 says "sketch tiles, THEN group them into 2-4 containers before assembling the save payload" so the container decision happens at the same level as picking source kind, not buried in a checklist. QUERY GUIDE changes: - LUCENE FILTER SYNTAX gotcha broadened from "comparison and wildcard on dotted map-attribute paths" to "ALL Lucene operations on map-attribute paths". Claude saw db.system:mongodb (simple equality) also fail to translate to SpanAttributes['db.system'] on this round. The previous phrasing said "simple equality on a dotted path is fine"; that was wrong. The corrected guidance is: use SQL with bracket access for any operation on map-attribute keys. - Added a FUZZY-MATCH NOTE alongside the gotcha: Lucene field:value on top-level string columns translates to ilike(field, '%value%'), not equality. That is fine for free-text columns but surprising for enum-like ones (SpanKind, SeverityText, StatusCode). Claude hit this on SpanKind:Server matching broader strings than intended. The canonical fix is whereLanguage: "sql" with =. - COMMON MISTAKES rule 14 rewritten to cover any-operation, with db.system:mongodb as the equality example. - Rule 15 added for the fuzzy-substring trap on top-level enum-like columns. - Rule 16 added for the empty-string trap on map-attribute groupBy: ClickHouse map columns return '' for unset keys, so groupBy on a sparse attribute produces an empty bucket alongside real values. Workaround: where: "SpanAttributes['<key>'] != ''" alongside the groupBy. Comes from Claude's MongoDB dashboard work where the empty collection-name bucket was the loudest visual noise. - Renumbered the onClick mistakes from 15-18 to 17-20 to make room. Tests extended to assert: workflow has six numbered steps and names the read-existing and group-into-containers steps explicitly; rule 2 carries the Server Requests example; rule 10 carries the kpis/trends/ errors shape; LUCENE FILTER SYNTAX mentions both wildcard and equality gotchas plus the FUZZY-MATCH NOTE; COMMON MISTAKES has dedicated rules for the map-attr-any-operation gotcha, the enum-like fuzzy-match trap, and the empty-string map-attr trap. All 6 example patterns still save and query cleanly end-to-end (48 tiles checked through hyperdx_query_tile).
Adds a new rule 3 GROUP BY HAS NO ALIAS HOOK that calls out the schema limit on the v2 chart config: groupBy is a single expression string, so the renderer uses it verbatim as the table column header and as the raw column name in CSV export, tooltips, orderBy, and onClick template references. A table grouped by SpanAttributes['http.route'] renders arrayElement(SpanAttributes, 'http.route') as the header. The rule pushes the model toward a top-level column when one carries the same semantic (SpanName / ServiceName / SeverityText) and falls back to a raw SQL tile with AS alias when only a Map attribute is available. Second-pass dashboards consistently produced ugly Map-expression headers on App-Performance tables even though every alias rule on the select side was already in the checklist; the gap was the missing acknowledgement that groupBy has no alias hook. Also fixes a pre-existing assertion in dashboards.test.ts left over from the rewrite: the test expected the phrase "not currently exposed via the MCP" which is no longer in the PER-TILE TYPE CONSTRAINTS section. Anchors on the surviving "Authoring builder tiles on a metric source is not reliable" and "MCP select-item shape does not carry" phrasing instead.
Summary
I rewrote the three MCP dashboard authoring prompts and added
filterstohyperdx_save_dashboard's input schema. The goal: when an agent calls the MCP server to build a dashboard, it picks the right pattern by intent, follows the renderer's actual constraints, and produces a JSON shape that renders correctly on the first try.What changed in the prompts:
create_dashboardnow leads with a ten-rule design checklist (RED columns with aliases, per-seriesnumberFormatfor durations,groupByColumnsOnLeftfor inventory tables, dashboard-level filters instead of per-tilewhereliterals, one-metric-per-tile for metric sources, containers and tabs for grouping). The wall-of-JSON canonical example is gone; the fourdashboard_examplespatterns carry the concrete shapes. The prompt is shorter and the rules are easier for the model to scan.dashboard_examplesis now four verified patterns (service_inventory,service_detail,log_analytics,backend_dependencies) plus the existinginfrastructure_sql. Each non-SQL example leads with a "When to use" header and a "Why this shape" note so the model picks by intent, not by surface keyword match. I built and rendered every example on a live dev stack before landing the prompt; that surfaced two design issues now captured as rules:numberFormaton a table that mixes counts and durations formats every value as a duration. Per-seriesnumberFormatis the fix.select[0]and ignores the rest; multi-metric authoring needs one tile per metric.query_guidegains aDASHBOARD FILTERSsection that documents thefilters: [{ type, name, expression, sourceId, where?, whereLanguage? }]shape, aNUMBER FORMATsection that explains the per-series vs. chart-level distinction, and aPER-TILE TYPE CONSTRAINTSnote about the metric one-select rule.What changed in
hyperdx_save_dashboard:filtersinput field on the tool'sinputSchema. ReusesexternalDashboardFilterSchemaWithIdfrom@/utils/zodso the MCP and REST surfaces stay in lockstep and the existingconvertExternalFiltersToInternalhelper handles the conversion without any translation layer. Same body schema (createDashboardBodySchema/updateDashboardBodySchema) that the v2 REST handler already uses.Voice pass: every prompt string is now em-dash-free, with a snapshot test guarding regressions.
Drill-down behavior (onClick from a service-inventory row into the service-detail dashboard) is a separate follow-up PR; landing the prompt rewrite first so review surface stays focused.
Test plan
yarn workspace @hyperdx/api jest mcp/__tests__/dashboards(51 passed, 0 failed)yarn workspace @hyperdx/api ci:lint(lint + tsc + openapi lint, clean)sourceIdreturns 4xxfiltersreturnsfilters: []prose-lint.pyon every changed file