Skip to content

DSL debugger (MAL): surface terminal payload.value with two-phase calculate; reduce session record cap to 100#13865

Merged
wu-sheng merged 2 commits intomasterfrom
dsl-debugging-terminal-value-and-cap
May 9, 2026
Merged

DSL debugger (MAL): surface terminal payload.value with two-phase calculate; reduce session record cap to 100#13865
wu-sheng merged 2 commits intomasterfrom
dsl-debugging-terminal-value-and-cap

Conversation

@wu-sheng
Copy link
Copy Markdown
Member

@wu-sheng wu-sheng commented May 9, 2026

Fix two related issues in the live debugger surface for unreleased SWIP-13 code

Bug 1 — terminal MAL output sample shows the wrong shape in the UI.
Operators inspecting a histogram_percentile([...]) rule see bucket-keyed
samples ((service_name=…, le=10), …, le=20, …) in the value column instead
of the percentile result they expect to see in storage ({p=50}, {p=90}, {p=99}).
Two reasons: (1) the captured output sample never carried the metric's actual
reading on payload.value — the wire only had metric / entity / valueType
/ timeBucket, so the UI fell back to rendering the upstream
histogram_percentile function-stage SampleFamily, which legitimately has
le= keyed samples but isn't what the rule emits. (2) Even if we add payload.value,
two-phase functions (AvgHistogramPercentileFunction, SumHistogramPercentileFunction)
populate their user-visible field (percentileValues) lazily in calculate(),
which the L1 streaming pipeline calls after the captureMeterEmit probe site,
so a naive getValue() read returns the empty pre-calculate map.

This PR fixes both:

  • New appendValue(JsonObject, AcceptableValue<?>) in MALDebugRecorderImpl
    renders payload.value per holder type — JSON number for LongValueHolder
    / IntValueHolder / DoubleValueHolder (with \"NaN\" / \"Infinity\"
    string fall-back for non-finite doubles so the wire stays valid JSON), and
    JsonObject {key: long} for LabeledValueHolder (label combos for
    *Labeled functions, p=<rank> entries for percentile functions).
  • Forces Metrics.calculate() once before the getValue() read. Safe because
    the function's isCalculated guard makes the streaming pipeline's later
    call a no-op, and cross-node combine() resets isCalculated=false so
    post-merge state still recomputes on read.

Both edits are inside MALDebugRecorderImpl, which is only reached from the
gated captureMeterEmit probe — when no operator has installed a session the
probe gate is JIT-elided on hot paths, so steady-state runtime cost is zero.

Bug 2 — recordCap ceiling is too generous for an operator-driven debugger.
Hard cap was 10000 records / session, default 1000. Both burn UI page
real estate and heap (per-record JSON ≈ 10 KiB → ~100 MiB worst case per
session). Operators inspect a handful of executions, not a paginated firehose.

This PR drops both to 100 (default = hard cap = MAX_RECORD_CAP),
yielding ~1 MiB worst-case heap per session. Out-of-range requests still get
400 invalid_limits from the constructor.

  • Add a unit test to verify that the fix works.

    • All 16 existing dsl-debugging unit tests pass against the new code; no
      fixture broke. The cap-aware test (DebugSessionRegistryTest.reapExpired_…)
      was switched to SessionLimits.MAX_RECORD_CAP so it tracks the cap going
      forward instead of pinning a stale literal.
    • The MAL e2e flow now asserts payload.value is present and numeric on
      every terminal output sample.
    • Live-verified against a local stack feeding a histogram_percentile
      rule: terminal sample now reports
      value: {\"{p=50}\": 250000, \"{p=90}\": 5000000, \"{p=99}\": 10000000}.
  • Explain briefly why the bug exists and how to fix it.

    • See above. Both bugs are debug-only paths; fix is debug-only and gated.
  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.

  • Update the CHANGES log.

    • Not required: this patches unreleased SWIP-13 code that's already in the
      10.5.0 changelog entry. The wire/cap changes won't be visible to any
      released version.

…lculate); reduce session record cap to 100

- MAL terminal output sample now carries payload.value rendered per holder
  type: Long/Int/Double scalars as JSON numbers (NaN / +-Infinity as strings),
  LabeledValueHolder DataTables as JSON object {key: long}. Force
  Metrics.calculate() at probe time for two-phase functions
  (AvgHistogramPercentileFunction / SumHistogramPercentileFunction) so the
  captured value carries the percentile result keyed by p=<rank> rather
  than the empty pre-calculate map.
- Per-session record cap dropped from 10000 (default 1000) to 100 hard
  cap = default. ~1 MiB worst-case heap per session and a UI-readable
  page; operators inspect a handful of executions, not a paginated
  firehose. SessionLimits constructor still rejects out-of-range with
  IllegalArgumentException -> 400 invalid_limits.
- Both changes are debug-only and zero-cost when no session is installed:
  the captureMeterEmit probe gate is JIT-elided on hot paths, calculate()
  is idempotent via the function's isCalculated guard, and combine()
  resets isCalculated=false so cluster-merged state still recomputes on
  read.
- Updated the MAL admin-API doc with a concrete payload example and the
  SWIP-13 reference cap line / heap math.
- E2E MAL flow asserts payload.value is present and numeric on every
  terminal output sample.
- All 16 existing dsl-debugging unit tests pass.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the MAL DSL live-debugger surface to (1) include the terminal meter emission’s computed value on the wire (including two-phase functions via an early calculate()), and (2) reduce per-session record retention to a much smaller cap to limit heap/UI impact.

Changes:

  • Add payload.value to terminal meterEmit samples by rendering scalar and labeled holders, and force Metrics.calculate() before reading values.
  • Reduce session record cap to 100 (default and hard cap) and update related tests/docs.
  • Extend MAL e2e validation to assert terminal payload.value is present and numeric for the seeded rule.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/e2e-v2/cases/dsl-debugging/mal/dsl-debug-flow.sh Adds an e2e assertion that terminal meterEmit includes numeric payload.value.
oap-server/server-admin/dsl-debugging/src/test/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/DebugSessionRegistryTest.java Updates a test to track the new MAX_RECORD_CAP constant.
oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/SessionLimits.java Lowers MAX_RECORD_CAP to 100 and updates defaults/commentary.
oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/session/AbstractDebugRecorder.java Updates inline documentation referencing the new default cap.
oap-server/server-admin/dsl-debugging/src/main/java/org/apache/skywalking/oap/server/admin/dsl/debugging/mal/MALDebugRecorderImpl.java Adds appendValue(...) to emit terminal payload.value and forces calculate() for two-phase functions.
docs/en/swip/SWIP-13.md Updates SWIP-13 contract text to reflect the new record cap sizing and rationale.
docs/en/setup/backend/admin-api/dsl-debugging-mal.md Updates the sample payload to include value in terminal output samples.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +226 to +242
* BEFORE the streaming pipeline calls calculate(), so without forcing
* calculate() here the labeled value column would be an empty map for
* histogram-percentile rules — exactly when operators most need to
* verify what the rule emits. We force calculate() at probe time so
* the captured value matches what the storage row will contain. This
* is safe and zero-cost when debug is off:
* <ul>
* <li>The probe site itself is gated; when no operator has installed
* a session the call into this method is dead code that the JIT
* elides on hot-path inlining.</li>
* <li>{@code calculate()} is idempotent via the function's
* {@code isCalculated} guard, so the streaming pipeline's later
* call is a no-op rather than double work.</li>
* <li>Cross-node {@code combine()} resets {@code isCalculated=false},
* so the post-merge calculate on combined state still happens
* on read — pre-computing here on the local snapshot doesn't
* leak stale values into the cluster.</li>
Comment on lines 181 to +194
{ "type": "output",
"sourceText": "e2e_demo_filtered_requests",
"continueOn": true,
"payload": { /* terminal meter sample — metric, entity, value, timeBucket */ } }
"payload": {
"metric": "e2e_demo_filtered_requests",
"entity": "MeterEntity(scopeType=SERVICE, serviceName=my-svc, …)",
"valueType": "sum",
"timeBucket": 202605091036,
"value": 42 /* shape depends on valueType:
number for Sum/Avg/Max/Min/CPM/Latest…,
object {bucket: count} for histograms /
*Labeled functions, omitted for non-scalar
holders. NaN/±Infinity render as strings. */
} }
@wu-sheng wu-sheng added bug Something isn't working and you are sure it's a bug! backend OAP backend related. labels May 9, 2026
- MALDebugRecorderImpl javadoc: drop the misleading "calculate() is
  idempotent via isCalculated guard" claim. Reality: most
  Metrics.calculate() implementations have no guard (AvgFunction,
  CPMMetrics, SumMetrics just recompute every call), and the histogram-
  percentile pair (AvgHistogramPercentileFunction,
  SumHistogramPercentileFunction) check isCalculated at entry but never
  set it to true on exit, so even those re-run on the streaming pipeline.
  Reword to call out the bounded-but-real cost (one extra calculate per
  probed emission per debug session) and keep the gating story
  (probe-site JIT-elision when no session is installed; cluster combine()
  resets the flag so peer reads still recompute).

- dsl-debugging-{mal,oal,lal}.md Limits sections: bring defaults +
  examples in line with the new SessionLimits.MAX_RECORD_CAP = 100.
  Previous tables said default 1000 with override examples like
  {"recordCap": 200} — the new ceiling rejects those with
  400 invalid_limits. Add explicit hard-cap columns for both recordCap
  and retentionMillis so operators see the validated range up front.
@wu-sheng wu-sheng merged commit fd54f92 into master May 9, 2026
442 of 445 checks passed
@wu-sheng wu-sheng deleted the dsl-debugging-terminal-value-and-cap branch May 9, 2026 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend OAP backend related. bug Something isn't working and you are sure it's a bug!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants