obs(grafana): pair stat + timeseries on Pebble internals dashboard#609
Conversation
Restructures the Pebble Internals dashboard so every gauge-style signal is represented by two panels: a stat (current per-node value with colour thresholds) and a timeseries (per-node trend, with red threshold lines at the Pebble write-stall points where applicable).

Signals paired:
- L0 sublevels (stat thresholds 10/15, line at 20)
- L0 num files (stat thresholds 500/1000, line at 1000)
- Compaction estimated debt (stat thresholds 100/500 MiB)
- Compactions in progress (stat thresholds 1/3)
- Memtable count (stat thresholds 3/5, line at 5)
- Memtable size bytes
- Memtable zombie count

Rate-of-counter metrics (compaction rate, cache hits/misses rate, store write conflicts by prefix) and the FSM apply sync mode indicator stay single-panel because a point-in-time rate or boolean has no meaningful stat companion.

Panels are grouped into rows: Block Cache, LSM Level State (collapsed), Compaction, FSM Apply, Store Write Conflicts. Queries use `max by (node_id)` to coalesce the group / node_address dimensions the Pebble collector emits. Grid layout is two 12-wide panels per visual row at height 8.
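The stat + timeseries pairing described above can be sketched as a small generator. This is a minimal illustration, not code from the PR: the field names (`gridPos`, `targets`, `fieldConfig.defaults.thresholds`) follow standard Grafana dashboard JSON, and `panel_pair` is a hypothetical helper.

```python
# Sketch of one stat + timeseries pair as described in the PR:
# two 12-wide panels at height 8, with colour thresholds on the stat.
# "panel_pair" is a hypothetical helper, not part of the PR itself.

def panel_pair(title, expr, warn, crit, y):
    """Build a 12-wide stat + 12-wide timeseries pair at grid row y."""
    thresholds = {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},   # base colour
            {"color": "yellow", "value": warn},  # first stat threshold
            {"color": "red", "value": crit},     # second stat threshold
        ],
    }
    target = {"datasource": "$datasource", "editorMode": "code", "expr": expr}
    stat = {
        "type": "stat",
        "title": f"{title} (current)",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": y},
        "targets": [target],
        "fieldConfig": {"defaults": {"thresholds": thresholds}},
    }
    trend = {
        "type": "timeseries",
        "title": f"{title} Over Time",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": y},
        "targets": [target],
    }
    return [stat, trend]

# Example: the L0 sublevels pair with the 10/15 thresholds listed above.
pair = panel_pair(
    "L0 Sublevels",
    'max by (node_id) (elastickv_pebble_l0_sublevels{job="elastickv",node_id=~"$node_id"})',
    warn=10, crit=15, y=0,
)
```

The same helper shape would cover each of the seven paired signals by varying only title, expression, and thresholds.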
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 050b4d93a2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```json
{
  "datasource": "$datasource",
  "editorMode": "code",
  "expr": "max by (node_id) (elastickv_pebble_memtable_size_bytes{job=\"elastickv\",node_id=~\"$node_id\"})",
```
Aggregate memtable bytes across groups per node
On nodes that host multiple Raft groups, `elastickv_pebble_memtable_size_bytes` is emitted per group, so `max by (node_id)` only keeps the largest group and drops the rest. That under-reports the true per-node memtable footprint (e.g., two 200 MiB groups still display 200 MiB), which can hide pressure in both the new stat thresholds and the paired trend panel; this should be `sum by (node_id)` if the panel is intended to represent per-node memory usage.
Codex P2: the Memtable Size panels used `max by (node_id)`, but `elastickv_pebble_memtable_size_bytes` is emitted per raft group. On nodes that host more than one group, `max` would pick only the largest group's memtable and drop the rest, so two 200 MiB groups still displayed 200 MiB instead of 400 MiB. That under-reports real per-node memtable pressure and could hide exactly the "too many memtables queued for flush" signal the paired trend panel exists to surface.

Switched both the "Memtable Size (current)" stat and the "Memtable Size Over Time" timeseries to `sum by (node_id)`. Updated the two panel descriptions to spell out the aggregation choice so a future reader can tell it is intentional (`max` is correct for L0 sublevels, because sublevels is per-group stall risk, but wrong for memory totals).
Addressed Codex P2 in db6115f: Aggregate memtable bytes across groups per node.

`elastickv_pebble_memtable_size_bytes` is emitted per raft group, so `max by (node_id)` was picking only the largest group's memtable on multi-group nodes and dropping the rest — two 200 MiB groups still displayed 200 MiB instead of 400 MiB, under-reporting real per-node memory pressure. Both the "Memtable Size (current)" stat and the "Memtable Size Over Time" timeseries now use `sum by (node_id)`. Panel descriptions updated to spell out the aggregation choice so a future reader can tell it is deliberate.

Note on the other panels: `max by (node_id)` is still correct for the L0 sublevels / num_files panels — sublevels is a per-group stall-risk signal where the worst group is what matters, and swapping to sum there would mask backpressure on the group actually being throttled. The memtable-bytes case is different because the metric is a bytes gauge, so total node memory really is the sum across groups. Leaving the other panels untouched keeps this PR scoped to the flagged issue.

/gemini review
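The aggregation difference behind this fix is easy to demonstrate in isolation. A minimal sketch, with made-up sample values standing in for the per-group series Prometheus would return for one node:

```python
# Why `max by (node_id)` under-reports per-group byte metrics:
# two 200 MiB memtables on one node should read as 400 MiB total.
MIB = 1 << 20

# Made-up samples: one node ("n1") hosting two raft groups.
samples = [
    {"node_id": "n1", "group": "g1", "value": 200 * MIB},
    {"node_id": "n1", "group": "g2", "value": 200 * MIB},
]

def aggregate(samples, op):
    """Collapse the group label per node, like `<op> by (node_id)` in PromQL."""
    by_node = {}
    for s in samples:
        by_node.setdefault(s["node_id"], []).append(s["value"])
    return {node: op(values) for node, values in by_node.items()}

print(aggregate(samples, max)["n1"] // MIB)  # 200 -- drops one group
print(aggregate(samples, sum)["n1"] // MIB)  # 400 -- true node total
```

The same logic explains why `max` stays correct for a worst-group signal like L0 sublevels: there the operator cares about the single most-stalled group, not a total.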
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: db6115feb4
```json
  "datasource": "$datasource",
  "editorMode": "code",
  "expr": "elastickv_pebble_l0_num_files{job=\"elastickv\",node_id=~\"$node_id\"}",
  "expr": "max by (node_id) (elastickv_pebble_compact_estimated_debt_bytes{job=\"elastickv\",node_id=~\"$node_id\"})",
```
Sum compaction debt across groups for per-node panels
`elastickv_pebble_compact_estimated_debt_bytes` is emitted per raft group (it carries a `group` label), but this panel reduces with `max by (node_id)`. On multi-group nodes, that drops all but the largest group's debt (for example, two 400 MiB groups still render 400 MiB), so the "per node" debt stat/trend under-reports the real backlog and can mask compaction pressure. Use `sum by (node_id)` (or explicitly relabel this as worst-group debt).
```json
  "editorMode": "code",
  "expr": "elastickv_pebble_fsm_apply_sync_mode{job=\"elastickv\",node_id=~\"$node_id\",mode=\"nosync\"}",
  "legendFormat": "{{node_id}} nosync",
  "expr": "max by (node_id) (elastickv_pebble_compact_in_progress{job=\"elastickv\",node_id=~\"$node_id\"})",
```
Aggregate compactions-in-progress across groups
This query is described as "current number of compactions running per node," but `max by (node_id)` only keeps the busiest single group on each node. If a node runs compactions in several groups at once, the panel undercounts (e.g., 2 + 2 active compactions appears as 2), which can hide node-level compaction saturation. This should be `sum by (node_id)` for a true per-node count, or the panel text should be changed to worst-group semantics.
Codex P2 follow-up on db6115f: same aggregation bug applied to the Compaction section.

`elastickv_pebble_compact_estimated_debt_bytes` and `elastickv_pebble_compact_in_progress` are both emitted per raft group, but the panels reduced with `max by (node_id)`. On multi-group nodes:

- Two groups each holding 400 MiB of compaction debt rendered as 400 MiB instead of 800 MiB, hiding the real per-node backlog.
- Two groups each running 2 compactions rendered as 2 instead of 4, hiding node-level compaction saturation.

Switched both stat and timeseries panels for both metrics to `sum by (node_id)`. Updated all four descriptions to spell out the aggregation choice and the (per-group max × group-count) ceiling operators should expect, so a future reader can tell the change is intentional.

The L0 sublevels / num_files panels remain on `max by (node_id)` — those are per-group stall-risk signals where the worst group is what matters; summing would mask backpressure on the specific group being throttled.
Addressed Codex P2 (x2) follow-up in b16137e: Compaction debt and compactions-in-progress now sum across groups per node.

Same per-group aggregation bug as the memtable panels fixed in db6115f: both metrics carry a `group` label, and `max by (node_id)` dropped all but the busiest group. On multi-group nodes, that meant two groups each holding 400 MiB of debt rendered as 400 MiB instead of 800 MiB, and two groups each running two compactions rendered as 2 instead of 4 — hiding the per-node compaction pressure the panels were specifically meant to surface.

Switched both stat + timeseries panels for `compact_estimated_debt_bytes` and `compact_in_progress` to `sum by (node_id)`. Descriptions updated to spell out the aggregation and the (per-group max × group-count) ceiling an operator should expect.

Still leaving `max by (node_id)` on the L0 sublevels / num_files panels — those are per-group stall-risk signals where the worst group is what matters, and summing there would dilute exactly the backpressure signal that drives write-stall investigation.

/gemini review
Codex Review: Didn't find any major issues. Another round soon, please!
Summary
Follow-up to #597, which landed before this improvement could ride along. The Pebble Internals dashboard rendered as flat stat-only rows (5 node values per metric), which is a bad fit for LSM signals that matter as trends (L0 build-up, compaction debt, memtable count approaching write-stall).
This PR restructures every gauge panel into a stat (current value, colour-coded) + timeseries (per-node history with threshold line) pair.
Changes

Single file: monitoring/grafana/dashboards/elastickv-pebble-internals.json.

Final shape: 26 panels — 5 collapsible rows + 21 data panels (10 stat + 11 timeseries).

Gauge metrics paired as stat + timeseries:

- `elastickv_pebble_l0_sublevels`
- `elastickv_pebble_l0_num_files`
- `elastickv_pebble_memtable_count`
- `elastickv_pebble_memtable_size_bytes`
- `elastickv_pebble_memtable_zombie_count`
- `elastickv_pebble_compact_estimated_debt_bytes`
- `elastickv_pebble_compact_in_progress`
- `elastickv_pebble_block_cache_hit_rate`

Rate counters (`compact_count_total`, `block_cache_hits_total`, `block_cache_misses_total`) left as timeseries-only — an instantaneous rate stat is meaningless.

`elastickv_store_write_conflict_total` is labelled by `key_prefix`, not `node_id`, so per-node stat reduction doesn't apply. Single timeseries.

`elastickv_pebble_fsm_apply_sync_mode` stays a single stat (boolean mode indicator).

Layout: two 12-wide × 8-high panels per row, grouped under Block Cache / LSM Level State (collapsed) / Compaction / FSM Apply / Store Write Conflicts rows. Queries standardised on `max by (node_id)`.

Test plan

```
python3 -c "import json; json.load(open('monitoring/grafana/dashboards/elastickv-pebble-internals.json'))"
```

/gemini review
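The test plan's `json.load` only proves the file parses. The "26 panels: 5 rows + 21 data panels" claim could also be checked mechanically. A hedged sketch — `count_panels` is a hypothetical helper run against an inline stand-in here, not the real dashboard file; collapsed Grafana rows carry their child panels inline in a nested `panels` array:

```python
import json

def count_panels(dashboard):
    """Count (row, data) panels, descending into collapsed rows."""
    rows, data = 0, 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            rows += 1
            # A collapsed row nests its children in its own "panels" list.
            inner_rows, inner_data = count_panels(panel)
            rows += inner_rows
            data += inner_data
        else:
            data += 1
    return rows, data

# Inline stand-in for the real dashboard: one collapsed row with two
# children, plus one top-level stat panel.
sample = json.loads("""
{"panels": [
  {"type": "row", "collapsed": true,
   "panels": [{"type": "stat"}, {"type": "timeseries"}]},
  {"type": "stat"}
]}
""")
print(count_panels(sample))  # (1, 3)
```

Against the real file, asserting the result equals `(5, 21)` would pin the dashboard shape described in this PR.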
@codex review