obs(grafana): pair stat + timeseries on Pebble internals dashboard#609

Open
bootjp wants to merge 3 commits into main from obs/grafana-pebble-stat-timeseries

Conversation

@bootjp (Owner) commented Apr 23, 2026

Summary

Follow-up to #597 that landed before this improvement could ride along. The Pebble Internals dashboard rendered as flat stat-only rows (5 node values per metric), which is a bad fit for LSM signals that matter as trends (L0 build-up, compaction debt, memtable count approaching write-stall).

This PR restructures every gauge panel into a stat (current value, colour-coded) + timeseries (per-node history with threshold line) pair.

Changes

Single file: monitoring/grafana/dashboards/elastickv-pebble-internals.json.

Final shape: 26 panels — 5 collapsible rows + 21 data panels (10 stat + 11 timeseries).

| Metric | Paired | Threshold line |
| --- | --- | --- |
| elastickv_pebble_l0_sublevels | ✓ | red @ 20 (write-stall start) |
| elastickv_pebble_l0_num_files | ✓ | red @ 1000 |
| elastickv_pebble_memtable_count | ✓ | red @ 5 |
| elastickv_pebble_memtable_size_bytes | ✓ | |
| elastickv_pebble_memtable_zombie_count | ✓ | |
| elastickv_pebble_compact_estimated_debt_bytes | ✓ | |
| elastickv_pebble_compact_in_progress | ✓ | |
| elastickv_pebble_block_cache_hit_rate | ✓ (pre-existing) | |

Rate counters (compact_count_total, block_cache_hits_total, block_cache_misses_total) left as timeseries-only — an instantaneous rate stat is meaningless.
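For reference, those timeseries-only panels would use expressions of this general shape (a sketch only: the metric names come from this PR, but the 5m window and the choice of aggregation are illustrative, not copied from the dashboard JSON):

```promql
# Compactions per second, per node (window width is illustrative)
sum by (node_id) (rate(elastickv_pebble_compact_count_total{job="elastickv",node_id=~"$node_id"}[5m]))

# Block cache hit / miss throughput, per node
sum by (node_id) (rate(elastickv_pebble_block_cache_hits_total{job="elastickv",node_id=~"$node_id"}[5m]))
sum by (node_id) (rate(elastickv_pebble_block_cache_misses_total{job="elastickv",node_id=~"$node_id"}[5m]))
```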

elastickv_store_write_conflict_total is labelled by key_prefix not node_id, so per-node stat reduction doesn't apply. Single timeseries.

elastickv_pebble_fsm_apply_sync_mode stays a single stat (boolean mode indicator).

Layout: two 12-wide × 8-high panels per row, grouped under Block Cache / LSM Level State (collapsed) / Compaction / FSM Apply / Store Write Conflicts rows. Queries standardised on max by (node_id).
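To make the paired layout concrete, one stat + timeseries pair in Grafana dashboard JSON looks roughly like this (a sketch: panel titles, threshold steps, and grid coordinates are illustrative; only the query shape follows the `max by (node_id)` convention described above):

```json
[
  {
    "type": "stat",
    "title": "Memtable Count (current)",
    "gridPos": { "h": 8, "w": 12, "x": 0, "y": 1 },
    "targets": [
      {
        "datasource": "$datasource",
        "expr": "max by (node_id) (elastickv_pebble_memtable_count{job=\"elastickv\",node_id=~\"$node_id\"})",
        "legendFormat": "{{node_id}}"
      }
    ],
    "fieldConfig": {
      "defaults": {
        "thresholds": {
          "steps": [
            { "color": "green", "value": null },
            { "color": "yellow", "value": 3 },
            { "color": "red", "value": 5 }
          ]
        }
      }
    }
  },
  {
    "type": "timeseries",
    "title": "Memtable Count Over Time",
    "gridPos": { "h": 8, "w": 12, "x": 12, "y": 1 },
    "targets": [
      {
        "datasource": "$datasource",
        "expr": "max by (node_id) (elastickv_pebble_memtable_count{job=\"elastickv\",node_id=~\"$node_id\"})",
        "legendFormat": "{{node_id}}"
      }
    ]
  }
]
```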

Test plan

  • JSON valid: python3 -c "import json; json.load(open('monitoring/grafana/dashboards/elastickv-pebble-internals.json'))"
  • Import into Grafana, panels render without errors (operator verification)

/gemini review
@codex review

…l thresholds

Restructures the Pebble Internals dashboard so every gauge-style signal
is represented by two panels: a stat (current per-node value with colour
thresholds) and a timeseries (per-node trend, with red threshold lines
at the Pebble write-stall points where applicable).

Signals paired:
- L0 sublevels (stat thresholds 10/15, line at 20)
- L0 num files (stat thresholds 500/1000, line at 1000)
- Compaction estimated debt (stat thresholds 100/500 MiB)
- Compactions in progress (stat thresholds 1/3)
- Memtable count (stat thresholds 3/5, line at 5)
- Memtable size bytes
- Memtable zombie count

Rate-of-counter metrics (compaction rate, cache hits/misses rate, store
write conflicts by prefix) and the FSM apply sync mode indicator stay
single-panel because a point-in-time rate or boolean has no meaningful
stat companion.

Panels are grouped into rows: Block Cache, LSM Level State (collapsed),
Compaction, FSM Apply, Store Write Conflicts. Queries use
`max by (node_id)` to coalesce the group / node_address dimensions the
Pebble collector emits. Grid layout is two 12-wide panels per visual
row at height 8.
@gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 13 minutes and 2 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 13 minutes and 2 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08cf1960-22d6-4503-b620-9717c13b37ba

📥 Commits

Reviewing files that changed from the base of the PR and between 99c0560 and b16137e.

📒 Files selected for processing (1)
  • monitoring/grafana/dashboards/elastickv-pebble-internals.json

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 050b4d93a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

{
"datasource": "$datasource",
"editorMode": "code",
"expr": "max by (node_id) (elastickv_pebble_memtable_size_bytes{job=\"elastickv\",node_id=~\"$node_id\"})",

P2 Badge Aggregate memtable bytes across groups per node

On nodes that host multiple Raft groups, elastickv_pebble_memtable_size_bytes is emitted per group, so max by (node_id) only keeps the largest group and drops the rest. That under-reports true per-node memtable footprint (e.g., two 200MiB groups still display 200MiB), which can hide pressure in both the new stat thresholds and the paired trend panel; this should be sum by (node_id) if the panel is intended to represent per-node memory usage.

Useful? React with 👍 / 👎.

Codex P2: the Memtable Size panels used max by (node_id), but
elastickv_pebble_memtable_size_bytes is emitted per raft group. On
nodes that host more than one group max would pick only the largest
group memtable and drop the rest, so two 200 MiB groups still
displayed 200 MiB instead of 400 MiB. That under-reports real
per-node memtable pressure and could hide exactly the "too many
memtables queued for flush" signal the paired trend panel exists to
surface.

Switched both the "Memtable Size (current)" stat and the "Memtable
Size Over Time" timeseries to sum by (node_id). Updated the two
panel descriptions to spell out the aggregation choice so a future
reader can tell it is intentional (max is correct for L0 sublevels
because sublevels is per-group stall risk, but wrong for memory
totals).
@bootjp (Owner, Author) commented Apr 24, 2026

Addressed Codex P2 in db6115f:

Aggregate memtable bytes across groups per node. elastickv_pebble_memtable_size_bytes is emitted per raft group, so max by (node_id) was picking only the largest group memtable on multi-group nodes and dropping the rest — two 200 MiB groups still displayed 200 MiB instead of 400 MiB, under-reporting real per-node memory pressure.

Both the "Memtable Size (current)" stat and the "Memtable Size Over Time" timeseries now use sum by (node_id). Panel descriptions updated to spell out the aggregation choice so a future reader can tell it is deliberate.

Note on the other panels. max by (node_id) is still correct for the L0 sublevels / num_files panels — sublevels is a per-group stall-risk signal where the worst group is what matters, and swapping to sum there would mask backpressure on the group actually being throttled. The memtable-bytes case is different because the metric is a bytes counter, so total node memory really is the sum across groups. Leaving the other panels untouched keeps this PR scoped to the flagged issue.
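The difference between the two reducers can be sketched with toy numbers mirroring the two-200-MiB-groups example from the review (the sample values and the `aggregate_by_node` helper are illustrative, not real collector output):

```python
# Toy per-(node, group) samples for one node, mirroring the review's example
# of two raft groups each holding 200 MiB of memtable.
MIB = 1024 * 1024
samples = {
    ("node-1", "group-a"): 200 * MIB,
    ("node-1", "group-b"): 200 * MIB,
}

def aggregate_by_node(samples, reducer):
    """Collapse per-(node, group) samples to per-node values,
    like `max by (node_id)` / `sum by (node_id)` in PromQL."""
    per_node = {}
    for (node_id, _group), value in samples.items():
        per_node.setdefault(node_id, []).append(value)
    return {node: reducer(values) for node, values in per_node.items()}

print(aggregate_by_node(samples, max)["node-1"] // MIB)  # 200: largest group only
print(aggregate_by_node(samples, sum)["node-1"] // MIB)  # 400: true per-node total
```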

/gemini review
@codex review

@gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db6115feb4


"datasource": "$datasource",
"editorMode": "code",
"expr": "elastickv_pebble_l0_num_files{job=\"elastickv\",node_id=~\"$node_id\"}",
"expr": "max by (node_id) (elastickv_pebble_compact_estimated_debt_bytes{job=\"elastickv\",node_id=~\"$node_id\"})",

P2 Badge Sum compaction debt across groups for per-node panels

elastickv_pebble_compact_estimated_debt_bytes is emitted per raft group (it carries a group label), but this panel reduces with max by (node_id). On multi-group nodes, that drops all but the largest group’s debt (for example, two 400 MiB groups still render 400 MiB), so the "per node" debt stat/trend under-reports real backlog and can mask compaction pressure. Use sum by (node_id) (or explicitly relabel this as worst-group debt).

Useful? React with 👍 / 👎.

"editorMode": "code",
"expr": "elastickv_pebble_fsm_apply_sync_mode{job=\"elastickv\",node_id=~\"$node_id\",mode=\"nosync\"}",
"legendFormat": "{{node_id}} nosync",
"expr": "max by (node_id) (elastickv_pebble_compact_in_progress{job=\"elastickv\",node_id=~\"$node_id\"})",

P2 Badge Aggregate compactions-in-progress across groups

This query is described as "current number of compactions running per node," but max by (node_id) only keeps the busiest single group on each node. If a node runs compactions in several groups at once, the panel undercounts (e.g., 2 + 2 active compactions appears as 2), which can hide node-level compaction saturation. This should be sum by (node_id) for a true per-node count, or the panel text should be changed to worst-group semantics.

Useful? React with 👍 / 👎.

Codex P2 follow-up on db6115f: same aggregation bug applied to the
Compaction section. elastickv_pebble_compact_estimated_debt_bytes and
elastickv_pebble_compact_in_progress are both emitted per raft group,
but the panels reduced with max by (node_id). On multi-group nodes:

- Two groups each holding 400 MiB of compaction debt rendered as
  400 MiB instead of 800 MiB, hiding the real per-node backlog.
- Two groups each running 2 compactions rendered as 2 instead of 4,
  hiding node-level compaction saturation.

Switched both stat and timeseries panels for both metrics to
sum by (node_id). Updated all four descriptions to spell out the
aggregation choice and the (per-group max × group-count) ceiling
operators should expect, so a future reader can tell the change is
intentional.

The L0 sublevels / num_files panels remain on max by (node_id) —
those are per-group stall-risk signals where the worst group is what
matters; summing would mask backpressure on the specific group being
throttled.
@bootjp (Owner, Author) commented Apr 24, 2026

Addressed Codex P2 (x2) follow-up in b16137e:

Compaction debt and compactions-in-progress now sum across groups per node. Same per-group aggregation bug as the memtable panels fixed in db6115f: both metrics carry a group label, and max by (node_id) dropped all but the busiest group. On multi-group nodes that meant two groups each holding 400 MiB of debt rendered as 400 MiB instead of 800 MiB, and two groups each running two compactions rendered as 2 instead of 4 — hiding per-node compaction pressure the panels were specifically meant to surface.

Switched both stat + timeseries panels for compact_estimated_debt_bytes and compact_in_progress to sum by (node_id). Descriptions updated to spell out the aggregation and the (per-group max × group-count) ceiling an operator should expect.

Still leaving max by (node_id) on the L0 sublevels / num_files panels — those are per-group stall-risk signals where the worst group is what matters, and summing there would dilute exactly the backpressure signal that drives write-stall investigation.
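Summarising the aggregation choices as of b16137e, the per-metric queries land as follows (expressions reconstructed from the discussion above; the label matchers follow the pattern visible in the review snippets):

```promql
# Per-node totals: sum across raft groups
sum by (node_id) (elastickv_pebble_memtable_size_bytes{job="elastickv",node_id=~"$node_id"})
sum by (node_id) (elastickv_pebble_compact_estimated_debt_bytes{job="elastickv",node_id=~"$node_id"})
sum by (node_id) (elastickv_pebble_compact_in_progress{job="elastickv",node_id=~"$node_id"})

# Worst-group stall risk: max across raft groups
max by (node_id) (elastickv_pebble_l0_sublevels{job="elastickv",node_id=~"$node_id"})
max by (node_id) (elastickv_pebble_l0_num_files{job="elastickv",node_id=~"$node_id"})
```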

/gemini review
@codex review

@gemini-code-assist (Contributor) commented:

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector commented:

Codex Review: Didn't find any major issues. Another round soon, please!

