Skip to content

feat(consensus): implement metrics for consensus#470

Open
iamquang95 wants to merge 2 commits into
mainfrom
iamquang95/consensus-metrics
Open

feat(consensus): implement metrics for consensus#470
iamquang95 wants to merge 2 commits into
mainfrom
iamquang95/consensus-metrics

Conversation

@iamquang95

@iamquang95 iamquang95 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Fix #62

Stack on consensus/qbft

Base automatically changed from iamquang95/consensus-qbft to main June 12, 2026 07:15
@iamquang95 iamquang95 force-pushed the iamquang95/consensus-metrics branch from e7b28c9 to ee8ac59 Compare June 12, 2026 07:50
@iamquang95

Copy link
Copy Markdown
Collaborator Author

@claude review this PR

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @iamquang95's task in 3m 4s —— View job


Review: feat(consensus): implement metrics for consensus

  • Gather context (read changed files + Go reference)
  • Review consensus metrics implementation
  • Post review feedback

Summary

Clean, faithful port of Charon's core/consensus/metrics package and its QBFT call sites. Metric names, subsystem prefix (core_consensus), labels, histogram buckets, and help text all match charon/core/consensus/metrics/metrics.go, and the #[vise::register] Global<…> registration follows the established repo convention (tracking, p2p, peerinfo, relay-server, core/tracker). The signed-duration handling is a nice catch for Go-parity. I have no blocking concerns — findings below are Low/nit.

Parity matrix

Component Go (metrics.go / qbft.go) Rust Match
Prefix core + consensus prefix = "core_consensus"
decided_rounds (gauge, [protocol,duty,timer])
decided_leader_index (gauge, [protocol,duty])
duration_seconds buckets .01…3,5 (17) identical 17
timeout_total / error_total
proposedAt capture point after value/hash/verify send, before MaybeStart (qbft.go:433) runner.rs:153, same point
Duration observed only on Propose path (not Participate) qbft.go:435 defer runner.rs:163 in propose only
Duration observed even when instance already running (!MaybeStart) qbft.go:446 + defer runner.rs:157-163
Timeout on !decided after Run qbft.go:603-605 ContextCanceled && !decided (runner.rs:422-428)
Error on non-context Run error qbft.go:596-598 Err(err) => inc + Error::Core + transport-error branch ✅ (see L1)
Signed duration when decidedAt < proposedAt time.Duration is signed (qbft.go:439) duration_seconds negative fallback

Findings

  • [Low] decided.store(true) is unconditional while the decided-at send + decision metrics are gated on !qcommit.is_empty()
    Evidence: crates/consensus/src/qbft/runner.rs:330-334
    Go reference: charon/core/consensus/qbft/qbft.go:548-563 (round := qcommit[0].Round(), decided = true and inst.DecidedAtCh <- … all unconditional).
    Impact: in the (in-practice impossible) empty-qcommit case, Rust would set decided = true but record neither the leader/round metrics nor the decided timestamp — so the instance resolves as Ok with no duration observed and no decision metrics. The empty-guard is good defensive code, but it's internally inconsistent with the unconditional decided.store. Consider either moving decided.store(true) inside the same guard, or dropping the guard (qbft only invokes the decide callback with a non-empty quorum-commit set, which is why Go indexes qcommit[0] directly). Purely a consistency nit — harmless today.

  • [Low / informational] Error counting on the transport-error branch
    Evidence: crates/consensus/src/qbft/runner.rs:408-415
    In Go a broadcast failure surfaces as a non-context error from qbft.Run and is counted by IncConsensusError (qbft.go:598). In Rust the broadcast error is captured into transport_error and the core returns ContextCanceled (runner.rs:375), so incrementing error_total in the transport_error branch is the correct parity equivalent and avoids a miscount as a timeout. ✅ Just calling it out as a deliberate, correct divergence — no change needed.

Minor nits

  • crates/consensus/src/metrics.rs:13-14: the ProtocolDutyTimerLabels / ProtocolDutyLabels tuple-type aliases are a clean touch; labels() helper only covers the 3-tuple while set_decided_leader_index inlines the 2-tuple — fine as-is, optional to add a labels2() for symmetry.
  • The metrics.rs unit tests assert exact serialized output including bucket le values — good, high-signal coverage that locks in name/label/bucket parity.

Tests

Not run — cargo test/cargo clippy require interactive approval in this environment, so I could not execute them. Please confirm locally:

cargo test -p consensus --all-features
cargo clippy --workspace --all-targets --all-features -- -D warnings

The five metrics.rs tests plus duration_seconds_preserves_signed_order and the instance.rs channel tests look correct by inspection.

Open questions / assumptions

  1. Confirmed Participate does not observe duration in Go — Rust matches by only calling observe_qbft_consensus_duration from propose. ✅
  2. decided_at channel cap is 1 and take_decided_at_rx is reached only after the mark_proposed guard, so the receiver is taken at most once per duty — no double-take. ✅
    · branch iamquang95/consensus-metrics

@iamquang95 iamquang95 marked this pull request as ready for review June 12, 2026 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement core/consensus/metrics

1 participant