feat(consensus): implement metrics for consensus #470
e7b28c9 to ee8ac59
@claude review this PR
Claude finished @iamquang95's task in 3m 4s. Review:

| Component | Go (metrics.go / qbft.go) | Rust | Match |
|---|---|---|---|
| Prefix | core + consensus | `prefix = "core_consensus"` | ✅ |
| `decided_rounds` (gauge, [protocol,duty,timer]) | ✅ | ✅ | ✅ |
| `decided_leader_index` (gauge, [protocol,duty]) | ✅ | ✅ | ✅ |
| `duration_seconds` buckets | .01…3,5 (17) | identical 17 | ✅ |
| `timeout_total` / `error_total` | ✅ | ✅ | ✅ |
| `proposedAt` capture point | after value/hash/verify send, before `MaybeStart` (qbft.go:433) | runner.rs:153, same point | ✅ |
| Duration observed only on Propose path (not Participate) | qbft.go:435 defer | runner.rs:163 in `propose` only | ✅ |
| Duration observed even when instance already running (`!MaybeStart`) | qbft.go:446 + defer | runner.rs:157-163 | ✅ |
| Timeout on `!decided` after Run | qbft.go:603-605 | `ContextCanceled && !decided` (runner.rs:422-428) | ✅ |
| Error on non-context Run error | qbft.go:596-598 | `Err(err)` => inc + `Error::Core` + transport-error branch | ✅ (see L1) |
| Signed duration when `decidedAt < proposedAt` | `time.Duration` is signed (qbft.go:439) | `duration_seconds` negative fallback | ✅ |

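The last table row (signed duration) can be sketched as follows. This is a minimal illustration using only the standard library, assuming both timestamps are `SystemTime` values; the function name `signed_duration_seconds` is hypothetical and the PR's actual `duration_seconds` may be implemented differently.

```rust
use std::time::SystemTime;

/// Signed number of seconds from `proposed_at` to `decided_at`.
/// Hypothetical helper, not the PR's code: it only illustrates how a
/// negative value can be preserved when `decided_at` precedes `proposed_at`.
fn signed_duration_seconds(proposed_at: SystemTime, decided_at: SystemTime) -> f64 {
    match decided_at.duration_since(proposed_at) {
        // Normal case: the decision happened at or after the proposal.
        Ok(elapsed) => elapsed.as_secs_f64(),
        // `decided_at` is earlier: the error carries the absolute gap,
        // so negate it instead of clamping to zero.
        Err(err) => -err.duration().as_secs_f64(),
    }
}
```

Keeping the sign mirrors Go, where subtracting two `time.Time` values yields a signed `time.Duration`.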
Findings
- [Low] `decided.store(true)` is unconditional while the decided-at send + decision metrics are gated on `!qcommit.is_empty()`
  Evidence: `crates/consensus/src/qbft/runner.rs:330-334`
  Go reference: `charon/core/consensus/qbft/qbft.go:548-563` (`round := qcommit[0].Round()`, `decided = true` and `inst.DecidedAtCh <- …` all unconditional).
  Impact: in the (in-practice impossible) empty-`qcommit` case, Rust would set `decided = true` but record neither the leader/round metrics nor the decided timestamp, so the instance resolves as `Ok` with no duration observed and no decision metrics. The empty-guard is good defensive code, but it's internally inconsistent with the unconditional `decided.store`. Consider either moving `decided.store(true)` inside the same guard, or dropping the guard (qbft only invokes the decide callback with a non-empty quorum-commit set, which is why Go indexes `qcommit[0]` directly). Purely a consistency nit; harmless today.
- [Low / informational] Error counting on the transport-error branch
  Evidence: `crates/consensus/src/qbft/runner.rs:408-415`
  In Go, a broadcast failure surfaces as a non-context error from `qbft.Run` and is counted by `IncConsensusError` (`qbft.go:598`). In Rust, the broadcast error is captured into `transport_error` and the core returns `ContextCanceled` (`runner.rs:375`), so incrementing `error_total` in the `transport_error` branch is the correct parity equivalent and avoids a miscount as a timeout. ✅ Just calling it out as a deliberate, correct divergence; no change needed.
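The first option from the [Low] finding (moving `decided.store(true)` inside the same guard) could look roughly like the toy model below. The `Commit` and `Recorded` types and the `on_decide` function are hypothetical stand-ins for the real callback, gauges, and decided-at channel in `runner.rs`; only the shape of the guard is the point.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy stand-in for a commit message; only the round matters here.
struct Commit {
    round: i64,
}

/// What the decide callback recorded. Stands in for the round/leader
/// gauges and the decided-at send (assumed names, not the PR's types).
#[derive(Debug, Default, PartialEq)]
struct Recorded {
    round: Option<i64>,
    decided_at_sent: bool,
}

/// Variant where the `decided` flag and the metrics share one guard, so an
/// empty `qcommit` leaves all three untouched.
fn on_decide(qcommit: &[Commit], decided: &AtomicBool, rec: &mut Recorded) {
    // `first()` is `None` for an empty slice, so nothing below runs then.
    if let Some(first) = qcommit.first() {
        decided.store(true, Ordering::SeqCst);
        rec.round = Some(first.round);
        rec.decided_at_sent = true;
    }
}
```

With this shape an instance can never be flagged as decided without the matching metrics and timestamp. Dropping the guard entirely, as Go does, would be the other consistent choice.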
Minor nits
- `crates/consensus/src/metrics.rs:13-14`: the `ProtocolDutyTimerLabels` / `ProtocolDutyLabels` tuple-type aliases are a clean touch; the `labels()` helper only covers the 3-tuple while `set_decided_leader_index` inlines the 2-tuple. Fine as-is; optional to add a `labels2()` for symmetry.
- The `metrics.rs` unit tests assert exact serialized output including bucket `le` values. Good, high-signal coverage that locks in name/label/bucket parity.
Tests
Not run: `cargo test` / `cargo clippy` require interactive approval in this environment, so I could not execute them. Please confirm locally:

- `cargo test -p consensus --all-features`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`

The five `metrics.rs` tests plus `duration_seconds_preserves_signed_order` and the `instance.rs` channel tests look correct by inspection.
Open questions / assumptions
- Confirmed Participate does not observe duration in Go; Rust matches by only calling `observe_qbft_consensus_duration` from `propose`. ✅
- `decided_at` channel cap is 1 and `take_decided_at_rx` is reached only after the `mark_proposed` guard, so the receiver is taken at most once per duty: no double-take. ✅
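The take-once property in the second point can be illustrated with a small sketch. It uses `std::sync::mpsc::sync_channel` as a stand-in (the PR's channel type is not shown here), and the `Instance` struct and its field names are assumptions; only `mark_proposed` and `take_decided_at_rx` are names taken from the review.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::time::SystemTime;

/// Hypothetical per-duty instance holding a capacity-1 decided-at channel.
struct Instance {
    proposed: bool,
    decided_at_tx: SyncSender<SystemTime>,
    decided_at_rx: Option<Receiver<SystemTime>>,
}

impl Instance {
    fn new() -> Self {
        // Capacity 1: a single decided-at timestamp fits without blocking.
        let (tx, rx) = sync_channel(1);
        Instance { proposed: false, decided_at_tx: tx, decided_at_rx: Some(rx) }
    }

    /// Returns `true` only for the first caller; later calls see the flag set.
    fn mark_proposed(&mut self) -> bool {
        !std::mem::replace(&mut self.proposed, true)
    }

    /// Moves the receiver out; a second call would yield `None`.
    fn take_decided_at_rx(&mut self) -> Option<Receiver<SystemTime>> {
        self.decided_at_rx.take()
    }
}

/// The receiver is requested only after the `mark_proposed` guard passes,
/// so it is handed out at most once per instance.
fn propose(inst: &mut Instance) -> Option<Receiver<SystemTime>> {
    if !inst.mark_proposed() {
        return None;
    }
    inst.take_decided_at_rx()
}
```

Storing the receiver in an `Option` and using `take()` makes a double-take visible as `None` rather than a panic, while the guard ensures that path is never reached in practice.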
Branch: `iamquang95/consensus-metrics`
Fix #62
Stack on consensus/qbft