Skip to content

[improve][broker] Add reset cursor latency metric#26088

Open
Technoboy- wants to merge 2 commits into
apache:masterfrom
Technoboy-:codex/reset-cursor-latency-pr2
Open

[improve][broker] Add reset cursor latency metric#26088
Technoboy- wants to merge 2 commits into
apache:masterfrom
Technoboy-:codex/reset-cursor-latency-pr2

Conversation

@Technoboy-

Copy link
Copy Markdown
Contributor

Motivation

Follow-up for investigating slow subscription cursor resets. When reset cursor is slow or fails, brokers currently lack a direct latency metric for the operation, and the binary protocol seek error path can dereference a missing cause while building the error response.

Modifications

  • Add pulsar_subscription_reset_cursor_latency_ms summary with topic, subscription, and result labels.
  • Record reset cursor latency for both message-id and timestamp reset paths, including early fenced failures.
  • Make broker seek error responses null-safe by unwrapping completion exceptions and falling back when the message is missing.
  • Recommend enabling managedLedgerReadEntryTimeoutSeconds by setting the broker/standalone/template defaults to 120 seconds and documenting 0 as the disable value.

Verifying this change

  • ./gradlew :pulsar-broker-common:compileJava :pulsar-broker:compileJava
  • ./gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.SubscriptionSeekTest.testSeek --tests org.apache.pulsar.broker.service.SubscriptionSeekTest.testSeekByTimestamp --tests org.apache.pulsar.broker.service.SubscriptionSeekTest.testConcurrentResetCursor

@Technoboy- Technoboy- marked this pull request as ready for review June 25, 2026 09:55
@Technoboy- Technoboy- requested a review from Copilot June 25, 2026 09:56
@Technoboy- Technoboy- self-assigned this Jun 25, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves broker observability and robustness around subscription cursor resets by adding a Prometheus latency metric for reset-cursor operations, hardening the binary protocol seek error response path against null causes, and enabling a managed-ledger read-entry timeout by default to reduce the risk of stuck reads blocking cursor operations.

Changes:

  • Added pulsar_subscription_reset_cursor_latency_ms Summary metric (labels: topic, subscription, result) and recorded latency for both timestamp- and position-based reset cursor paths, including early fenced failures.
  • Made seek/reset-cursor error responses safer by unwrapping completion exceptions and using a null-safe message fallback.
  • Changed managedLedgerReadEntryTimeoutSeconds default to 120 seconds and updated template configs/comments to recommend keeping it enabled (with 0 as the disable value).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java Unwraps completion exceptions in seek error handling and adds null-safe message formatting.
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentSubscription.java Introduces and records a Prometheus Summary metric for reset cursor latency (success/failure).
pulsar-broker-common/src/main/java/org/apache/pulsar/broker/ServiceConfiguration.java Updates the default managed-ledger read-entry timeout and expands its documentation.
deployment/terraform-ansible/templates/broker.conf Updates the default managed-ledger read-entry timeout and documents 0 as disable.
conf/standalone.conf Updates the default managed-ledger read-entry timeout and documents 0 as disable.
conf/broker.conf Updates the default managed-ledger read-entry timeout and documents 0 as disable.

Comment thread pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java Outdated
Comment thread pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java Outdated
@void-ptr974

Copy link
Copy Markdown
Contributor

This metric may leak high-cardinality label children.

RESET_CURSOR_LATENCY.labels(topicName, subName, result) creates a Summary child per topic/subscription/result, and each Summary.Child also registers a DataSketchesSummaryLogger in the static rotation list. I don't see any cleanup when the subscription/topic is closed or deleted, so subscriptions that ever perform reset cursor can remain in the collector and rotation list indefinitely. On brokers with many short-lived topics/subscriptions, this can grow memory usage, periodic rotation cost, and Prometheus series count.

Could we either avoid topic/subscription labels for this static summary, or record this through subscription-scoped stats with lifecycle cleanup? If keeping these labels is required, we should add cleanup on subscription/topic deletion that removes both the collector child and its associated summary logger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants