
[server] fix stop replica deletion stuck when TabletServer is offline#3391

Open
gyang94 wants to merge 4 commits into apache:main from gyang94:per-sender-retry

gyang94 (Contributor) commented May 27, 2026

Purpose

Linked issue: close #3357

Brief change log

Summary

When a stopReplica RPC fails because of a transient network issue or a TabletServer crash, the Coordinator has no reliable retry mechanism. As a result, replicas get stuck, table deletion never completes, and the tableCount metric never decreases.

This PR introduces a per-TabletServer sender thread model (aligned with Kafka's ControllerChannelManager / RequestSendThread) and a new ReplicaDeletionIneligible state. These changes provide robust retry and pause/resume semantics for replica deletion.

Changes

Core: Per-TS Sender Thread (ControlRequestSendThread)

  • Dedicated Sender Thread: Each TabletServer gets a dedicated sender thread with a FIFO queue.
  • New Replica State: Introduced ReplicaDeletionIneligible, a state for replicas whose deletion cannot proceed (e.g., the TS is offline or returned a business error).
  • Resume Logic: TableManager.resumeDeletions() implements 3-step logic:
    1. Complete if all replicas succeeded.
    2. Retry previously-ineligible replicas on alive TSes.
    3. Re-fire eligible tables.
  • Auto-Resume on Reconnect: processNewTabletServer() clears ineligible marks and triggers resumeDeletions(), so paused deletions automatically resume when a TS reconnects.
  • Handle Dead TS: processDeadTabletServer() transitions in-flight deletion replicas to ineligible.
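
To illustrate the three-step resume logic above, here is a minimal sketch for a single table. It is not the code in this PR: the class `ResumeDeletionSketch`, its `State` enum, the `PENDING` state, and the reading of step 3 (a table is re-fired only when none of its replicas is still ineligible) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical, simplified model of one table queued for deletion; not the actual Fluss classes. */
final class ResumeDeletionSketch {
    /** Illustrative states; only the "ineligible" idea comes from the PR description. */
    enum State { PENDING, DELETION_STARTED, DELETION_SUCCESSFUL, DELETION_INELIGIBLE }

    private final Map<Integer, Integer> serverOfReplica = new TreeMap<>();
    final Map<Integer, State> stateOfReplica = new TreeMap<>();
    /** Replica ids for which a stopReplica request would be enqueued. */
    final List<Integer> stopReplicaSent = new ArrayList<>();
    boolean completed = false;

    void addReplica(int replicaId, int serverId, State state) {
        serverOfReplica.put(replicaId, serverId);
        stateOfReplica.put(replicaId, state);
    }

    void resumeDeletion(Set<Integer> aliveServers) {
        // Step 1: complete the table deletion once every replica has succeeded.
        if (stateOfReplica.values().stream().allMatch(s -> s == State.DELETION_SUCCESSFUL)) {
            completed = true;
            return;
        }
        // Step 2: retry previously ineligible replicas whose TabletServer is alive again.
        for (Map.Entry<Integer, State> e : stateOfReplica.entrySet()) {
            boolean alive = aliveServers.contains(serverOfReplica.get(e.getKey()));
            if (e.getValue() == State.DELETION_INELIGIBLE && alive) {
                startDeletion(e);
            }
        }
        // Step 3 (assumed meaning): the table is eligible when no replica is still
        // ineligible; fire deletion for replicas that were never attempted.
        boolean blocked = stateOfReplica.values().stream()
                .anyMatch(s -> s == State.DELETION_INELIGIBLE);
        if (!blocked) {
            for (Map.Entry<Integer, State> e : stateOfReplica.entrySet()) {
                if (e.getValue() == State.PENDING) {
                    startDeletion(e);
                }
            }
        }
    }

    private void startDeletion(Map.Entry<Integer, State> e) {
        e.setValue(State.DELETION_STARTED);
        stopReplicaSent.add(e.getKey());
    }
}
```

In this model, calling `resumeDeletion` while a replica's TabletServer is offline leaves that replica ineligible and sends nothing for it; calling it again after the server rejoins moves the replica back to an in-flight state, which mirrors the auto-resume behaviour described for `processNewTabletServer()`.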

Config

  • coordinator.request.retry.backoff: Backoff between retries (default: 100ms).
  • coordinator.request.timeout: RPC timeout per attempt (default: 30s).
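
As a rough sketch of how a per-TS sender thread could combine a FIFO queue with these two settings, consider the following. It is illustrative only and not the `ControlRequestSendThread` from this PR: the class name `ControlRequestSenderSketch`, the `String` request type, and the `rpc` function are hypothetical stand-ins, and `retryBackoffMs` / `requestTimeoutMs` are assumed to carry the values of the two config options.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical sender for one TabletServer: FIFO queue, retry with backoff. */
final class ControlRequestSenderSketch extends Thread {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Function<String, CompletableFuture<Void>> rpc; // stand-in for the real RPC
    private final long retryBackoffMs;   // assumed: coordinator.request.retry.backoff
    private final long requestTimeoutMs; // assumed: coordinator.request.timeout
    private final List<String> delivered = new CopyOnWriteArrayList<>();
    private volatile boolean running = true;

    ControlRequestSenderSketch(Function<String, CompletableFuture<Void>> rpc,
                               long retryBackoffMs, long requestTimeoutMs) {
        super("control-request-sender-sketch");
        setDaemon(true);
        this.rpc = rpc;
        this.retryBackoffMs = retryBackoffMs;
        this.requestTimeoutMs = requestTimeoutMs;
    }

    void enqueue(String request) { queue.add(request); }

    List<String> delivered() { return new ArrayList<>(delivered); }

    @Override
    public void run() {
        try {
            while (running) {
                String request = queue.take(); // FIFO: the head request is sent first
                // Keep retrying the same request so later ones cannot overtake it.
                while (running && !trySend(request)) {
                    Thread.sleep(retryBackoffMs);
                }
            }
        } catch (InterruptedException e) {
            // Interrupted by shutdown(); fall through and let the thread exit.
        }
    }

    /** One attempt, bounded by the request timeout; failures and timeouts return false. */
    private boolean trySend(String request) throws InterruptedException {
        try {
            rpc.apply(request).get(requestTimeoutMs, TimeUnit.MILLISECONDS);
            delivered.add(request);
            return true;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        }
    }

    /** Test helper: waits until at least {@code count} requests were delivered. */
    boolean awaitDelivered(int count, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (delivered.size() < count && System.nanoTime() < deadline) {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return delivered.size() >= count;
    }

    void shutdown() {
        running = false;
        interrupt();
    }
}
```

Because the thread retries the head of the queue until it succeeds or the thread is shut down, requests to one TabletServer keep their order, and a slow or offline server only stalls its own queue rather than the Coordinator's event loop.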

What was removed

  • retryDeleteAndSuccessDeleteReplicas(): The old "retry-N-then-force-success" mechanism.
  • failDeleteNumbers tracking and DELETE_TRY_TIMES constant.
  • Direct RPC calls from CoordinatorRequestBatch (replaced by queue-based dispatch).

Tests

API and Format

Documentation

@gyang94 gyang94 force-pushed the per-sender-retry branch from 87952c7 to caaaebb Compare May 28, 2026 07:59

Development

Successfully merging this pull request may close these issues.

[server] Table deletion stuck permanently when StopReplica request fails
