Skip to content

feat(backup_request): add rate-limited backup request policy (#3228)#3229

Open
feng-y wants to merge 9 commits intoapache:masterfrom
feng-y:feat/rate-limited-backup-request
Open

feat(backup_request): add rate-limited backup request policy (#3228)#3229
feng-y wants to merge 9 commits intoapache:masterfrom
feng-y:feat/rate-limited-backup-request

Conversation

@feng-y
Copy link
Contributor

@feng-y feng-y commented Feb 25, 2026

Summary

Add ratio-based rate limiting for backup requests to prevent backup request storms under high QPS or downstream latency spikes.

Design

  • BackupRequestPolicy interface unchanged (ABI stable)
  • BackupRateLimiter: standalone statistics module tracking backup/total ratio within a sliding time window using bvar::Window<bvar::Adder> + relaxed atomics for lock-free best-effort throttling
  • RateLimitedBackupPolicy: internal implementation composing BackupRateLimiter, hidden in .cpp
  • RateLimitedBackupPolicyOptions: public config struct with sensible defaults
  • CreateRateLimitedBackupPolicy(opts): public factory function (validates inputs, returns NULL on error)
  • Ownership: caller owns the returned pointer; use unique_ptr or set on ChannelOptions.backup_request_policy

Configuration (RateLimitedBackupPolicyOptions)

Field Default Description
backup_request_ms -1 Timeout in ms; -1 inherits from ChannelOptions.backup_request_ms (only when set via ChannelOptions). Must be >= -1.
max_backup_ratio 0.1 Max ratio of backup decisions to total RPC legs; range (0, 1]. 1.0 = no effective limit.
window_size_seconds 10 Sliding window width; range [1, 3600]
update_interval_seconds 5 Cached-ratio refresh interval; must be >= 1

Edge cases handled

  • Cold start (no traffic yet): ratio defaults to 0.0 → backups allowed freely
  • Latency spike (backups issued but no completions yet): conservative ratio=1.0 → suppresses further backups until an RPC leg completes
  • max_backup_ratio > 1.0: rejected by factory (returns NULL with LOG(ERROR))
  • Factory returns NULL on invalid params; caller must check

Rate-limiting semantics

The denominator counts all completed RPC legs (original + backup), not user-visible RPCs. For max_backup_ratio=R, the effective fraction of user RPCs receiving a backup is approximately R/(1-R) (e.g. ~11% for R=0.1). The cached ratio refreshes at most once per update_interval_seconds, so short bursts within that window may transiently exceed the limit.

Closes #3228

Usage

brpc::RateLimitedBackupPolicyOptions opts;
opts.backup_request_ms = 100;    // or -1 to inherit from ChannelOptions
opts.max_backup_ratio  = 0.1;    // suppress backups when >10% of legs are backups

std::unique_ptr<brpc::BackupRequestPolicy> policy(
    brpc::CreateRateLimitedBackupPolicy(opts));

brpc::ChannelOptions options;
options.backup_request_ms     = 100;
options.backup_request_policy = policy.get();  // channel does NOT take ownership
channel.Init("...", &options);
// policy must outlive channel

Test plan

  • 7 invalid-parameter tests (backup_request_ms, max_backup_ratio, window_size, update_interval)
  • ValidMinusOneBackupRequestMsInherits: factory succeeds, GetBackupRequestMs returns -1
  • ValidMaxRatioAtBoundary: max_backup_ratio=1.0 accepted, cold-start allows first backup
  • ColdStartAllowsBackup: first DoBackup() always returns true
  • AfterColdStartBackupSuppressedUntilRpcCompletes: after 1.2s interval, DoBackup() returns false (conservative ratio=1.0)

@feng-y
Copy link
Contributor Author

feng-y commented Feb 25, 2026

@copilot review this PR

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a rate-limiting mechanism for backup requests, aiming to prevent “backup request storms” by suppressing backups when the backup/total ratio exceeds a configurable threshold over a sliding time window.

Changes:

  • Introduces ChannelOptions.backup_request_max_ratio and creates an internal rate-limited BackupRequestPolicy when enabled.
  • Adds global gflags for ratio threshold, window size, and ratio cache update interval, backed by bvar::Window + cumulative counters for cold start.
  • Adds a unit test covering policy creation/precedence plus a timing-based functional check.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test/brpc_channel_unittest.cpp Adds coverage for rate-limited backup request behavior and configuration precedence.
src/brpc/channel.h Adds per-channel configuration and internal ownership for an auto-created backup policy.
src/brpc/channel.cpp Creates and wires an internal rate-limited backup policy during channel initialization.
src/brpc/backup_request_policy.h Refactors BackupRequestPolicy into a concrete base with optional rate limiting support.
src/brpc/backup_request_policy.cpp Implements rate limiter, gflags, and default policy behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/brpc/backup_request_policy.cpp:120

  • When compare_exchange_strong fails on line 105, the code continues to use the potentially stale ratio value loaded on line 102, even though another thread may have updated _cached_ratio on line 116. While the comment acknowledges this is "best-effort rate limiting," consider reloading _cached_ratio after a failed compare_exchange to reduce the window of staleness, especially in high-concurrency scenarios where the stale value could lead to temporarily incorrect rate limiting decisions.
    bool DoBackup(const Controller* /*controller*/) const override {
        const int64_t now_us = butil::cpuwide_time_us();
        int64_t last_us = _last_update_us.load(butil::memory_order_relaxed);
        double ratio = _cached_ratio.load(butil::memory_order_relaxed);

        if (now_us - last_us >= _update_interval_us) {
            if (_last_update_us.compare_exchange_strong(
                    last_us, now_us, butil::memory_order_relaxed)) {
                int64_t total = _total_window.get_value();
                int64_t backup = _backup_window.get_value();
                // Fall back to cumulative counts when the window has no
                // sampled data yet (cold-start within the first few seconds).
                if (total <= 0) {
                    total = _total_count.get_value();
                    backup = _backup_count.get_value();
                }
                ratio = (total > 0) ? static_cast<double>(backup) / total : 0.0;
                _cached_ratio.store(ratio, butil::memory_order_relaxed);
            }
        }

        return ratio < _max_backup_ratio;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch 3 times, most recently from 7b50fdc to e6db035 Compare February 25, 2026 11:19
@feng-y feng-y requested a review from Copilot February 25, 2026 11:22
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

test/brpc_channel_unittest.cpp:2856

  • The comment says “ratio refreshes on every DoBackup call” but with FLAGS_backup_request_ratio_update_interval_s = 1 the ratio cache can only refresh at most once per second. This makes the test rationale misleading when debugging failures; please adjust the comment to match the actual behavior.
    // Test 5: Functional — rate limiting reduces backup requests.
    // Use short update interval so ratio refreshes on every DoBackup call.
    // Save/restore gflags outside the inner scope so cleanup always runs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch 3 times, most recently from 4ed6666 to e614e33 Compare February 25, 2026 11:43
@feng-y feng-y requested a review from Copilot February 25, 2026 12:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (4)

src/brpc/channel.cpp:284

  • Consider checking if CreateRateLimitedBackupPolicy returns NULL before setting _owns_backup_policy to true. While the current guards (line 278 checks max_ratio > 0, validators ensure gflags are valid) should prevent NULL returns in practice, defensive programming suggests verifying the return value and only setting _owns_backup_policy if the policy was successfully created. This would make the code more robust against future changes to CreateRateLimitedBackupPolicy or the validators.
        _options.backup_request_policy = CreateRateLimitedBackupPolicy(
            _options.backup_request_ms, max_ratio,
            FLAGS_backup_request_ratio_window_size_s,
            FLAGS_backup_request_ratio_update_interval_s);
        _owns_backup_policy = true;

test/brpc_channel_unittest.cpp:2868

  • The gflag FLAGS_backup_request_max_ratio is saved and restored but never actually modified in this test. The test uses opt.backup_request_max_ratio = 0.3 (per-channel config) instead of the global gflag. The save/restore of saved_ratio at lines 2862 and 2868 is unnecessary and could be removed for clarity.
        const double saved_ratio = brpc::FLAGS_backup_request_max_ratio;
        brpc::FLAGS_backup_request_ratio_update_interval_s = 1;
        brpc::FLAGS_backup_request_ratio_window_size_s = 2;
        BRPC_SCOPE_EXIT {
            brpc::FLAGS_backup_request_ratio_update_interval_s = saved_interval;
            brpc::FLAGS_backup_request_ratio_window_size_s = saved_window;
            brpc::FLAGS_backup_request_max_ratio = saved_ratio;

src/brpc/backup_request_policy.h:48

  • The documentation states "The caller owns the returned pointer" but this is inconsistent with actual usage in channel.cpp where the Channel owns the policy when created via backup_request_max_ratio. Consider clarifying: "The caller owns the returned pointer and must delete it when done. However, when Channel creates this policy internally via backup_request_max_ratio, Channel manages the lifetime automatically."
// The caller owns the returned pointer.

test/brpc_channel_unittest.cpp:2853

  • Missing test coverage for backup_request_max_ratio = 0 (explicitly disabled). While -1 means "use gflag default", 0 should explicitly disable rate limiting even if the gflag is set to enable it. Consider adding a test case that sets the gflag to a positive value, then creates a channel with opt.backup_request_max_ratio = 0, and verifies that no policy is created.
    // Test 4: No policy when backup_request_ms < 0 (backup disabled)
    {
        brpc::ChannelOptions opt;
        opt.backup_request_ms = -1;
        opt.backup_request_max_ratio = 0.3;
        brpc::Channel channel;
        ASSERT_EQ(0, channel.Init(_ep, &opt));
        ASSERT_TRUE(channel.options().backup_request_policy == NULL);
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@feng-y feng-y requested a review from Copilot February 25, 2026 12:18
@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch from e614e33 to 27fcaee Compare February 25, 2026 12:21
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (3)

test/brpc_channel_unittest.cpp:2831

  • Missing test coverage: The test doesn't cover the edge case where backup_request_max_ratio = 0. According to the PR description, 0 should explicitly disable rate limiting. A test case similar to Test 2 but with max_ratio = 0 instead of -1 would verify this behavior and ensure both forms of disabling work correctly.
    // Test 2: No policy when ratio is disabled (-1)
    {
        brpc::ChannelOptions opt;
        opt.backup_request_ms = 10;
        opt.backup_request_max_ratio = -1;
        brpc::Channel channel;
        ASSERT_EQ(0, channel.Init(_ep, &opt));
        ASSERT_TRUE(channel.options().backup_request_policy == NULL);
    }

test/brpc_channel_unittest.cpp:2831

  • Missing test coverage: The test doesn't cover the special case where backup_request_max_ratio = 1.0. According to the implementation (backup_request_policy.cpp:106-107), values >= 1.0 are treated as "no limit" since the ratio cannot exceed 1.0. A test should verify that setting max_ratio to 1.0 allows all backup requests, similar to having no rate limiting.
    // Test 2: No policy when ratio is disabled (-1)
    {
        brpc::ChannelOptions opt;
        opt.backup_request_ms = 10;
        opt.backup_request_max_ratio = -1;
        brpc::Channel channel;
        ASSERT_EQ(0, channel.Init(_ep, &opt));
        ASSERT_TRUE(channel.options().backup_request_policy == NULL);
    }

test/brpc_channel_unittest.cpp:2831

  • Missing test coverage: The test doesn't cover the case where backup_request_max_ratio > 1.0. According to channel.cpp:273-276, values > 1.0 should be clamped to 1.0 with a warning. A test should verify this clamping behavior works correctly and that the policy is still created with the clamped value.
    // Test 2: No policy when ratio is disabled (-1)
    {
        brpc::ChannelOptions opt;
        opt.backup_request_ms = 10;
        opt.backup_request_max_ratio = -1;
        brpc::Channel channel;
        ASSERT_EQ(0, channel.Init(_ep, &opt));
        ASSERT_TRUE(channel.options().backup_request_policy == NULL);
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch from 27fcaee to abe44a3 Compare February 25, 2026 12:28
@feng-y feng-y requested a review from Copilot February 25, 2026 12:32
@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch 2 times, most recently from d9a2e0d to efd2f73 Compare February 25, 2026 12:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch from efd2f73 to 95af5c8 Compare February 25, 2026 12:47
@feng-y
Copy link
Contributor Author

feng-y commented Feb 28, 2026

@wwbmmm Pipeline failed: some packages failed to download.

@wwbmmm
Copy link
Contributor

wwbmmm commented Mar 1, 2026

@wwbmmm Pipeline failed: some packages failed to download.

Please merge master branch which fixed CI failure.

@feng-y feng-y force-pushed the feat/rate-limited-backup-request branch from e840999 to 3243190 Compare March 1, 2026 11:27
…e guidance

- Promote built-in factory function to its own subsection (before custom interface)
- Add unique_ptr usage example for policy lifetime management
- Add RateLimitedBackupPolicyOptions parameter table with defaults/constraints
- Document NULL return on invalid params
- Keep cn/en docs in sync
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

feng-y added 2 commits March 1, 2026 22:10
…nts, tests

- controller.cpp: When policy returns -1 (inherit sentinel), fall back
  to _backup_request_ms set from ChannelOptions, so backup timer is
  actually armed when using a policy with backup_request_ms=-1.
- backup_request_policy.cpp: Clarify OnRPCEnd comment to say 'RPC legs'
  (both original and backup completions counted as denominator).
- backup_request_policy.cpp: Warn when update_interval_seconds exceeds
  window_size_seconds (window would rarely refresh within its period).
- backup_request_policy.h: Fix comment typo ('Called when an RPC ends').
- brpc_channel_unittest.cpp: Replace nullptr with NULL to match codebase
  convention; use ASSERT_TRUE(p != NULL) for unique_ptr null checks.
- brpc_channel_unittest.cpp: Add ValidMaxRatioAtBoundary behavioral assert
  and AfterColdStartBackupSuppressedUntilRpcCompletes test.
…test

- docs: fix backup_request_ms default (0→-1) and constraint (>=0→>=-1);
  add note that -1 inherit only works via ChannelOptions injection path,
  not Controller::set_backup_request_policy().
- test: replace no-op AfterColdStart test with a real behavioral assertion:
  after cold-start backup fires, wait 1.2s for ratio refresh, verify
  DoBackup() returns false (conservative ratio=1.0 path triggers).
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

test/brpc_channel_unittest.cpp:3151

  • The explanation for the cold-start behavior here doesn’t match the current limiter logic: on true cold start (total==0 and backup==0) the computed ratio is 0.0 (allow), and only after backups are counted and the update interval elapses can the conservative ratio=1.0 path kick in (total==0, backup>0). Updating these comments to match the actual transitions would make the test easier to understand/maintain.
    // With max_backup_ratio=1.0, backup is allowed until backup_count/total
    // reaches 1.0. At cold start (total=0) the conservative path applies:
    // ratio is treated as 1.0, so DoBackup() returns false only after the
    // first backup is counted.
    ASSERT_TRUE(p->DoBackup(NULL));   // cold start: no total yet, allow first

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

test/brpc_channel_unittest.cpp:3152

  • The explanatory comment about cold start here doesn’t match the implementation in BackupRateLimiter::ShouldAllow(): when total==0 and backup==0, the code treats this as “true cold start” and sets ratio=0.0 (allow freely). The conservative ratio=1.0 path only applies after backups have been counted but totals are still 0. Consider updating the comment to reflect the actual behavior to avoid confusion when maintaining the test.
    // With max_backup_ratio=1.0, backup is allowed until backup_count/total
    // reaches 1.0. At cold start (total=0) the conservative path applies:
    // ratio is treated as 1.0, so DoBackup() returns false only after the
    // first backup is counted.
    ASSERT_TRUE(p->DoBackup(NULL));   // cold start: no total yet, allow first
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

feng-y added 2 commits March 1, 2026 22:50
…CEnd comment, re-allow test, docs

- controller.cpp: treat -1 specifically (not all negatives) as the inherit
  sentinel; other negatives still disable backup, preserving old behavior
  for custom policies that return negative values to disable backup
- backup_request_policy.h: document the -1 sentinel contract on
  GetBackupRequestMs() so custom implementors know the new interface
- backup_request_policy.cpp: fix OnRPCEnd comment — called once per
  user-level RPC, not once per leg (total_count tracks user RPCs)
- test: add OnRPCEndDrivesRatioDownAndReAllows — fires 20 backups to
  suppress, then completes 50 RPCs via OnRPCEnd, verifies DoBackup
  re-allows once ratio refreshes below max_backup_ratio
- docs (EN+CN): rephrase backup_request_ms=-1 note to clarify the
  channel-level fallback only applies when set via ChannelOptions
Plain new follows brpc's project-wide OOM convention (abort rather than
return NULL). The factory's NULL return already exclusively signals invalid
parameters, not allocation failure — adding std::nothrow would conflate
the two. Comment added to suppress future linter/AI suggestions.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…oyed before policy

The unique_ptr comment was ambiguous: 'released when goes out of scope,
as long as it outlives the channel' can be read as contradictory. Reword
to make the ordering explicit: destroy channel first, then policy.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

test/brpc_channel_unittest.cpp:3150

  • The comment describing the cold-start behavior is inaccurate: when total==0 and backup==0, BackupRateLimiter::ShouldAllow() treats the ratio as 0.0 (true cold start), not 1.0 (the conservative path only applies when backup>0 && total==0). Please update the comment so it matches the implementation and doesn’t mislead future readers.
    // With max_backup_ratio=1.0, backup is allowed until backup_count/total
    // reaches 1.0. At cold start (total=0) the conservative path applies:
    // ratio is treated as 1.0, so DoBackup() returns false only after the
    // first backup is counted.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…tioAtBoundary

ratio=1.0 conservative path only applies when backup>0 && total==0.
True cold start (both zero) sets ratio=0.0 and allows freely.
Copy link
Contributor

@chenBright chenBright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@feng-y
Copy link
Contributor Author

feng-y commented Mar 2, 2026

@wwbmmm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] 为 backup request 增加比例限制机制

4 participants