Merged
76 changes: 75 additions & 1 deletion docs/cn/backup_request.md
@@ -6,7 +6,7 @@ Enable backup request on the Channel. This Channel first sends the request to one of the servers

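Sample code is in [example/backup_request_c++](https://github.com/apache/brpc/blob/master/example/backup_request_c++). In this example, the client is configured to send a backup request after 2ms, and the server deliberately sleeps for 20ms on even-numbered requests to trigger a backup request.

After running, the client-side and server-side logs look as follows; index is the request number. The server deliberately sleeps for 20ms after receiving the first request, the client then sends another request with the same index, and the final latency is not affected by the deliberate sleep.
After running, the client-side and server-side logs look as follows; "index" is the request number. The server deliberately sleeps for 20ms after receiving the first request, the client then sends another request with the same index, and the final latency is not affected by the deliberate sleep.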

![img](../images/backup_request_1.png)

@@ -39,6 +39,80 @@ my_func_latency << tm.u_elapsed(); // u stands for microseconds; there are also s_elapsed(), m_elap
// Done. /vars will now show my_func_qps, my_func_latency, my_func_latency_cdf and many other counters.
```
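## Backup request rate limiting
To limit the ratio of backup requests sent, create a rate-limiting policy with the built-in factory function, or implement the `BackupRequestPolicy` interface yourself.
Priority order: `backup_request_policy` > `backup_request_ms`.
### Using the built-in rate-limiting policy
Call `CreateRateLimitedBackupPolicy` to create a rate-limiting policy and set it on `ChannelOptions.backup_request_policy`: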
```c++
#include "brpc/backup_request_policy.h"
#include <memory>
brpc::RateLimitedBackupPolicyOptions opts;
opts.backup_request_ms = 10; // send a backup request if no response arrives within 10ms
opts.max_backup_ratio = 0.3; // cap backup requests at 30%
opts.window_size_seconds = 10; // sliding window width (seconds)
opts.update_interval_seconds = 5; // refresh interval of the cached ratio (seconds)
// The caller is responsible for freeing the pointer returned by CreateRateLimitedBackupPolicy.
// The policy must outlive the channel: destroy the channel first, then the policy.
std::unique_ptr<brpc::BackupRequestPolicy> policy(
    brpc::CreateRateLimitedBackupPolicy(opts));
brpc::ChannelOptions options;
options.backup_request_policy = policy.get(); // the Channel does not own this object
channel.Init(..., &options);
// The channel must be destroyed before the policy is destructed.
```

Parameters (`RateLimitedBackupPolicyOptions`):

| Field | Default | Description |
|------|--------|------|
| `backup_request_ms` | -1 | Timeout threshold (ms). -1 means inherit `ChannelOptions.backup_request_ms` (only effective when the policy is set via `ChannelOptions.backup_request_policy`; when injected through a Controller there is no channel-level fallback, so specify an explicit value >= 0). Must be >= -1. |
| `max_backup_ratio` | 0.1 | Upper bound on the backup ratio; range (0, 1] |
| `window_size_seconds` | 10 | Sliding window width (seconds); range [1, 3600] |
| `update_interval_seconds` | 5 | Cache refresh interval (seconds); must be >= 1 |

`CreateRateLimitedBackupPolicy` returns `NULL` when a parameter is invalid.

### Using a custom BackupRequestPolicy

For full control, implement the `BackupRequestPolicy` interface and set it on `ChannelOptions.backup_request_policy`:

```c++
#include "brpc/backup_request_policy.h"

class MyBackupPolicy : public brpc::BackupRequestPolicy {
public:
    int32_t GetBackupRequestMs(const brpc::Controller*) const override {
        return 10; // send the backup request after 10ms
    }
    bool DoBackup(const brpc::Controller*) const override {
        return should_allow_backup(); // custom logic
    }
    void OnRPCEnd(const brpc::Controller*) override {
        // called when each RPC ends; statistics can be updated here
    }
};

MyBackupPolicy my_policy;
brpc::ChannelOptions options;
options.backup_request_policy = &my_policy; // the Channel does not own this object; it must outlive the Channel
channel.Init(..., &options);
```
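### Implementation notes
- The ratio is tracked over a sliding time window with bvar counters. The cached value is refreshed at most once every `update_interval_seconds` through a lock-free CAS election, so the per-RPC overhead is very low (only two atomic loads on the common path).
- Backup decisions are counted immediately when they are made (before the RPC completes), to give faster feedback during latency jitter. The total RPC count is updated on completion. This means the ratio may briefly lag during jitter, which is by design: the limiter aims at approximate, best-effort throttling rather than exact enforcement.
- Each Channel that uses rate limiting maintains two `bvar::Window` sampler tasks; keep this overhead in mind in deployments with a very large number of Channels.
# When backend servers cannot be placed under a single naming service
[Recommended] Create a SelectiveChannel with backup request enabled that contains two sub channels. Accessing this SelectiveChannel works like the case above: it first accesses one sub channel, and if no response returns within ChannelOptions.backup_request_ms, it accesses the other sub channel. If each sub channel corresponds to a cluster, this approach makes the two clusters back each other up. For a SelectiveChannel example see [example/selective_echo_c++](https://github.com/apache/brpc/tree/master/example/selective_echo_c++); for the concrete steps, follow the procedure above.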
74 changes: 74 additions & 0 deletions docs/en/backup_request.md
@@ -39,6 +39,80 @@ my_func_latency << tm.u_elapsed(); // u stands for microsecond, and s_elaps
// All work is done here. my_func_qps, my_func_latency, my_func_latency_cdf and many other counters will be shown in /vars.
```
## Rate-limited backup requests
To limit the ratio of backup requests sent, use the built-in factory function or implement the `BackupRequestPolicy` interface yourself.
When both are set, `backup_request_policy` takes precedence over `backup_request_ms`.
### Using the built-in rate-limiting policy
Call `CreateRateLimitedBackupPolicy` and set the result on `ChannelOptions.backup_request_policy`:
```c++
#include "brpc/backup_request_policy.h"
#include <memory>
brpc::RateLimitedBackupPolicyOptions opts;
opts.backup_request_ms = 10; // send backup if RPC does not complete within 10ms
opts.max_backup_ratio = 0.3; // cap backup requests at 30% of total
opts.window_size_seconds = 10; // sliding window width in seconds
opts.update_interval_seconds = 5; // how often the cached ratio is refreshed
// The caller owns the returned pointer.
// The policy must outlive the channel — destroy the channel before the policy.
std::unique_ptr<brpc::BackupRequestPolicy> policy(
brpc::CreateRateLimitedBackupPolicy(opts));
brpc::ChannelOptions options;
options.backup_request_policy = policy.get(); // NOT owned by channel
channel.Init(..., &options);
// channel must be destroyed before policy goes out of scope.
```
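
The ownership rule in the comments above comes down to destruction order. The following is a minimal sketch, assuming hypothetical `FakePolicy`, `FakeChannel` and `Client` types that are stand-ins and not brpc classes. It shows one way to get the order right without manual cleanup: hold both objects as members and declare the policy before the channel, since C++ destroys non-static members in reverse declaration order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the policy object; it only records when it is
// destroyed. Not a brpc type.
struct FakePolicy {
    explicit FakePolicy(std::vector<std::string>* log) : log_(log) {}
    ~FakePolicy() { log_->push_back("policy"); }
    std::vector<std::string>* log_;
};

// Hypothetical stand-in for the channel; it borrows the policy pointer and
// records when it is destroyed. Not a brpc type.
struct FakeChannel {
    FakeChannel(const FakePolicy* policy, std::vector<std::string>* log)
        : policy_(policy), log_(log) {
        assert(policy != nullptr);  // the channel never owns the policy
    }
    ~FakeChannel() { log_->push_back("channel"); }
    const FakePolicy* policy_;  // borrowed, not owned
    std::vector<std::string>* log_;
};

// Bundles both objects. Members are destroyed in reverse declaration order,
// so `channel` goes away before `policy`.
struct Client {
    explicit Client(std::vector<std::string>* log)
        : policy(log), channel(&policy, log) {}
    FakePolicy policy;    // declared first, destroyed last
    FakeChannel channel;  // declared last, destroyed first
};

// Creates and destroys a Client, returning the recorded destruction order.
std::vector<std::string> DestructionOrder() {
    std::vector<std::string> log;
    {
        Client client(&log);
    }
    return log;
}
```

The same reasoning applies to local variables: a policy declared before the channel in the same scope is destroyed after it.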

`RateLimitedBackupPolicyOptions` fields:

| Field | Default | Description |
|-------|---------|-------------|
| `backup_request_ms` | -1 | Timeout threshold in ms. -1 means inherit from `ChannelOptions.backup_request_ms` (only works when the policy is set via `ChannelOptions.backup_request_policy`; at controller level there is no channel-level fallback, so set an explicit >= 0 value instead). Must be >= -1. |
| `max_backup_ratio` | 0.1 | Max backup ratio; range (0, 1] |
| `window_size_seconds` | 10 | Sliding window width in seconds; range [1, 3600] |
| `update_interval_seconds` | 5 | Cached-ratio refresh interval in seconds; must be >= 1 |

`CreateRateLimitedBackupPolicy` returns `NULL` if any parameter is invalid.
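
The ranges in the table can be checked before calling the factory. The following is a minimal sketch, assuming a hypothetical `SketchOptions` struct that mirrors the documented fields and defaults (it is not the brpc type) and hypothetical helpers `IsValid` and `WarnsOnSlowRefresh`. The conditions follow the checks in `CreateRateLimitedBackupPolicy` from this change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the documented fields and defaults; not the brpc type.
struct SketchOptions {
    int32_t backup_request_ms = -1;
    double max_backup_ratio = 0.1;
    int window_size_seconds = 10;
    int update_interval_seconds = 5;
};

// Returns true when every field is inside its documented range, i.e. when
// the factory would not return NULL because of a parameter error.
bool IsValid(const SketchOptions& o) {
    if (o.backup_request_ms < -1) return false;
    if (o.max_backup_ratio <= 0 || o.max_backup_ratio > 1.0) return false;
    if (o.window_size_seconds < 1 || o.window_size_seconds > 3600) return false;
    if (o.update_interval_seconds < 1) return false;
    return true;
}

// Returns true for the combination that the factory accepts but logs a
// warning about: a refresh interval longer than the window itself.
bool WarnsOnSlowRefresh(const SketchOptions& o) {
    return o.update_interval_seconds > o.window_size_seconds;
}

// Usage: the documented defaults pass, an out-of-range ratio does not.
inline void ValidationExample() {
    SketchOptions opts;
    assert(IsValid(opts));
    opts.max_backup_ratio = 1.5;
    assert(!IsValid(opts));
}
```

Checking up front lets the caller report which field is wrong instead of only seeing a `NULL` policy.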

### Using a custom BackupRequestPolicy

For full control, implement the `BackupRequestPolicy` interface and set it on `ChannelOptions.backup_request_policy`:

```c++
#include "brpc/backup_request_policy.h"

class MyBackupPolicy : public brpc::BackupRequestPolicy {
public:
    int32_t GetBackupRequestMs(const brpc::Controller*) const override {
        return 10; // send backup after 10ms
    }
    bool DoBackup(const brpc::Controller*) const override {
        return should_allow_backup(); // your logic here
    }
    void OnRPCEnd(const brpc::Controller*) override {
        // called on every RPC completion; update stats if needed
    }
};

MyBackupPolicy my_policy;
brpc::ChannelOptions options;
options.backup_request_policy = &my_policy; // NOT owned by channel; must outlive channel
channel.Init(..., &options);
```
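
The example leaves `should_allow_backup()` open. One possible shape for that logic is a token budget, sketched below under stated assumptions: `BackupBudget`, `OnCompleted` and `TryAcquire` are hypothetical names that are not part of brpc, and this is a different technique from the built-in sliding-window ratio. Each completed RPC earns one token, each backup spends a fixed number of tokens, and the banked amount is capped to bound bursts. `TryAcquire` is `const` with `mutable` state because `DoBackup` is a `const` method that may run concurrently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of brpc: a lock-free token budget.
class BackupBudget {
public:
    // Each completed RPC earns 1 token, each backup costs `cost_per_backup`
    // tokens, and at most `max_tokens` tokens are banked.
    BackupBudget(int64_t cost_per_backup, int64_t max_tokens)
        : _cost(cost_per_backup), _max(max_tokens), _tokens(0) {
        assert(cost_per_backup > 0 && max_tokens >= cost_per_backup);
    }

    // Intended to be called from OnRPCEnd(): bank one token unless full.
    void OnCompleted() {
        int64_t cur = _tokens.load(std::memory_order_relaxed);
        // On CAS failure `cur` is reloaded, so the cap is re-checked.
        while (cur < _max &&
               !_tokens.compare_exchange_weak(cur, cur + 1,
                                              std::memory_order_relaxed)) {
        }
    }

    // Intended to be called from DoBackup(): spend tokens if enough remain.
    bool TryAcquire() const {
        int64_t cur = _tokens.load(std::memory_order_relaxed);
        while (cur >= _cost) {
            if (_tokens.compare_exchange_weak(cur, cur - _cost,
                                              std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

private:
    const int64_t _cost;
    const int64_t _max;
    mutable std::atomic<int64_t> _tokens;
};
```

With a cost of 10, roughly one backup is allowed per 10 completed RPCs, and the cap limits how many backups can be issued back to back after a quiet period.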
### Implementation notes
- The ratio is computed over a sliding time window using bvar counters. The cached value is refreshed at most once per `update_interval_seconds` using a lock-free CAS election, so the overhead per RPC is very low (two atomic loads in the common path).
- Backup decisions are counted immediately at decision time (before the RPC completes) to provide faster feedback during latency spikes. Total RPCs are counted on completion. This means the ratio may transiently lag during a spike, but this is intentional — the limiter is designed for approximate, best-effort throttling, not exact enforcement.
- Each channel using rate limiting maintains two `bvar::Window` sampler tasks. Keep this in mind in deployments with a very large number of channels.
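
The refresh scheme in the notes above can be sketched without brpc. This is a simplified illustration, assuming a hypothetical `CachedRatioGate` class: plain cumulative `std::atomic` counters stand in for the bvar sliding windows, and the caller passes the current time in microseconds instead of reading a clock. Only the thread that wins the compare-and-swap on the timestamp recomputes the ratio; every other call just reads the two cached atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the limiter; not the brpc implementation.
class CachedRatioGate {
public:
    CachedRatioGate(double max_ratio, int64_t update_interval_us)
        : _max_ratio(max_ratio)
        , _interval_us(update_interval_us)
        , _total(0)
        , _backup(0)
        , _cached_ratio(0.0)
        , _last_update_us(0) {
        assert(max_ratio > 0.0 && max_ratio <= 1.0);
        assert(update_interval_us > 0);
    }

    // Decides whether a backup may be sent at time `now_us`.
    bool ShouldAllow(int64_t now_us) {
        int64_t last_us = _last_update_us.load(std::memory_order_relaxed);
        double ratio = _cached_ratio.load(std::memory_order_relaxed);
        if (now_us - last_us >= _interval_us &&
            _last_update_us.compare_exchange_strong(
                last_us, now_us, std::memory_order_relaxed)) {
            // This caller won the election and refreshes the cached ratio.
            const int64_t total = _total.load(std::memory_order_relaxed);
            const int64_t backup = _backup.load(std::memory_order_relaxed);
            if (total > 0) {
                ratio = static_cast<double>(backup) / total;
            } else {
                // No completions yet: be conservative if backups were issued.
                ratio = backup > 0 ? 1.0 : 0.0;
            }
            _cached_ratio.store(ratio, std::memory_order_relaxed);
        }
        const bool allow = ratio < _max_ratio;
        if (allow) {
            // Count the backup at decision time, before the RPC completes.
            _backup.fetch_add(1, std::memory_order_relaxed);
        }
        return allow;
    }

    // Counts one completed user-level RPC.
    void OnRPCEnd() { _total.fetch_add(1, std::memory_order_relaxed); }

private:
    const double _max_ratio;
    const int64_t _interval_us;
    std::atomic<int64_t> _total;
    std::atomic<int64_t> _backup;
    std::atomic<double> _cached_ratio;
    std::atomic<int64_t> _last_update_us;
};
```

A short walk-through with `max_ratio = 0.3` and a 5 s interval shows the transient lag: after 10 completed RPCs, four backups requested within one interval are all allowed because the cached ratio is still 0; the next refresh computes 4/10 = 0.4 and denies further backups until 10 more completions bring the ratio down to 4/20 = 0.2.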
# When backend servers cannot be placed under a single naming service
[Recommended] Create a SelectiveChannel with backup request enabled that contains two sub channels. Accessing this SelectiveChannel works much like the case above: it visits one sub channel first, and if no response is returned within ChannelOptions.backup_request_ms ms, it visits the other sub channel. If each sub channel corresponds to a cluster, this approach provides mutual backup between the two clusters. An example of SelectiveChannel can be found in [example/selective_echo_c++](https://github.com/apache/brpc/tree/master/example/selective_echo_c++). For the detailed steps, refer to the procedure above.
177 changes: 177 additions & 0 deletions src/brpc/backup_request_policy.cpp
@@ -0,0 +1,177 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "brpc/backup_request_policy.h"

#include "butil/logging.h"
#include "bvar/reducer.h"
#include "bvar/window.h"
#include "butil/atomicops.h"
#include "butil/time.h"

namespace brpc {

// Standalone statistics module for tracking backup/total request ratio
// within a sliding time window. Each instance schedules two bvar::Window
// sampler tasks; keep this in mind for high channel-count deployments.
class BackupRateLimiter {
public:
    BackupRateLimiter(double max_backup_ratio,
                      int window_size_seconds,
                      int update_interval_seconds)
        : _max_backup_ratio(max_backup_ratio)
        , _update_interval_us(update_interval_seconds * 1000000LL)
        , _total_count()
        , _backup_count()
        , _total_window(&_total_count, window_size_seconds)
        , _backup_window(&_backup_count, window_size_seconds)
        , _cached_ratio(0.0)
        , _last_update_us(0) {
    }

    // All atomic operations use relaxed ordering intentionally.
    // This is best-effort rate limiting: a slightly stale ratio is
    // acceptable for approximate throttling. Within a single update interval,
    // the cached ratio is not updated, so bursts up to update_interval_seconds
    // in duration can exceed the configured max_backup_ratio transiently.
    bool ShouldAllow() const {
        const int64_t now_us = butil::cpuwide_time_us();
        int64_t last_us = _last_update_us.load(butil::memory_order_relaxed);
        double ratio = _cached_ratio.load(butil::memory_order_relaxed);

        if (now_us - last_us >= _update_interval_us) {
            if (_last_update_us.compare_exchange_strong(
                    last_us, now_us, butil::memory_order_relaxed)) {
                int64_t total = _total_window.get_value();
                int64_t backup = _backup_window.get_value();
                // Fall back to cumulative counts when the window has no
                // sampled data yet (cold-start within the first few seconds).
                if (total <= 0) {
                    total = _total_count.get_value();
                    backup = _backup_count.get_value();
                }
                if (total > 0) {
                    ratio = static_cast<double>(backup) / total;
                } else if (backup > 0) {
                    // Backups issued but no completions in window yet (latency spike).
                    // Be conservative to prevent backup storms.
                    ratio = 1.0;
                } else {
                    // True cold-start: no traffic yet. Allow freely.
                    ratio = 0.0;
                }
                _cached_ratio.store(ratio, butil::memory_order_relaxed);
            }
        }

        bool allow = ratio < _max_backup_ratio;
        if (allow) {
            // Count backup decisions immediately for faster feedback
            // during latency spikes (before RPCs complete).
            _backup_count << 1;
        }
        return allow;
    }

    void OnRPCEnd(const Controller* /*controller*/) {
        // Count each completed user-level RPC (called once per RPC, not per leg).
        // Backup decisions are counted in ShouldAllow() at decision time for
        // faster feedback. As a result, the effective suppression threshold is
        // (backup_count / total_count), where total_count is the number of
        // user RPCs that have completed.
        _total_count << 1;
    }

private:
    double _max_backup_ratio;
    int64_t _update_interval_us;

    bvar::Adder<int64_t> _total_count;
    mutable bvar::Adder<int64_t> _backup_count;
    bvar::Window<bvar::Adder<int64_t>> _total_window;
    bvar::Window<bvar::Adder<int64_t>> _backup_window;

    mutable butil::atomic<double> _cached_ratio;
    mutable butil::atomic<int64_t> _last_update_us;
};

// Internal BackupRequestPolicy that composes a BackupRateLimiter
// for ratio-based suppression.
class RateLimitedBackupPolicy : public BackupRequestPolicy {
public:
    RateLimitedBackupPolicy(int32_t backup_request_ms,
                            double max_backup_ratio,
                            int window_size_seconds,
                            int update_interval_seconds)
        : _backup_request_ms(backup_request_ms)
        , _rate_limiter(max_backup_ratio, window_size_seconds,
                        update_interval_seconds) {
    }

    int32_t GetBackupRequestMs(const Controller* /*controller*/) const override {
        return _backup_request_ms;
    }

    bool DoBackup(const Controller* /*controller*/) const override {
        return _rate_limiter.ShouldAllow();
    }

    void OnRPCEnd(const Controller* controller) override {
        _rate_limiter.OnRPCEnd(controller);
    }

private:
    int32_t _backup_request_ms;
    BackupRateLimiter _rate_limiter;
};

BackupRequestPolicy* CreateRateLimitedBackupPolicy(
    const RateLimitedBackupPolicyOptions& options) {
    if (options.backup_request_ms < -1) {
        LOG(ERROR) << "Invalid backup_request_ms=" << options.backup_request_ms
                   << ", must be >= -1 (-1 means inherit from ChannelOptions)";
        return NULL;
    }
    if (options.max_backup_ratio <= 0 || options.max_backup_ratio > 1.0) {
        LOG(ERROR) << "Invalid max_backup_ratio=" << options.max_backup_ratio
                   << ", must be in (0, 1]";
        return NULL;
    }
    if (options.window_size_seconds < 1 || options.window_size_seconds > 3600) {
        LOG(ERROR) << "Invalid window_size_seconds=" << options.window_size_seconds
                   << ", must be in [1, 3600]";
        return NULL;
    }
    if (options.update_interval_seconds < 1) {
        LOG(ERROR) << "Invalid update_interval_seconds="
                   << options.update_interval_seconds << ", must be >= 1";
        return NULL;
    }
    if (options.update_interval_seconds > options.window_size_seconds) {
        LOG(WARNING) << "update_interval_seconds=" << options.update_interval_seconds
                     << " exceeds window_size_seconds=" << options.window_size_seconds
                     << "; the ratio window will rarely refresh within its own period";
    }
    // Plain new (without std::nothrow): brpc follows the project-wide convention
    // of letting OOM throw/abort rather than returning NULL. NULL return from
    // this factory already signals invalid parameters, not allocation failure.
    return new RateLimitedBackupPolicy(
        options.backup_request_ms, options.max_backup_ratio,
        options.window_size_seconds, options.update_interval_seconds);
}

} // namespace brpc