LAC: fallback to legacy group when pd not support keyspace#10929
LAC: fallback to legacy group when pd not support keyspace#10929yongman wants to merge 5 commits into
Conversation
Signed-off-by: yongman <yming0221@gmail.com>
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughLocalAdmissionController now supports compatibility lookup for keyspace-less resource groups, reserved-default fallback behavior, keyspace-normalized GAC handling, multi-context etcd watch management, and updated tests for these flows. ChangesLAC Old-PD Compatibility and Watch Hardening
Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant Client
participant LocalAdmissionController
participant PDClient
participant EtcdWatch
Client->>LocalAdmissionController: warmupResourceGroupInfoCache(keyspace, name)
LocalAdmissionController->>LocalAdmissionController: check exact cache hit
alt cache miss
LocalAdmissionController->>PDClient: getResourceGroup(keyspace, name)
alt PD error
LocalAdmissionController->>PDClient: legacy getResourceGroup(name)
alt legacy succeeds
LocalAdmissionController->>LocalAdmissionController: insert legacy or reserved-default group
else legacy fails
LocalAdmissionController->>LocalAdmissionController: return error
end
else PD success
LocalAdmissionController->>LocalAdmissionController: normalize and insert group
end
end
EtcdWatch->>LocalAdmissionController: PUT event
LocalAdmissionController->>LocalAdmissionController: normalize keyspace id
LocalAdmissionController->>LocalAdmissionController: promote compat fallback to exact group
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: yongman <yming0221@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp (2)
27-33: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAlign the new test variables with TiFlash camelCase naming.
This file introduces several snake_case variables in new C++ code (
get_resource_group,acquire_token_buckets,keyspace_id,group_request, etc.). Renaming them now keeps the tests consistent with the rest of the codebase.Suggested rename pattern
- std::function<resource_manager::GetResourceGroupResponse(const resource_manager::GetResourceGroupRequest &)> - get_resource_group_, - std::function<resource_manager::TokenBucketsResponse(const resource_manager::TokenBucketsRequest &)> - acquire_token_buckets_ = {}) - : get_resource_group(std::move(get_resource_group_)) - , acquire_token_buckets(std::move(acquire_token_buckets_)) + std::function<resource_manager::GetResourceGroupResponse(const resource_manager::GetResourceGroupRequest &)> + getResourceGroupFn, + std::function<resource_manager::TokenBucketsResponse(const resource_manager::TokenBucketsRequest &)> + acquireTokenBucketsFn = {}) + : getResourceGroupCb(std::move(getResourceGroupFn)) + , acquireTokenBucketsCb(std::move(acquireTokenBucketsFn)) @@ - return get_resource_group(req); + return getResourceGroupCb(req); @@ - if (acquire_token_buckets) - return acquire_token_buckets(req); + if (acquireTokenBucketsCb) + return acquireTokenBucketsCb(req); @@ - get_resource_group; + getResourceGroupCb; @@ - acquire_token_buckets; + acquireTokenBucketsCb;As per coding guidelines,
**/*.{cpp,h,hpp}: Method and variable names should usecamelCase.Also applies to: 49-53, 59-223
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp` around lines 27 - 33, The new test code in TestPDClient and the related gtest helpers uses snake_case names, which conflicts with TiFlash camelCase conventions. Rename the affected variables and any associated helper locals in gtest_local_admission_controller.cpp, such as the fields in TestPDClient and the request/keyspace-related test variables, to camelCase so the new tests match the surrounding codebase style and naming guidelines.Source: Coding guidelines
155-223: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winExercise the keyspace-less matching path with duplicate group names.
Line 155 only drives a single
"default"request, so this still passes even ifhandleTokenBucketsRespignores the new request-order fallback entirely. Add at least one more request for"default"under a different keyspace and assert the handled pairs come back in request order.Minimal way to harden this test
- constexpr KeyspaceID keyspace_id = 13; + constexpr KeyspaceID keyspaceId = 13; + constexpr KeyspaceID otherKeyspaceId = 17; @@ - EXPECT_EQ(req.requests_size(), 1); - if (req.requests_size() != 1) + EXPECT_EQ(req.requests_size(), 2); + if (req.requests_size() != 2) return resp; - EXPECT_EQ(req.requests(0).resource_group_name(), "default"); - EXPECT_TRUE(req.requests(0).has_keyspace_id()); - if (!req.requests(0).has_keyspace_id()) - return resp; - EXPECT_EQ(req.requests(0).keyspace_id().value(), keyspace_id); + EXPECT_EQ(req.requests(0).resource_group_name(), "default"); + EXPECT_EQ(req.requests(1).resource_group_name(), "default"); + EXPECT_EQ(req.requests(0).keyspace_id().value(), keyspaceId); + EXPECT_EQ(req.requests(1).keyspace_id().value(), otherKeyspaceId); - auto * one_resp = resp.add_responses(); - one_resp->set_resource_group_name("default"); - auto * granted = one_resp->add_granted_r_u_tokens(); - granted->set_type(resource_manager::RequestUnitType::RU); - granted->set_trickle_time_ms(0); - granted->mutable_granted_tokens()->set_tokens(512); - granted->mutable_granted_tokens()->mutable_settings()->set_burst_limit(2048); + for (size_t i = 0; i < 2; ++i) + { + auto * oneResp = resp.add_responses(); + oneResp->set_resource_group_name("default"); + auto * granted = oneResp->add_granted_r_u_tokens(); + granted->set_type(resource_manager::RequestUnitType::RU); + granted->set_trickle_time_ms(0); + granted->mutable_granted_tokens()->set_tokens(512); + granted->mutable_granted_tokens()->mutable_settings()->set_burst_limit(2048); + } return resp; }); @@ - auto * group_request = req.add_requests(); - group_request->set_resource_group_name("default"); - group_request->mutable_keyspace_id()->set_value(keyspace_id); - auto * ru_items = group_request->mutable_ru_items(); - auto * request_ru = ru_items->add_request_r_u(); - request_ru->set_type(resource_manager::RequestUnitType::RU); - request_ru->set_value(512); - - const auto req_rg_names = std::vector<std::pair<KeyspaceID, std::string>>{{keyspace_id, "default"}}; - auto handled = lac.handleTokenBucketsResp(cluster.pd_client->acquireTokenBuckets(req), req_rg_names); - ASSERT_EQ(handled.size(), 1); - EXPECT_EQ(handled[0], std::make_pair(keyspace_id, std::string("default"))); + for (const auto currentKeyspaceId : {keyspaceId, otherKeyspaceId}) + { + auto * groupRequest = req.add_requests(); + groupRequest->set_resource_group_name("default"); + groupRequest->mutable_keyspace_id()->set_value(currentKeyspaceId); + auto * ruItems = groupRequest->mutable_ru_items(); + auto * requestRu = ruItems->add_request_r_u(); + requestRu->set_type(resource_manager::RequestUnitType::RU); + requestRu->set_value(512); + } + + const auto reqRgNames = std::vector<std::pair<KeyspaceID, std::string>>{ + {keyspaceId, "default"}, + {otherKeyspaceId, "default"}, + }; + auto handled = lac.handleTokenBucketsResp(cluster.pd_client->acquireTokenBuckets(req), reqRgNames); + ASSERT_EQ(handled.size(), 2); + EXPECT_EQ(handled[0], std::make_pair(keyspaceId, std::string("default"))); + EXPECT_EQ(handled[1], std::make_pair(otherKeyspaceId, std::string("default")));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp` around lines 155 - 223, This test only covers a single "default" token-bucket request, so it can pass even if the request-order fallback in handleTokenBucketsResp is broken. Update TokenBucketResponseWithoutKeyspaceMatchesKeyspaceRequest to add a second "default" request for a different KeyspaceID, then verify the returned handled pairs preserve the original request order. Keep the existing warmupResourceGroupInfoCache, acquireTokenBuckets, and handleTokenBucketsResp flow, but extend the req_rg_names expectations and assertions to cover both requests.dbms/src/Flash/ResourceControl/LocalAdmissionController.h (1)
513-513: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse TiFlash width aliases for the new constant.
uint64_t/int32_tbypass the project aliases; useUInt64/Int32here. As per coding guidelines, "Use explicit width types fromdbms/src/Core/Types.h:UInt8,UInt32,Int64,Float64,String."Proposed change
- static constexpr uint64_t RESERVED_DEFAULT_RESOURCE_GROUP_RU_PER_SEC = std::numeric_limits<int32_t>::max(); + static constexpr UInt64 RESERVED_DEFAULT_RESOURCE_GROUP_RU_PER_SEC = std::numeric_limits<Int32>::max();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbms/src/Flash/ResourceControl/LocalAdmissionController.h` at line 513, The new reservation constant in LocalAdmissionController should use the project’s TiFlash width aliases instead of raw fixed-width types. Update RESERVED_DEFAULT_RESOURCE_GROUP_RU_PER_SEC to use UInt64 and Int32 (from Core/Types.h) so it matches the codebase typing guidelines and keeps the declaration consistent with other aliases in this class.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp`:
- Around line 62-63: `LocalAdmissionController::buildGACRequest` currently
bypasses `enable_gac` only for normal request building, but the final-report
path still emits GAC requests and `stop()` can send them even when GAC is
disabled. Update the `buildGACRequest(true)` flow and the `stop()` send path to
honor `enable_gac` consistently, so fallback groups with `enable_gac=false` do
not report or transmit unsupported GAC data. Use the existing `enable_gac`,
`buildGACRequest`, and `stop` symbols to gate both cached-group reporting and
final request submission.
- Around line 349-367: Track the actual success of the legacy PD lookup in
LocalAdmissionController::warmupResourceGroupInfoCache instead of relying only
on legacy_resp.has_error(), because a thrown getResourceGroup(legacy_req) leaves
legacy_resp default-constructed and can incorrectly pass validation. Add an
explicit success flag around the
cluster->pd_client->getResourceGroup(legacy_req) call, set it only when the
request returns normally, and require that flag before validating or accepting
the legacy response; otherwise fall through to the reserved-default fallback
path.
In `@dbms/src/Flash/ResourceControl/LocalAdmissionController.h`:
- Around line 618-624: The removal path in LocalAdmissionController is
recomputing max RU while the old reserved group is still present, so the stale
entry can still influence the global maximum. In the cleanup logic that checks
iter->second->enable_gac and calls updateMaxRUPerSecAfterDeleteWithoutLock(),
erase the old keyspace_resource_groups entry first, then recompute
max_ru_per_sec using the removed group’s user_ru_per_sec so the replacement
GAC-enabled group is reflected correctly. Use the existing
keyspace_resource_groups, iter, and updateMaxRUPerSecAfterDeleteWithoutLock flow
to keep the change localized.
---
Nitpick comments:
In `@dbms/src/Flash/ResourceControl/LocalAdmissionController.h`:
- Line 513: The new reservation constant in LocalAdmissionController should use
the project’s TiFlash width aliases instead of raw fixed-width types. Update
RESERVED_DEFAULT_RESOURCE_GROUP_RU_PER_SEC to use UInt64 and Int32 (from
Core/Types.h) so it matches the codebase typing guidelines and keeps the
declaration consistent with other aliases in this class.
In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp`:
- Around line 27-33: The new test code in TestPDClient and the related gtest
helpers uses snake_case names, which conflicts with TiFlash camelCase
conventions. Rename the affected variables and any associated helper locals in
gtest_local_admission_controller.cpp, such as the fields in TestPDClient and the
request/keyspace-related test variables, to camelCase so the new tests match the
surrounding codebase style and naming guidelines.
- Around line 155-223: This test only covers a single "default" token-bucket
request, so it can pass even if the request-order fallback in
handleTokenBucketsResp is broken. Update
TokenBucketResponseWithoutKeyspaceMatchesKeyspaceRequest to add a second
"default" request for a different KeyspaceID, then verify the returned handled
pairs preserve the original request order. Keep the existing
warmupResourceGroupInfoCache, acquireTokenBuckets, and handleTokenBucketsResp
flow, but extend the req_rg_names expectations and assertions to cover both
requests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f50ddec2-458c-4f53-ba69-95104e7293f2
📒 Files selected for processing (3)
dbms/src/Flash/ResourceControl/LocalAdmissionController.cppdbms/src/Flash/ResourceControl/LocalAdmissionController.hdbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp
Signed-off-by: yongman <yming0221@gmail.com>
7f0b351 to
cd382a5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp (1)
51-155: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider extracting shared test setup.
The three tests repeat the same pattern: construct
pingcap::kv::Cluster, wrap a lambda inTestPDClient, and instantiateLocalAdmissionControllerwith identical fixed args. A small helper (or aTEST_Ffixture) building the cluster/LAC pair, parameterized by the response callback, would reduce duplication.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp` around lines 51 - 155, The three LocalAdmissionController tests duplicate the same cluster and controller setup, so extract that shared setup into a helper or a TEST_F fixture. Factor out the repeated pingcap::kv::Cluster construction, TestPDClient lambda wiring, and LocalAdmissionController instantiation in gtest_local_admission_controller.cpp, then have KeyspaceWatchUpdatePromotesLegacyFallbackToExactGroup, KeyspaceLookupDoesNotFallbackForNonNotFoundError, and ReservedDefaultGroupStillFallbacksWhenLegacyLookupThrows reuse it with only their response callback differing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp`:
- Around line 1025-1067: The refill callback in
LocalAdmissionController::modifyResourceGroup is being invoked after releasing
mu, which can race with a concurrent clear/destroy of the RCQ. Move the
refill_token_callback invocation so it happens inside the same locked scope that
checks need_refill, matching the existing locked callback path used elsewhere in
LocalAdmissionController to keep the resource control queue alive during the
call.
---
Nitpick comments:
In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp`:
- Around line 51-155: The three LocalAdmissionController tests duplicate the
same cluster and controller setup, so extract that shared setup into a helper or
a TEST_F fixture. Factor out the repeated pingcap::kv::Cluster construction,
TestPDClient lambda wiring, and LocalAdmissionController instantiation in
gtest_local_admission_controller.cpp, then have
KeyspaceWatchUpdatePromotesLegacyFallbackToExactGroup,
KeyspaceLookupDoesNotFallbackForNonNotFoundError, and
ReservedDefaultGroupStillFallbacksWhenLegacyLookupThrows reuse it with only
their response callback differing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 88ad5ffb-979d-4ff7-8eef-483d44116e44
📒 Files selected for processing (3)
dbms/src/Flash/ResourceControl/LocalAdmissionController.cppdbms/src/Flash/ResourceControl/LocalAdmissionController.hdbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- dbms/src/Flash/ResourceControl/LocalAdmissionController.h
|
/retest |
cd382a5 to
82147c7
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp (1)
55-71: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider extracting the shared response-building lambda.
The success-path
GetResourceGroupResponseconstruction (group name, mode, priority, fill/burst rate) is duplicated almost verbatim betweenKeyspaceWatchUpdatePromotesLegacyFallbackToExactGroupandKeyspaceLookupDoesNotFallbackForNonNotFoundError. A small helper (e.g.,makeSuccessResponse(priority)) would reduce duplication.Also applies to: 110-126
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp` around lines 55 - 71, The success-path GetResourceGroupResponse construction is duplicated in the TestPDClient lambda used by KeyspaceWatchUpdatePromotesLegacyFallbackToExactGroup and KeyspaceLookupDoesNotFallbackForNonNotFoundError. Extract the shared response-building logic into a small helper, such as a local makeSuccessResponse(priority) function or lambda, and have both tests call it so the group name, mode, priority, and RU settings are built in one place. Keep the error branch for the has_keyspace_id() case separate, and reuse the helper in the duplicated test setup code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp`:
- Around line 55-71: The success-path GetResourceGroupResponse construction is
duplicated in the TestPDClient lambda used by
KeyspaceWatchUpdatePromotesLegacyFallbackToExactGroup and
KeyspaceLookupDoesNotFallbackForNonNotFoundError. Extract the shared
response-building logic into a small helper, such as a local
makeSuccessResponse(priority) function or lambda, and have both tests call it so
the group name, mode, priority, and RU settings are built in one place. Keep the
error branch for the has_keyspace_id() case separate, and reuse the helper in
the duplicated test setup code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 55ae1bb6-35a3-486d-b118-0c83fb95ae70
📒 Files selected for processing (3)
dbms/src/Flash/ResourceControl/LocalAdmissionController.cppdbms/src/Flash/ResourceControl/LocalAdmissionController.hdbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- dbms/src/Flash/ResourceControl/LocalAdmissionController.h
- dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp
|
@yongman: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #10939
Problem Summary:
There is compatible issue when the tiflash runs with pd without keyspace resource group.
Ref: #10218
LocalAdmissionControllerto keep working when TiFlash runs with keyspace support enabled but PD still serves only legacy resource groups. It does that by:GetResourceGrouplookup to the legacy nullspace lookup,"default"group when PD does not persist one,What is changed and how it works?
ResourceGroupnow has anenable_gacflag. Reserved local defaults are created withenable_gac=false, so normal token acquisition/reporting skips them.buildRequestInfoIfNecessary()returns no request when that flag is disabled. (dbms/src/Flash/ResourceControl/LocalAdmissionController.h:80-90,dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:77-117)LocalAdmissionControllernow resolves lookups throughfindResourceGroupWithCompatWithoutLock(): it first checks the exact(keyspace_id, name)cache entry, then falls back to(NullspaceID, name)for keyspace requests. All normal runtime callers (getPriority,consumeResource,estWaitDuraMS) now use that compatibility lookup. (dbms/src/Flash/ResourceControl/LocalAdmissionController.h:566-599)warmupResourceGroupInfoCache()changed from a single keyspace-scoped PD lookup into a compatibility sequence:GetResourceGroupwithkeyspace_id,resp.has_error(), retry withoutkeyspace_id,"default", synthesize a reserved local default group.Returned protobufs are normalized so missing
keyspace_idbecomesNullspaceID. (dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:334-420)keyspace_identirely for nullspace groups, which keeps fallback traffic compatible with legacy PD semantics. Response handling and metrics also use the cached group’s actual keyspace instead of the original request keyspace. (dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:503-587,dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:694-840)with_keyspace=true, the constructor now starts two watch loops, one forresource_group/keyspace/settingsand one forresource_group/settings. To support two concurrent watchers, the code replaced the single storedgrpc::ClientContextwith anactive_watch_gac_grpc_contextsset sostop()can cancel every active watch stream. (dbms/src/Flash/ResourceControl/LocalAdmissionController.h:367-388,dbms/src/Flash/ResourceControl/LocalAdmissionController.h:724-728,dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:843-936,dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:1059-1099)(
dbms/src/Flash/ResourceControl/tests/gtest_local_admission_controller.cpp:59-238)Check List
Tests
Side effects
Documentation
Release note
Manual Test
Summary by CodeRabbit