-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix stats performance #138126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix stats performance #138126
Conversation
… is very large (elastic#130857)" (elastic#137973) (elastic#137984) This reverts commit 391de08.
… to avoid the N^2 performance in TransportIndicesStatsAction if the user did not ask for query cache stats (although it is still there if the user asks for query cache stats). It also avoid O(N) performance in TransportClusterStatsAction if the user did not ask for query cache stats
…performance when a user asks for query cache stats
|
Hi @masseyke, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes an N^2 performance issue in stats collection for clusters with large numbers of shards. The key improvements are: (1) making IndicesQueryCache::getStats accept a Supplier<Long> to defer expensive calculations until needed, and (2) using the NodeContext feature to share cache totals computation across all shard operations on a node.
- Modified
IndicesQueryCacheto compute shared RAM sizes more efficiently by pre-calculating totals once per node - Updated
TransportIndicesStatsActionandTransportClusterStatsActionto use cached suppliers for cache totals - Changed all callers of
getStats()to provide the precomputed shared RAM supplier
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java |
Added methods to compute cache totals once and reuse them; changed getStats() to accept a Supplier<Long> for lazy evaluation; fixed typos in comments |
server/src/main/java/org/elasticsearch/action/admin/indices/stats/TransportIndicesStatsAction.java |
Implemented NodeContext pattern using CachedSupplier to share cache totals across shard operations |
server/src/main/java/org/elasticsearch/action/admin/cluster/stats/TransportClusterStatsAction.java |
Applied similar CachedSupplier pattern for cache totals computation |
server/src/main/java/org/elasticsearch/indices/IndicesService.java |
Modified statsByShard() to precompute shared RAM once and pass it to each shard |
server/src/main/java/org/elasticsearch/action/admin/indices/stats/CommonStats.java |
Updated getShardLevelStats() signature to accept Supplier<Long> for precomputed shared RAM |
server/src/test/java/org/elasticsearch/indices/IndicesQueryCacheTests.java |
Updated all test calls to new getStats() API; added test helper methods; contains duplicate line that should be removed |
server/src/test/java/org/elasticsearch/indices/IndicesServiceTests.java |
Updated mock setup to use argument matchers for new signature |
server/src/test/java/org/elasticsearch/indices/IndicesServiceCloseTests.java |
Updated test calls to pass () -> 0L supplier to getStats() |
server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java |
Updated test call to new getShardLevelStats() signature |
server/src/test/java/org/elasticsearch/action/admin/cluster/stats/VersionStatsTests.java |
Updated test call to new getShardLevelStats() signature |
docs/changelog/138126.yaml |
Added changelog entry for this PR |
docs/changelog/130857.yaml |
Added changelog entry for related PR #130857 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
server/src/test/java/org/elasticsearch/indices/IndicesQueryCacheTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
To test this, I stood up a single-node cluster and loaded 30,000 shards with cache-test.sh. I also modified IndicesQueryCache.INDICES_QUERIES_CACHE_ALL_SEGMENTS_SETTING to always be true (for the sake of simplicity and not having to remember to set that on my cluster). After applying the change in this PR, restarting the node, waiting for all 30,000+ shards to become allocated, and running the warm-up query from cache-test.sh, here is what it looked like: Looking at the flamegraph for CommonStats.getShardLevelStats() during that query, the purple part on the right is all that's left of IndicesQueryCache.getStats(): Performing only a request for query_cache stats, the response time is faster: And a request for something other than query_cache stats is also fast since we don't go through the query_cache path at all (one of the problems with the original fix): |
|
Pinging @elastic/es-data-management (Team:Data Management) |
seanzatzdev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
docs/changelog/130857.yaml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what this second changelog entry is for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Adding one last data point -- I now have 82,632 shards on a single node, and
|
|
Note: This hasn't been backported yet, but will be in the near future. |


This fixes the N^2 performance problem described in #97222. In addition to restoring the previous partial fix (#130857), it does the following:
Closes #97222