Fix stats performance #138126

masseyke · 2025-11-14T23:58:24Z

This fixes the N^2 performance problem described in #97222. In addition to restoring the previous partial fix (#130857), it does the following:

IndicesQueryCache::getStats now accepts a Supplier so that IndicesQueryCache::getSharedRamSizeForAllShards is called only when it is actually needed. This fixes an N^2 performance problem that #130857 (Improving statsByShard performance when the number of shards is very large) introduced. Before #130857, a call to TransportIndicesStatsAction that did not request query cache stats never entered the N^2 loop; the loop was entered only when query cache stats were requested. After #130857, every call paid the N^2 cost. This is a serious problem for clusters with large numbers of shards, since this action is called very frequently (including every 30 seconds by a background task).
It fixes the N^2 performance in TransportIndicesStatsAction by sharing state across all shardOperation calls on a single node, using the new NodeContext feature from #138057 (Adding NodeContext to TransportBroadcastByNodeAction).
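
A minimal sketch of the two ideas above, assuming nothing about the real Elasticsearch classes: the `Memoized` class, `shardStats`, and `expensiveScans` are hypothetical names invented for illustration, and `Memoized` only stands in for the `CachedSupplier` mentioned later in this thread. It shows that a lazily evaluated, shared supplier runs the expensive scan zero times when query cache stats are not requested, and once per node-level request when they are.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class LazySharedRamSketch {

    /** Hypothetical memoizing supplier: evaluates the delegate at most once. */
    static final class Memoized<T> implements Supplier<T> {
        private final Supplier<T> delegate;
        private T value;
        private boolean computed;

        Memoized(Supplier<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized T get() {
            if (!computed) {
                value = delegate.get();
                computed = true;
            }
            return value;
        }
    }

    /** Hypothetical per-shard call: touches the supplier only if query cache stats were requested. */
    static long shardStats(boolean queryCacheRequested, Supplier<Long> sharedRam) {
        return queryCacheRequested ? sharedRam.get() : 0L;
    }

    /** Simulates one node-level request over `shards` shards and returns how often the expensive scan ran. */
    static int expensiveScans(int shards, boolean queryCacheRequested) {
        AtomicInteger scans = new AtomicInteger();
        Supplier<Long> sharedRam = new Memoized<>(() -> {
            scans.incrementAndGet(); // stands in for the O(N) walk over all shards
            return 1024L;            // made-up byte count
        });
        for (int i = 0; i < shards; i++) {
            shardStats(queryCacheRequested, sharedRam);
        }
        return scans.get();
    }

    public static void main(String[] args) {
        System.out.println(expensiveScans(30_000, false)); // prints 0: the scan never runs
        System.out.println(expensiveScans(30_000, true));  // prints 1: the scan runs once and is reused
    }
}
```

Without the shared memoized supplier, each of the N per-shard calls would repeat the O(N) scan, which is where the N^2 behavior comes from.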

Closes #97222

… is very large (elastic#130857)" (elastic#137973) (elastic#137984) This reverts commit 391de08.

… to avoid the N^2 performance in TransportIndicesStatsAction if the user did not ask for query cache stats (although it is still there if the user asks for query cache stats). It also avoids O(N) performance in TransportClusterStatsAction if the user did not ask for query cache stats.

…performance when a user asks for query cache stats

elasticsearchmachine · 2025-11-14T23:59:02Z

Hi @masseyke, I've created a changelog YAML for you.

Copilot

Pull Request Overview

This PR fixes an N^2 performance issue in stats collection for clusters with large numbers of shards. The key improvements are: (1) making IndicesQueryCache::getStats accept a Supplier<Long> to defer expensive calculations until needed, and (2) using the NodeContext feature to share cache totals computation across all shard operations on a node.

Modified IndicesQueryCache to compute shared RAM sizes more efficiently by pre-calculating totals once per node
Updated TransportIndicesStatsAction and TransportClusterStatsAction to use cached suppliers for cache totals
Changed all callers of getStats() to provide the precomputed shared RAM supplier
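
To make the "pre-calculating totals once per node" point concrete, here is a rough sketch. It assumes that the shared RAM is apportioned to each shard in proportion to its number of cache entries; this thread does not spell out the actual formula, so treat the weighting as an assumption. `CacheTotals`, `computeTotals`, and `shareFor` are hypothetical names, not the methods added to `IndicesQueryCache`.

```java
import java.util.List;

class SharedRamSplitSketch {

    /** Hypothetical holder for totals gathered in a single pass over all shards. */
    static final class CacheTotals {
        final long totalCacheEntries;
        final int shardCount;

        CacheTotals(long totalCacheEntries, int shardCount) {
            this.totalCacheEntries = totalCacheEntries;
            this.shardCount = shardCount;
        }
    }

    /** One O(N) pass, done once per node-level request. */
    static CacheTotals computeTotals(List<Long> entriesPerShard) {
        long total = 0;
        for (long entries : entriesPerShard) {
            total += entries;
        }
        return new CacheTotals(total, entriesPerShard.size());
    }

    /** O(1) per shard: weight by cache entries, or split evenly when no shard has any entries. */
    static long shareFor(long shardEntries, CacheTotals totals, long sharedRamBytes) {
        if (totals.shardCount == 0) {
            return 0L;
        }
        double weight = totals.totalCacheEntries == 0
            ? 1.0 / totals.shardCount
            : (double) shardEntries / totals.totalCacheEntries;
        return Math.round(weight * sharedRamBytes);
    }

    public static void main(String[] args) {
        List<Long> entries = List.of(10L, 30L, 60L);
        CacheTotals totals = computeTotals(entries);
        for (long e : entries) {
            System.out.println(shareFor(e, totals, 1000L)); // prints 100, 300, 600
        }
    }
}
```

Recomputing the totals inside every per-shard call would make the whole request quadratic in the number of shards; computing them once keeps it linear.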

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java` | Added methods to compute cache totals once and reuse them; changed `getStats()` to accept a `Supplier<Long>` for lazy evaluation; fixed typos in comments |
| `server/src/main/java/org/elasticsearch/action/admin/indices/stats/TransportIndicesStatsAction.java` | Implemented NodeContext pattern using `CachedSupplier` to share cache totals across shard operations |
| `server/src/main/java/org/elasticsearch/action/admin/cluster/stats/TransportClusterStatsAction.java` | Applied similar `CachedSupplier` pattern for cache totals computation |
| `server/src/main/java/org/elasticsearch/indices/IndicesService.java` | Modified `statsByShard()` to precompute shared RAM once and pass it to each shard |
| `server/src/main/java/org/elasticsearch/action/admin/indices/stats/CommonStats.java` | Updated `getShardLevelStats()` signature to accept `Supplier<Long>` for precomputed shared RAM |
| `server/src/test/java/org/elasticsearch/indices/IndicesQueryCacheTests.java` | Updated all test calls to new `getStats()` API; added test helper methods; contains duplicate line that should be removed |
| `server/src/test/java/org/elasticsearch/indices/IndicesServiceTests.java` | Updated mock setup to use argument matchers for new signature |
| `server/src/test/java/org/elasticsearch/indices/IndicesServiceCloseTests.java` | Updated test calls to pass `() -> 0L` supplier to `getStats()` |
| `server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java` | Updated test call to new `getShardLevelStats()` signature |
| `server/src/test/java/org/elasticsearch/action/admin/cluster/stats/VersionStatsTests.java` | Updated test call to new `getShardLevelStats()` signature |
| `docs/changelog/138126.yaml` | Added changelog entry for this PR |
| `docs/changelog/130857.yaml` | Added changelog entry for related PR #130857 |

masseyke · 2025-11-18T15:54:42Z

To test this, I stood up a single-node cluster and loaded 30,000 shards with cache-test.sh. I also modified IndicesQueryCache.INDICES_QUERIES_CACHE_ALL_SEGMENTS_SETTING to always be true (for the sake of simplicity and not having to remember to set that on my cluster).
After loading 30k+ shards, requests to _stats took more than 20s:

ubuntu@ip-172-31-14-192:~$ time curl localhost:9200/_stats > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 32.6M    0 32.6M    0     0  1643k      0 --:--:--  0:00:20 --:--:-- 10.4M

real	0m20.348s
user	0m0.010s
sys	0m0.016s
ubuntu@ip-172-31-14-192:~$ time curl localhost:9200/_stats > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 32.6M    0 32.6M    0     0  1595k      0 --:--:--  0:00:20 --:--:-- 8924k

real	0m20.975s
user	0m0.007s
sys	0m0.014s

After applying the change in this PR, restarting the node, waiting for all 30,000+ shards to become allocated, and running the warm-up query from cache-test.sh, here is what it looked like:

ubuntu@ip-172-31-14-192:~/async-profiler-4.2-linux-x64$ time curl localhost:9200/_stats > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33.7M    0 33.7M    0     0  36.9M      0 --:--:-- --:--:-- --:--:-- 36.9M

real	0m0.922s
user	0m0.009s
sys	0m0.010s
ubuntu@ip-172-31-14-192:~/async-profiler-4.2-linux-x64$ time curl localhost:9200/_stats > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33.7M    0 33.7M    0     0  37.2M      0 --:--:-- --:--:-- --:--:-- 37.2M

real	0m0.914s
user	0m0.012s
sys	0m0.006s

Looking at the flamegraph for CommonStats.getShardLevelStats() during that query, the purple part on the right is all that's left of IndicesQueryCache.getStats():

Digging a little further into IndicesQueryCache.getStats() alone, most of its time is spent, as expected, initializing the CachedSupplier that has been added to TransportIndicesStatsAction:

A request for only query_cache stats is faster still:

ubuntu@ip-172-31-14-192:~/async-profiler-4.2-linux-x64$ time curl localhost:9200/_stats/query_cache > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2484k    0 2484k    0     0  8996k      0 --:--:-- --:--:-- --:--:-- 9002k

real	0m0.285s
user	0m0.006s
sys	0m0.005s

And a request for something other than query_cache stats is also fast since we don't go through the query_cache path at all (one of the problems with the original fix):

ubuntu@ip-172-31-14-192:~/async-profiler-4.2-linux-x64$ time curl localhost:9200/_stats/indexing > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4712k    0 4712k    0     0  16.1M      0 --:--:-- --:--:-- --:--:-- 16.2M

real	0m0.292s
user	0m0.004s
sys	0m0.005s

elasticsearchmachine · 2025-11-18T15:55:54Z

Pinging @elastic/es-data-management (Team:Data Management)

seanzatzdev

LGTM, thanks

seanzatzdev · 2025-11-17T21:25:52Z

docs/changelog/130857.yaml

Curious what this second changelog entry is for

Ohh, it's because I reverted #130857 and then un-reverted it, so we got the changelog for both #130857 and this PR. I'll remove the one for #130857.

masseyke · 2025-11-19T14:03:47Z

Adding one last data point -- I now have 82,632 shards on a single node, and GET _stats is taking about 2.5s. Looking at it in the profiler, the runtime proportions of different pieces seem to be staying the same -- we just have a lot more shards. Within that 2.5s, the slowest pieces (in descending order) are:

IndexShard::docStats
IndexShard::segmentStats
IndexShard::translogStats
IndexShard::storeStats
IndexShard::indexingStats
IndicesQueryCache::getStats (the subject of this PR)
IndexShard::denseVectorStats

masseyke · 2025-11-26T14:50:26Z

Note: This hasn't been backported yet, but will be in the near future.

masseyke added 3 commits November 14, 2025 17:10

Reapply "Improving statsByShard performance when the number of shards…

a3cf3f2

… is very large (elastic#130857)" (elastic#137973) (elastic#137984) This reverts commit 391de08.

Using the NodeContext in TransportBroadcastByNodeAction to avoid N^2 …

f516df1

…performance when a user asks for query cache stats

masseyke added >bug :Data Management/Stats Statistics tracking and retrieval APIs v9.3.0 labels Nov 14, 2025

masseyke added 3 commits November 14, 2025 17:59

Update docs/changelog/138126.yaml

6ade46d

Removing unused method and fixing tests

c1903f9

Merge branch 'main' into fix-stats-performance

3ec4474

masseyke requested a review from Copilot November 17, 2025 19:47

Copilot started reviewing on behalf of masseyke November 17, 2025 19:48

Copilot finished reviewing on behalf of masseyke November 17, 2025 19:49

Copilot AI reviewed Nov 17, 2025

Apply suggestions from code review

ec707e5

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

masseyke marked this pull request as ready for review November 18, 2025 15:55

elasticsearchmachine added the Team:Data Management Meta label for data/management team label Nov 18, 2025

Merge branch 'main' into fix-stats-performance

98b763d

seanzatzdev approved these changes Nov 18, 2025

masseyke added 2 commits November 18, 2025 12:06

Cleaning up changelog

df90b96

Merge branch 'main' into fix-stats-performance

fc029a5

joegallo approved these changes Nov 18, 2025

masseyke merged commit d74334e into elastic:main Nov 18, 2025
34 checks passed

masseyke deleted the fix-stats-performance branch November 18, 2025 19:26

masseyke added v9.1.9 v8.19.9 v9.2.3 labels Nov 26, 2025

masseyke added the backport pending label Nov 26, 2025
