
server: make HikariCP leak detection configurable #13407

Open
andrijapanicsb wants to merge 1 commit into apache:4.22 from andrijapanicsb:hikari-leak-detection-4.22.1.0

@andrijapanicsb

Contributor

Description

This PR makes the HikariCP leak-detection threshold and JMX MBean registration configurable per database pool via db.properties, instead of relying on HikariCP defaults that cannot be changed without a code change.

CloudStack already maps a subset of db.properties values onto HikariConfig in framework/db/src/main/java/com/cloud/utils/db/TransactionLegacy.java (e.g. maxActive, maxIdle, maxWait, minIdleConnections, connectionTimeout, keepAliveTime). This PR adds two more, following the exact same parsing/threading pattern:

| Property | Maps to | Default |
| --- | --- | --- |
| db.<pool>.leakDetectionThreshold | HikariConfig#setLeakDetectionThreshold(long) | 0 (disabled) |
| db.<pool>.registerMbeans | HikariConfig#setRegisterMbeans(boolean) | false (disabled) |

Supported for all three pools that use the shared datasource factory: cloud, usage, simulator.

Behaviour:

  • leakDetectionThreshold absent or 0 → leak detection disabled (unchanged default behaviour). Only applied when set to a value > 0. (HikariCP itself ignores values below 2000 ms with a warning.)
  • registerMbeans absent or false → MBeans disabled (unchanged default). true → Hikari JMX MBeans registered for live pool-counter observation.
  • Settings are applied only on the HikariCP path; the DBCP datasource path is unaffected.
  • A debug-level log line reports the effective values per pool (no credentials or sensitive data).
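The resolution rules above can be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: the class name HikariDebugSettings and the method resolve are hypothetical, and plain Long.parseLong / Boolean.parseBoolean stand in for CloudStack's parseNumber(...) helper. In the actual patch the resolved values would be passed on to HikariConfig#setLeakDetectionThreshold(long) and HikariConfig#setRegisterMbeans(boolean).

```java
import java.util.Properties;

// Hypothetical helper: resolves the two new db.properties keys for one pool.
public class HikariDebugSettings {
    final long leakDetectionThresholdMs; // 0 means "leave leak detection disabled"
    final boolean registerMbeans;

    HikariDebugSettings(long leakDetectionThresholdMs, boolean registerMbeans) {
        this.leakDetectionThresholdMs = leakDetectionThresholdMs;
        this.registerMbeans = registerMbeans;
    }

    // pool is one of "cloud", "usage" or "simulator".
    static HikariDebugSettings resolve(Properties props, String pool) {
        String rawThreshold = props.getProperty("db." + pool + ".leakDetectionThreshold");
        String rawMbeans = props.getProperty("db." + pool + ".registerMbeans");

        // Absent, blank, zero or negative all keep leak detection disabled.
        long threshold = 0L;
        if (rawThreshold != null && !rawThreshold.trim().isEmpty()) {
            long parsed = Long.parseLong(rawThreshold.trim());
            if (parsed > 0) {
                threshold = parsed;
            }
        }

        // Absent or anything other than "true" keeps MBeans disabled.
        boolean mbeans = rawMbeans != null && Boolean.parseBoolean(rawMbeans.trim());
        return new HikariDebugSettings(threshold, mbeans);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("db.cloud.leakDetectionThreshold", "60000");
        props.setProperty("db.cloud.registerMbeans", "true");

        HikariDebugSettings cloud = resolve(props, "cloud");
        HikariDebugSettings usage = resolve(props, "usage"); // nothing set for this pool

        System.out.println("cloud: threshold=" + cloud.leakDetectionThresholdMs
                + " ms, registerMbeans=" + cloud.registerMbeans);
        System.out.println("usage: threshold=" + usage.leakDetectionThresholdMs
                + " ms, registerMbeans=" + usage.registerMbeans);
    }
}
```

Note that the sketch does not model HikariCP's own validation: a value between 1 and 1999 would still be resolved here and only rejected later by HikariCP itself.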

Motivation / context: in production we saw the management server become unstable — and eventually crash — on clusters exercising Host-HA. Watching MySQL with SHOW PROCESSLIST during the incident showed the number of sessions owned by the cloud DB user climbing steadily over a couple of hours, all of them in the Sleep state, until the HikariCP pool (db.cloud.maxActive, default 250) was exhausted and the server could no longer borrow a connection. That signature — monotonically growing, never-reaped, all idle, all owned by the cloud user — is a classic DB connection leak in a periodic code path (suspected Host-HA host checks) that borrows a pooled connection and never returns it.

The problem is that these symptoms tell you that connections are leaking, but not where. HikariCP already has the exact tool for that, leakDetectionThreshold, but CloudStack hard-wires it off with no way to turn it on. This PR exposes it (and registerMbeans) through db.properties so an operator can enable leak detection on a production server without a code change; HikariCP then logs an "Apparent connection leak detected" stack trace identifying the code path that borrowed the connection and did not return it within the threshold, and the MBeans give live pool-counter visibility. The actual leak fix is a separate change; this PR is the diagnostic enabler.
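To make the failure mode concrete, here is a toy model of the leak signature described above. TinyPool, Lease, leakyCheck and safeCheck are invented for illustration; they are not HikariCP or CloudStack classes, and the suspected Host-HA code path is not reproduced here. A periodic task that borrows without returning raises the active count on every run, whereas try-with-resources returns the lease even if the body throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LeakDemo {
    // Toy stand-in for a connection pool; it only counts outstanding leases.
    static class TinyPool {
        private final int max;
        private final AtomicInteger active = new AtomicInteger();

        TinyPool(int max) {
            this.max = max;
        }

        Lease borrow() {
            if (active.incrementAndGet() > max) {
                active.decrementAndGet();
                throw new IllegalStateException("pool exhausted");
            }
            return new Lease(this);
        }

        int active() {
            return active.get();
        }
    }

    // Toy stand-in for a pooled connection; close() returns it to the pool.
    static class Lease implements AutoCloseable {
        private final TinyPool pool;
        private boolean closed;

        Lease(TinyPool pool) {
            this.pool = pool;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                pool.active.decrementAndGet();
            }
        }
    }

    // Leaky pattern: the lease is borrowed and never closed.
    static void leakyCheck(TinyPool pool) {
        Lease lease = pool.borrow();
        // ... periodic work would happen here; the lease is never returned ...
    }

    // Safe pattern: try-with-resources always returns the lease.
    static void safeCheck(TinyPool pool) {
        try (Lease lease = pool.borrow()) {
            // ... periodic work would happen here ...
        }
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(250);
        for (int i = 0; i < 100; i++) {
            safeCheck(pool);
        }
        System.out.println("after 100 safe checks: active=" + pool.active());
        for (int i = 0; i < 100; i++) {
            leakyCheck(pool);
        }
        System.out.println("after 100 leaky checks: active=" + pool.active());
    }
}
```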

Everything is disabled by default, so there is no behavioural change for existing deployments that don't set the new properties.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

N/A

How Has This Been Tested?

Build: compiled the affected module and its dependencies off tag 4.22.1.0:

mvn -pl framework/db -am -DskipTests compile

Result: BUILD SUCCESS (checkstyle passed). The change is confined to property parsing and threading through the existing createDataSource → createHikaricpDataSource chain, reusing the existing parseNumber(...) helper.

Unit tests: the apply-logic is factored into a package-private applyHikariDebugSettings(HikariConfig, Long, Boolean, String) and covered by 4 new TransactionLegacyTest cases — defaults-disabled, 0-keeps-disabled, leak-detection-enabled (60000), and register-MBeans-enabled:

mvn -pl framework/db test -Dtest='TransactionLegacyTest#applyHikariDebugSettings*'

Result: Tests run: 4, Failures: 0, Errors: 0 — BUILD SUCCESS.

Runtime validation plan (on a patched management server):

  1. In /etc/cloudstack/management/db.properties:
    db.cloud.leakDetectionThreshold=60000
    db.cloud.registerMbeans=true
    
  2. systemctl restart cloudstack-management
  3. Reproduce the Host-HA scenario and watch:
    grep -iE "leak detection|Apparent connection leak|ProxyLeakTask|Hikari" \
      /var/log/cloudstack/management/management-server.log
    
    Expected: java.lang.Exception: Apparent connection leak detected with a stack trace through com.zaxxer.hikari.HikariDataSource.getConnection(...) → com.cloud.utils.db.TransactionLegacy... identifying the borrowing path.
  4. With registerMbeans=true, the com.zaxxer.hikari:type=Pool (cloud) MBean is visible via JMX for live pool counters.
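For step 4, a minimal sketch of reading the pool counters programmatically is shown below. The class HikariPoolCounters and its describePools method are hypothetical names. The attribute names are those exposed by HikariCP's HikariPoolMXBean, and the object-name pattern assumes HikariCP's default naming of the form com.zaxxer.hikari:type=Pool (<poolName>). The main method queries the in-process platform MBean server only to keep the example runnable; against a real management server one would obtain a remote MBeanServerConnection instead, which requires JMX remote access to be enabled and is not shown.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class HikariPoolCounters {
    // Counters exposed by HikariCP's pool MXBean.
    static final String[] COUNTERS = {
        "ActiveConnections", "IdleConnections", "TotalConnections", "ThreadsAwaitingConnection"
    };

    // Returns one line per registered HikariCP pool MBean, listing its counters.
    static List<String> describePools(MBeanServerConnection server) throws Exception {
        // Value wildcard: matches "Pool (cloud)", "Pool (usage)", and so on.
        ObjectName pattern = new ObjectName("com.zaxxer.hikari:type=Pool (*)");
        Set<ObjectName> names = server.queryNames(pattern, null);

        List<String> lines = new ArrayList<>();
        for (ObjectName name : names) {
            StringBuilder sb = new StringBuilder(name.getKeyProperty("type"));
            for (String counter : COUNTERS) {
                sb.append(' ').append(counter).append('=')
                  .append(server.getAttribute(name, counter));
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // In-process server for illustration; empty unless this JVM hosts a HikariCP pool
        // with registerMbeans enabled.
        MBeanServerConnection server = ManagementFactory.getPlatformMBeanServer();
        List<String> lines = describePools(server);
        if (lines.isEmpty()) {
            System.out.println("no HikariCP pool MBeans found");
        }
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

A steadily rising ActiveConnections with IdleConnections near zero would match the leak signature seen in SHOW PROCESSLIST.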

How did you try to break this feature and the system with this change?

Edge cases considered:

  • Property absent → null → leak detection disabled, registerMbeans=false (existing behaviour preserved).
  • leakDetectionThreshold=0 → not applied (disabled).
  • leakDetectionThreshold between 1–1999 ms → passed to Hikari, which warns and ignores it (documented Hikari behaviour; noted in the code comment and the sample config).
  • registerMbeans=false explicitly → MBeans off.
  • DBCP connection-pool path → new values are not forwarded, so behaviour is unchanged.
  • Default datasource fallback path (getDefaultHikaricpDataSource) → untouched.

These cases (defaults, 0, enabled threshold, enabled MBeans) are locked down by the new applyHikariDebugSettings unit tests.

@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 47.36842% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.67%. Comparing base (348ce95) to head (1d2bfae).
⚠️ Report is 13 commits behind head on 4.22.

Files with missing lines Patch % Lines
...ain/java/com/cloud/utils/db/TransactionLegacy.java 47.36% 9 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #13407      +/-   ##
============================================
- Coverage     17.68%   17.67%   -0.01%     
+ Complexity    15793    15791       -2     
============================================
  Files          5922     5922              
  Lines        533123   533182      +59     
  Branches      65201    65210       +9     
============================================
- Hits          94268    94251      -17     
- Misses       428212   428284      +72     
- Partials      10643    10647       +4     
Flag Coverage Δ
uitests 3.69% <ø> (-0.01%) ⬇️
unittests 18.75% <47.36%> (-0.01%) ⬇️

@andrijapanicsb

Contributor Author

@blueorangutan package kvm

@blueorangutan

@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with kvm SystemVM template(s). I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18233

@andrijapanicsb

Contributor Author

@blueorangutan help

@blueorangutan

@andrijapanicsb [SL] I understand these words: "help", "hello", "thanks", "package", "test"
Test command usage: test [mgmt os] [hypervisor] [keepEnv] [qemuEv] [basicZone|securityGroups]
Mgmt OS options: ['suse15', 'alma10', 'ol10', 'rocky10', 'alma9', 'centos7', 'centos6', 'rocky9', 'alma8', 'ubuntu18', 'ol9', 'ol8', 'ubuntu22', 'debian12', 'ubuntu20', 'rocky8', 'ubuntu24']
Hypervisor options: ['kvm-centos6', 'kvm-centos7', 'kvm-rocky8', 'kvm-rocky9', 'kvm-rocky10', 'kvm-ol8', 'kvm-ol9', 'kvm-ol10', 'kvm-alma8', 'kvm-alma9', 'kvm-alma10', 'kvm-ubuntu18', 'kvm-ubuntu20', 'kvm-ubuntu22', 'kvm-ubuntu24', 'kvm-debian12', 'kvm-suse15', 'vmware-55u3', 'vmware-60u2', 'vmware-65u2', 'vmware-67u3', 'vmware-70u1', 'vmware-70u2', 'vmware-70u3', 'vmware-80', 'vmware-80u1', 'vmware-80u2', 'vmware-80u3', 'vmware-80u3e', 'xenserver-65sp1', 'xenserver-71', 'xenserver-74', 'xenserver-84', 'xcpng74', 'xcpng76', 'xcpng80', 'xcpng81', 'xcpng82', 'xcpng83']
Note: when keepEnv is passed, you need to specify mgmt server os and hypervisor or use the matrix command.
When qemuEv is passed, it will deploy KVM hypervisor hosts with qemu-kvm-ev, else it will default to stock qemu.
When basicZone and/or securityGroups are passed it will create a zone of the last type specified (default is Advanced)
Package command usage: package [all(default value),kvm,xen,vmware,hyperv,ovm] - a comma separated list can be passed with package command to bundle the required hypervisor's systemVM templates. Not passing any argument will bundle all - kvm,xen and vmware templates.

Blessed contributors for kicking Trillian test jobs: ['rohityadavcloud', 'shwstppr', 'Damans227', 'vishesh92', 'Pearl1594', 'harikrishna-patnala', 'nvazquez', 'DaanHoogland', 'weizhouapache', 'borisstoyanov', 'vladimirpetrov', 'kiranchavala', 'andrijapanicsb', 'NuxRo', 'rajujith', 'sureshanaparti', 'abh1sar', 'sudo87', 'RosiKyu']
