
Exit JVM and dump heap on OutOfMemoryError for server containers #1683

Merged
jcschaff merged 1 commit into master from jvm-exit-on-oom
May 7, 2026
Conversation


@jcschaff jcschaff commented May 7, 2026

Summary

Add -XX:+ExitOnOutOfMemoryError, -XX:+HeapDumpOnOutOfMemoryError, and -XX:HeapDumpPath=/dump to the long-running server JVMs (data, api, submit, sched, db, rest). On OutOfMemoryError the JVM now writes a heap dump to /dump and exits cleanly so K8s recreates the container, instead of limping along in undefined state.

Why

A scan of 213 days of Loki retention confirmed that the prod data-pod wedges on 2026-05-04 and 2026-05-06 were both precipitated by Java heap space errors that corrupted the JVM-wide static InactivityMonitor.READ_CHECK_TIMER mid-TimerTask:

```
06:39:39      ERROR  Java heap space
06:39:54      ERROR  Java heap space
06:40:38      ERROR  Java heap space (×4)
06:41:27.157  ERROR  Scheduled task error   ← TimerThread's task hit OOM, thread died
06:41:27.188  WARN   FailoverTransport ... after: 1 attempt(s) with Timer already cancelled.
[wedge from this point on; manual restart required]
```

With ExitOnOutOfMemoryError, the JVM aborts on the first OOM before the InactivityMonitor's TimerThread can be touched, eliminating the OOM-driven path to the wedge entirely. The companion JmsFailoverWatchdog in #1681 remains as defense-in-depth for non-OOM wedge causes.
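The wedge mechanism can be reproduced in isolation with plain `java.util.Timer`: an uncaught Throwable from any TimerTask permanently terminates the shared TimerThread, after which every later schedule() call fails with "Timer already cancelled." — the same message seen in the FailoverTransport warning above. A minimal sketch (the timer name and error message are illustrative, not VCell code):

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerDeathDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stands in for the JVM-wide static InactivityMonitor timer.
        Timer timer = new Timer("read-check-timer");

        timer.schedule(new TimerTask() {
            @Override public void run() {
                // Simulates a TimerTask hitting OOM mid-run; any uncaught
                // Throwable kills the TimerThread for good.
                throw new OutOfMemoryError("simulated: Java heap space");
            }
        }, 0L);

        Thread.sleep(300); // give the task time to run and kill the thread

        try {
            timer.schedule(new TimerTask() {
                @Override public void run() { /* never runs */ }
            }, 1_000L);
            System.out.println("scheduled OK");
        } catch (IllegalStateException e) {
            // The state every later caller observes for the rest of the JVM lifetime.
            System.out.println("schedule failed: " + e.getMessage());
        }
    }
}
```

On a stock JDK this prints `schedule failed: Timer already cancelled.` (plus the uncaught-error stack trace on stderr), illustrating why aborting on the first OOM is the only clean recovery once a static Timer is involved.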

HeapDumpOnOutOfMemoryError writes a .hprof file to /dump immediately before the JVM aborts, preserving a postmortem artifact for analysis (Eclipse MAT, VisualVM, JXRay, etc.) so we can find what's actually consuming the heap.

Companion PR (already merged)

virtualcell/vcell-fluxcd adds a /dump emptyDir volume mount (sizeLimit 4Gi, default node-disk backing) to all six affected Deployments and bumps the data container's resources.limits.memory from 3000Mi to 8000Mi. That PR is merged; without these JVM flags it's simply a latent mount and extra heap headroom — no behavior change. This PR activates the actual exit-on-OOM behavior on the next image roll.

Scope

| Dockerfile | Change |
| --- | --- |
| docker/build/Dockerfile-data-dev | 3 flags after -XX:MaxRAMPercentage=80 |
| docker/build/Dockerfile-api-dev | same |
| docker/build/Dockerfile-submit-dev | same |
| docker/build/Dockerfile-sched-dev | same |
| docker/build/Dockerfile-db-dev | same |
| vcell-rest/src/main/docker/Dockerfile.jvm | appended to JAVA_OPTS |
| vcell-rest/src/main/docker/Dockerfile.legacy-jar | appended to JAVA_OPTS |

Skipped intentionally: Dockerfile-batch-dev (per-job SLURM lifecycle, no /dump volume), Dockerfile-clientgen-dev and docker/build/admin (build/utility tools), Dockerfile.native and Dockerfile.native-micro (Quarkus native image — -XX: flags don't apply to GraalVM builds).
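For the five swarm-style images, the resulting option block has roughly this shape. This is a sketch only: the variable name and surrounding options are illustrative, and the actual ENTRYPOINT/ENV wording differs per Dockerfile (the Quarkus images instead append the same three flags to their existing JAVA_OPTS value):

```dockerfile
# Illustrative excerpt — real Dockerfiles carry additional options before and after.
ENV JVM_OPTS="-XX:MaxRAMPercentage=80 \
    -XX:+ExitOnOutOfMemoryError \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/dump"
```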

Test plan

  • git diff --stat confirms 7 files changed, 20 insertions / 2 deletions
  • grep verification that all 3 flags landed in each of the 7 Dockerfiles
  • CI image build succeeds for each Dockerfile (relies on existing CI)
  • After release tag (probably 7.7.0.76), fluxcd auto-rolls dev. Verify on the new data pod:
    kubectl -n dev exec deployment/data -- cat /proc/1/cmdline | tr '\0' ' ' | grep -oE 'ExitOnOutOfMemoryError|HeapDumpOnOutOfMemoryError|HeapDumpPath=/dump'
    Should print all three flags.
  • kubectl -n dev exec deployment/data -- ls -la /dump shows the directory empty and writable.
  • Promote to stage, then prod.
  • On the next OOM event in any environment, confirm:
    • JVM stderr in Loki: Aborting due to java.lang.OutOfMemoryError
    • Heap dump file present at /dump/java_pid1.hprof (use kubectl cp to retrieve)
    • K8s pod restart event in kubectl describe pod and in Loki's kubectl container stream

🤖 Generated with Claude Code

Add -XX:+ExitOnOutOfMemoryError, -XX:+HeapDumpOnOutOfMemoryError, and
-XX:HeapDumpPath=/dump to the long-running server JVMs (data, api,
submit, sched, db, rest). On OOM the JVM now writes a heap dump to
/dump and exits cleanly so K8s recreates the container, instead of
limping along in undefined state.

Motivated by the 2026-05-04 / 2026-05-06 prod data-pod wedges. Both
were precipitated by Java heap space errors which corrupted the
JVM-wide static InactivityMonitor Timer (a TimerTask hit OOM mid-run
and the TimerThread silently terminated). With these flags, the JVM
aborts on the first OOM before the InactivityMonitor TimerThread can
be touched, eliminating the OOM-driven path to the failover-wedge
condition entirely. The companion JmsFailoverWatchdog (in PR #1681)
remains as defense-in-depth for non-OOM wedge causes.

Targeted scope:
- 5 swarm-style server Dockerfiles (data, api, submit, sched, db) —
  add the three flags right after -XX:MaxRAMPercentage=80.
- 2 Quarkus rest Dockerfiles (Dockerfile.jvm, Dockerfile.legacy-jar) —
  append to JAVA_OPTS.

Skipped intentionally:
- Dockerfile-batch-dev (per-job SLURM lifecycle, no /dump volume).
- Dockerfile-clientgen-dev / docker/build/admin (build/utility tools).
- Dockerfile.native and Dockerfile.native-micro (Quarkus native image;
  -XX: flags don't apply to GraalVM native builds).

Companion change required in vcell-fluxcd: add a /dump emptyDir volume
mount to each affected Deployment so the JVM has somewhere to write
the dump. Without that mount the dump silently fails and the JVM still
exits — log signal still works, just no postmortem artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff merged commit 4a77981 into master May 7, 2026
13 checks passed
@jcschaff jcschaff deleted the jvm-exit-on-oom branch May 7, 2026 04:14
jcschaff added a commit that referenced this pull request May 7, 2026
Bound the FailoverTransport reconnect budget with maxReconnectAttempts=20
and exponential backoff (1s → 30s) in VCMessagingServiceActiveMQ; keep
startupMaxReconnectAttempts=-1 so pod boot tolerates a slow broker.
Without this the FailoverTransport reconnects forever, which is wrong
behavior in K8s where pod restart is the right response to a sustained
broker outage.
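The settings above map onto ActiveMQ's failover transport options roughly as follows; the broker host/port are placeholders, and the exact URI construction lives in VCMessagingServiceActiveMQ:

```
failover:(tcp://<broker-host>:61616)?maxReconnectAttempts=20&startupMaxReconnectAttempts=-1&initialReconnectDelay=1000&maxReconnectDelay=30000&useExponentialBackOff=true
```

Here initialReconnectDelay=1000 and maxReconnectDelay=30000 give the 1s → 30s exponential backoff, while startupMaxReconnectAttempts=-1 keeps the unbounded retry budget for initial pod boot only.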

Add JmsFailoverWatchdog: a TransportListener attached to each
ActiveMQConnection that runs a caller-supplied Runnable when the failover
layer reports a terminal IOException. The terminal action is constructor-
injected so production wiring stays visible at the composition root and
tests can substitute their own handler. Two factory methods: logOnly()
(the default — log lifecycle events but take no further action) and
jvmExitOnTerminal() (escape hatch for any future service that wants K8s
pod recycle on terminal transport failure).

VCMessagingServiceJms holds a watchdog field with a setter; the default
is JmsFailoverWatchdog.logOnly(). No service opts into jvmExitOnTerminal
in this change — the setter is the escape hatch for any future need.

Wired into MessageProducerSessionJms and ConsumerContextJms — the two
long-lived JMS connection sites. Short-lived batch processes
(OptimizationBatchServer, JavaSimulationExecutable) intentionally skipped.

Why this is defense-in-depth, not the primary fix: the OOM-driven wedge
mechanism that originally motivated this work — a TimerTask hitting
OutOfMemoryError on the JVM-wide static InactivityMonitor heartbeat
Timer, killing the TimerThread silently and corrupting the failover
transport for the rest of the JVM lifetime — is closed off by the
-XX:+ExitOnOutOfMemoryError flag added in PR #1683 (the JVM aborts on
the first OOM before the InactivityMonitor TimerThread can be touched).
The watchdog covers non-OOM terminal-failover paths: sustained network
partition, broker maintenance > 8 min, or any future client regression
in the static-Timer / static-counter design (still present in 5.18.x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>