
Exit JVM and dump heap on OutOfMemoryError for server containers #1683

Merged
jcschaff merged 1 commit into master from jvm-exit-on-oom
May 7, 2026
Conversation


@jcschaff jcschaff commented May 7, 2026

Summary

Add -XX:+ExitOnOutOfMemoryError, -XX:+HeapDumpOnOutOfMemoryError, and -XX:HeapDumpPath=/dump to the long-running server JVMs (data, api, submit, sched, db, rest). On OutOfMemoryError the JVM now writes a heap dump to /dump and exits cleanly so K8s recreates the container, instead of limping along in undefined state.

Why

A scan of 213 days of Loki retention confirmed that the prod data-pod wedges on 2026-05-04 and 2026-05-06 were both precipitated by Java heap space errors that corrupted the JVM-wide static InactivityMonitor.READ_CHECK_TIMER mid-TimerTask:

```
06:39:39      ERROR  Java heap space
06:39:54      ERROR  Java heap space
06:40:38      ERROR  Java heap space (×4)
06:41:27.157  ERROR  Scheduled task error   ← TimerThread's task hit OOM, thread died
06:41:27.188  WARN   FailoverTransport ... after: 1 attempt(s) with Timer already cancelled.
[wedge from this point on; manual restart required]
```

With ExitOnOutOfMemoryError, the JVM aborts on the first OOM before the InactivityMonitor's TimerThread can be touched, eliminating the OOM-driven path to the wedge entirely. The companion JmsFailoverWatchdog in #1681 remains as defense-in-depth for non-OOM wedge causes.
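The wedge mechanism can be reproduced in isolation with plain `java.util.Timer`: an uncaught Throwable from any TimerTask permanently terminates the shared TimerThread, after which every later schedule() call fails with "Timer already cancelled." — the same message seen in the FailoverTransport warning above. A minimal sketch (the timer name and error message are illustrative, not VCell code):

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerDeathDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stands in for the JVM-wide static InactivityMonitor timer.
        Timer timer = new Timer("read-check-timer");

        timer.schedule(new TimerTask() {
            @Override public void run() {
                // Simulates a TimerTask hitting OOM mid-run; any uncaught
                // Throwable kills the TimerThread for good.
                throw new OutOfMemoryError("simulated: Java heap space");
            }
        }, 0L);

        Thread.sleep(300); // give the task time to run and kill the thread

        try {
            timer.schedule(new TimerTask() {
                @Override public void run() { /* never runs */ }
            }, 1_000L);
            System.out.println("scheduled OK");
        } catch (IllegalStateException e) {
            // The state every later caller observes for the rest of the JVM lifetime.
            System.out.println("schedule failed: " + e.getMessage());
        }
    }
}
```

On a stock JDK this prints `schedule failed: Timer already cancelled.` (plus the uncaught-error stack trace on stderr), illustrating why aborting on the first OOM is the only clean recovery once a static Timer is involved.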

HeapDumpOnOutOfMemoryError writes a .hprof file to /dump immediately before the JVM aborts, preserving a postmortem artifact for analysis (Eclipse MAT, VisualVM, JXRay, etc.) so we can find what's actually consuming the heap.

Companion PR (already merged)

virtualcell/vcell-fluxcd adds a /dump emptyDir volume mount (sizeLimit 4Gi, default node-disk backing) to all six affected Deployments and bumps the data container's resources.limits.memory from 3000Mi to 8000Mi. That PR is merged; without these JVM flags it's simply a latent mount and extra heap headroom — no behavior change. This PR activates the actual exit-on-OOM behavior on the next image roll.

Scope

| Dockerfile | Change |
| --- | --- |
| docker/build/Dockerfile-data-dev | 3 flags after -XX:MaxRAMPercentage=80 |
| docker/build/Dockerfile-api-dev | same |
| docker/build/Dockerfile-submit-dev | same |
| docker/build/Dockerfile-sched-dev | same |
| docker/build/Dockerfile-db-dev | same |
| vcell-rest/src/main/docker/Dockerfile.jvm | appended to JAVA_OPTS |
| vcell-rest/src/main/docker/Dockerfile.legacy-jar | appended to JAVA_OPTS |

Skipped intentionally: Dockerfile-batch-dev (per-job SLURM lifecycle, no /dump volume), Dockerfile-clientgen-dev and docker/build/admin (build/utility tools), Dockerfile.native and Dockerfile.native-micro (Quarkus native image — -XX: flags don't apply to GraalVM builds).
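For the five swarm-style images, the resulting option block has roughly this shape. This is a sketch only: the variable name and surrounding options are illustrative, and the actual ENTRYPOINT/ENV wording differs per Dockerfile (the Quarkus images instead append the same three flags to their existing JAVA_OPTS value):

```dockerfile
# Illustrative excerpt — real Dockerfiles carry additional options before and after.
ENV JVM_OPTS="-XX:MaxRAMPercentage=80 \
    -XX:+ExitOnOutOfMemoryError \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/dump"
```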

Test plan

  • git diff --stat confirms 7 files changed, 20 insertions / 2 deletions
  • grep verification that all 3 flags landed in each of the 7 Dockerfiles
  • CI image build succeeds for each Dockerfile (relies on existing CI)
  • After release tag (probably 7.7.0.76), fluxcd auto-rolls dev. Verify on the new data pod:
    kubectl -n dev exec deployment/data -- cat /proc/1/cmdline | tr '\0' ' ' | grep -oE 'ExitOnOutOfMemoryError|HeapDumpOnOutOfMemoryError|HeapDumpPath=/dump'
    Should print all three flags.
  • kubectl -n dev exec deployment/data -- ls -la /dump shows the directory empty and writable.
  • Promote to stage, then prod.
  • On the next OOM event in any environment, confirm:
    • JVM stderr in Loki: Aborting due to java.lang.OutOfMemoryError
    • Heap dump file present at /dump/java_pid1.hprof (use kubectl cp to retrieve)
    • K8s pod restart event in kubectl describe pod and in Loki's kubectl container stream

🤖 Generated with Claude Code

Add -XX:+ExitOnOutOfMemoryError, -XX:+HeapDumpOnOutOfMemoryError, and
-XX:HeapDumpPath=/dump to the long-running server JVMs (data, api,
submit, sched, db, rest). On OOM the JVM now writes a heap dump to
/dump and exits cleanly so K8s recreates the container, instead of
limping along in undefined state.

Motivated by the 2026-05-04 / 2026-05-06 prod data-pod wedges. Both
were precipitated by Java heap space errors which corrupted the
JVM-wide static InactivityMonitor Timer (a TimerTask hit OOM mid-run
and the TimerThread silently terminated). With these flags, the JVM
aborts on the first OOM before the InactivityMonitor TimerThread can
be touched, eliminating the OOM-driven path to the failover-wedge
condition entirely. The companion JmsFailoverWatchdog (in PR #1681)
remains as defense-in-depth for non-OOM wedge causes.

Targeted scope:
- 5 swarm-style server Dockerfiles (data, api, submit, sched, db) —
  add the three flags right after -XX:MaxRAMPercentage=80.
- 2 Quarkus rest Dockerfiles (Dockerfile.jvm, Dockerfile.legacy-jar) —
  append to JAVA_OPTS.

Skipped intentionally:
- Dockerfile-batch-dev (per-job SLURM lifecycle, no /dump volume).
- Dockerfile-clientgen-dev / docker/build/admin (build/utility tools).
- Dockerfile.native and Dockerfile.native-micro (Quarkus native image;
  -XX: flags don't apply to GraalVM native builds).

Companion change required in vcell-fluxcd: add a /dump emptyDir volume
mount to each affected Deployment so the JVM has somewhere to write
the dump. Without that mount the dump silently fails and the JVM still
exits — log signal still works, just no postmortem artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff merged commit 4a77981 into master May 7, 2026
13 checks passed
@jcschaff jcschaff deleted the jvm-exit-on-oom branch May 7, 2026 04:14
jcschaff added a commit that referenced this pull request May 7, 2026
Bound the FailoverTransport reconnect budget with maxReconnectAttempts=20
and exponential backoff (1s → 30s) in VCMessagingServiceActiveMQ; keep
startupMaxReconnectAttempts=-1 so pod boot tolerates a slow broker.
Without this the FailoverTransport reconnects forever, which is wrong
behavior in K8s where pod restart is the right response to a sustained
broker outage.
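The settings above map onto ActiveMQ's failover transport options roughly as follows; the broker host/port are placeholders, and the exact URI construction lives in VCMessagingServiceActiveMQ:

```
failover:(tcp://<broker-host>:61616)?maxReconnectAttempts=20&startupMaxReconnectAttempts=-1&initialReconnectDelay=1000&maxReconnectDelay=30000&useExponentialBackOff=true
```

Here initialReconnectDelay=1000 and maxReconnectDelay=30000 give the 1s → 30s exponential backoff, while startupMaxReconnectAttempts=-1 keeps the unbounded retry budget for initial pod boot only.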

Add JmsFailoverWatchdog: a TransportListener attached to each
ActiveMQConnection that runs a caller-supplied Runnable when the failover
layer reports a terminal IOException. The terminal action is constructor-
injected so production wiring stays visible at the composition root and
tests can substitute their own handler. Two factory methods: logOnly()
(the default — log lifecycle events but take no further action) and
jvmExitOnTerminal() (escape hatch for any future service that wants K8s
pod recycle on terminal transport failure).

VCMessagingServiceJms holds a watchdog field with a setter; the default
is JmsFailoverWatchdog.logOnly(). No service opts into jvmExitOnTerminal
in this change — the setter is the escape hatch for any future need.

Wired into MessageProducerSessionJms and ConsumerContextJms — the two
long-lived JMS connection sites. Short-lived batch processes
(OptimizationBatchServer, JavaSimulationExecutable) intentionally skipped.

Why this is defense-in-depth, not the primary fix: the OOM-driven wedge
mechanism that originally motivated this work — a TimerTask hitting
OutOfMemoryError on the JVM-wide static InactivityMonitor heartbeat
Timer, killing the TimerThread silently and corrupting the failover
transport for the rest of the JVM lifetime — is closed off by the
-XX:+ExitOnOutOfMemoryError flag added in PR #1683 (the JVM aborts on
the first OOM before the InactivityMonitor TimerThread can be touched).
The watchdog covers non-OOM terminal-failover paths: sustained network
partition, broker maintenance > 8 min, or any future client regression
in the static-Timer / static-counter design (still present in 5.18.x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>