Don't cache failed DNS lookups in the desktop client by jcschaff · Pull Request #1684 · virtualcell/vcell

jcschaff · 2026-05-07T14:43:29Z

Summary

When a laptop wakes from sleep, the macOS resolver briefly returns EAI_NONAME for VCell internal hostnames while network state catches up. Java's InetAddress cache picks up the negative result. Our AsynchMessageManager polling loop hits the cache faster than the 10-second default negative TTL can expire, so the cached failure never clears and the client wedges at "connecting…" indefinitely. Restarting the client is the only workaround.

Observed in ~/.vcell/logs/vcellrun_alpha.log:

>> polling failure << vcell-dev.cam.uchc.edu: nodename nor servname provided, or not known
>> polling failure << vcell-dev.cam.uchc.edu
>> polling failure << vcell-dev.cam.uchc.edu
[repeating]

The OS resolver itself is healthy at the time: `host vcell-dev.cam.uchc.edu` returns the correct IP. The cache is purely inside the JVM.

Fix

Set asymmetric DNS-cache TTL via Security.setProperty() at the top of VCellClientMain.main(), before any networking happens in the JVM:

java.security.Security.setProperty("networkaddress.cache.ttl", "30");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0");

- networkaddress.cache.ttl=30: a modest positive cache, so the polling loop doesn't hammer DNS but stale entries don't linger past 30s. This matches the default for non-SecurityManager JVMs (we don't install one, but stating it explicitly is robust against future config changes).
- networkaddress.cache.negative.ttl=0: never cache failures. The cost of one extra DNS query per failed poll is trivial; the cost of the wedge is several minutes of user confusion and a restart.

Asymmetric TTL is the established pattern for long-running JVMs that need to recover from transient DNS issues. The AWS SDK guide, the Elasticsearch operations docs, and the Oracle networking-properties reference all recommend this shape.

Why programmatic, not `-D` flags or install4j config

These are security properties, not system properties — -Dnetworkaddress.cache.ttl=30 does not work. (Confirmed via Oracle JDK docs and OpenJDK source.) The legacy fallback -Dsun.net.inetaddr.ttl=30 works on older JDKs but is undocumented/unstable in modern Java.
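The split between the two property namespaces can be checked with a small sketch. The class and method names below are hypothetical and not part of the VCell code; the sketch only shows that a value set as a system property (which is what a `-D` flag does) is not what `Security.getProperty()` returns for the same key.

```java
import java.security.Security;

// Illustrative check only; class and method names are hypothetical.
final class SecurityVsSystemProperty {
    static final String KEY = "networkaddress.cache.negative.ttl";

    /** Sets the system property (as a -D flag would) and returns the security property. */
    static String securityValueAfterSystemSet(String value) {
        System.setProperty(KEY, value);
        return Security.getProperty(KEY);
    }

    public static void main(String[] args) {
        // The system property changes, the security property keeps the JDK's configured value.
        String security = securityValueAfterSystemSet("0");
        System.out.println("system property:   " + System.getProperty(KEY));
        System.out.println("security property: " + security);

        // Only Security.setProperty() changes the value the resolver cache policy reads.
        Security.setProperty(KEY, "0");
        System.out.println("after setProperty: " + Security.getProperty(KEY));
    }
}
```

The value printed for the untouched security property depends on the JDK's `java.security` file, so no specific output is assumed here.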

Security.setProperty() in main() is the only reliable mechanism that:

- Works identically across Java 8/11/17/21.
- Is visible to anyone reading the code.
- Doesn't require adding a custom java.security overlay file to the install4j build.

Critical timing constraint: the cache initializes lazily and reads Security properties on first DNS lookup. Verified that VCellClientMain.main() (lines 82-91) does no networking before commandLine.execute(args), so setting at the top of main() is safe with margin.
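The ordering requirement can be sketched as follows. This is a minimal stand-in, not the actual `VCellClientMain`: the class name, the `configureDnsCache()` helper, and the loopback call that stands in for the client's later networking are all assumptions made for illustration.

```java
import java.net.InetAddress;
import java.security.Security;

// Hypothetical stand-in for the client entry point; not the real VCellClientMain.
final class StartupOrderSketch {

    /** Applies the asymmetric TTLs. Must run before the first InetAddress lookup. */
    static void configureDnsCache() {
        Security.setProperty("networkaddress.cache.ttl", "30");         // cache successes for 30 s
        Security.setProperty("networkaddress.cache.negative.ttl", "0"); // never cache failures
    }

    public static void main(String[] args) {
        // First statement in main(): nothing has touched InetAddress yet.
        configureDnsCache();

        // Stand-in for the rest of startup (commandLine.execute(args) in the PR).
        // Any InetAddress use from here on sees the values set above.
        InetAddress loopback = InetAddress.getLoopbackAddress();
        System.out.println("networking started, loopback = " + loopback.getHostAddress());
    }
}
```

Keeping the two calls in a single helper invoked as the first line of `main()` makes the constraint easy to audit: any new code added above that line is a visible red flag in review.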

Test plan

- mvn compile -pl vcell-client -am passes
- Reproduce the original failure pattern (optional, manual):
  - Run the client from IDE.
  - In another terminal, run sudo killall -HUP mDNSResponder while the client is polling; this simulates a brief DNS hiccup.
  - Confirm the client recovers within ~1 polling interval after DNS comes back, instead of getting stuck.
- Ship in next desktop release. No server change needed; the fix is purely client-side.
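For the manual reproduction, a standalone probe can make JVM-side resolution visible without running the full client. The sketch below is hypothetical and not part of this PR: `DnsProbe`, its arguments, and its output format are assumptions. It resolves a hostname once per second and prints each outcome; an optional third argument sets the negative TTL before the first lookup so runs with different values can be compared.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.time.LocalTime;

// Hypothetical standalone probe for manual testing; not part of the VCell client.
final class DnsProbe {

    /** Resolves the host once and reports the outcome as a single line. */
    static String probe(String host) {
        try {
            return "ok   " + host + " -> " + InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "FAIL " + host + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length < 2) {
            System.out.println("usage: DnsProbe <host> <attempts> [negativeTtlSeconds]");
            return;
        }
        String host = args[0];
        int attempts = Integer.parseInt(args[1]);
        if (args.length > 2) {
            // Must happen before the first lookup, as in the client fix.
            Security.setProperty("networkaddress.cache.negative.ttl", args[2]);
        }
        for (int i = 0; i < attempts; i++) {
            System.out.println(LocalTime.now() + " " + probe(host));
            Thread.sleep(1000); // one probe per second
        }
    }
}
```

One possible invocation is `java DnsProbe.java vcell-dev.cam.uchc.edu 60 0`, with the DNS hiccup triggered from another terminal while it runs.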

Out of scope (filed mentally as future work)

- Java 21+ networkaddress.cache.stale.ttl (JDK-8306653, "stale-data reuse"): a different problem; revisit when desktop moves to a newer JVM.
- HTTP-client connection pooling that may also hold stale state across sleep: an orthogonal symptom requiring different mitigation.
- Server-side variants (networkaddress.cache.* settings for vcell-data, vcell-rest, etc.). The server pods don't sleep so they're less exposed; the OOM-driven failure modes are already addressed by Exit JVM and dump heap on OutOfMemoryError for server containers #1683.

🤖 Generated with Claude Code

When a laptop wakes from sleep, transient DNS resolution failures get cached by Java's InetAddress cache. Our continuous polling loop then hits the cached failure faster than the 10s default negative TTL can expire, wedging the client at "connecting…" until the user restarts it. The OS resolver is fine (`host vcell-dev.cam.uchc.edu` returns the correct IP), but the JVM never re-queries it.

Set Security properties at the top of main(), before any DNS lookup:
- networkaddress.cache.ttl=30 (modest positive cache, matches the non-SecurityManager default but makes it explicit and stable)
- networkaddress.cache.negative.ttl=0 (never cache failures; recover immediately when DNS comes back)

Asymmetric TTL is the established pattern for long-running JVMs that need to recover from transient DNS issues; see AWS SDK and Oracle networking-properties guidance.

Programmatic Security.setProperty is the only reliable mechanism: -D JVM args don't bridge to security properties, and -Dsun.net.inetaddr.ttl is undocumented in modern Java.

Observed in user log:
>> polling failure << vcell-dev.cam.uchc.edu: nodename nor servname provided, or not known
(repeating, after laptop sleep/wake)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jcschaff merged commit ffb6f99 into master May 7, 2026
13 checks passed

jcschaff deleted the client-dns-negative-cache branch May 7, 2026 15:10
