Don't cache failed DNS lookups in the desktop client #1684

Merged
jcschaff merged 1 commit into master from client-dns-negative-cache
May 7, 2026

jcschaff commented May 7, 2026

Summary

When a laptop wakes from sleep, the macOS resolver briefly returns EAI_NONAME for VCell internal hostnames while network state catches up. Java's InetAddress cache picks up the negative result. Our AsynchMessageManager polling loop hits the cache faster than the 10-second default negative TTL can expire, so the cached failure never clears and the client wedges at "connecting…" indefinitely. Restarting the client is the only workaround.

Observed in ~/.vcell/logs/vcellrun_alpha.log:

>> polling failure << vcell-dev.cam.uchc.edu: nodename nor servname provided, or not known
>> polling failure << vcell-dev.cam.uchc.edu
>> polling failure << vcell-dev.cam.uchc.edu
[repeating]

The OS resolver itself is healthy at the time — host vcell-dev.cam.uchc.edu returns the correct IP. The cache is purely inside the JVM.
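
To confirm which cache policy a given JVM is actually running with, the security properties can be read back directly. This is a minimal sketch, not code from this PR; the class name DnsCacheSettings is hypothetical. A key that is unset in the JVM's java.security file comes back as null, in which case the JVM uses its built-in default.

```java
import java.security.Security;

public class DnsCacheSettings {
    // Returns the raw security-property value, or "(unset)" when the key is
    // absent and the JVM falls back to its built-in default.
    static String read(String key) {
        String value = Security.getProperty(key);
        return value == null ? "(unset)" : value;
    }

    // One-line summary of both DNS cache TTL settings.
    static String describe() {
        return "networkaddress.cache.ttl=" + read("networkaddress.cache.ttl")
             + ", networkaddress.cache.negative.ttl=" + read("networkaddress.cache.negative.ttl");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```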

Fix

Set asymmetric DNS-cache TTLs via Security.setProperty() at the top of VCellClientMain.main(), before any networking happens in the JVM:

java.security.Security.setProperty("networkaddress.cache.ttl", "30");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0");
  • networkaddress.cache.ttl=30: a modest positive cache so the polling loop doesn't hammer DNS, while stale entries don't linger past 30s. This matches the OpenJDK default when no SecurityManager is installed (we don't install one), but setting it explicitly is robust against future config changes.
  • networkaddress.cache.negative.ttl=0 — never cache failures. The cost of one extra DNS query per failed poll is trivial; the cost of the wedge is several minutes of user confusion + a restart.

Asymmetric TTL is the established pattern for long-running JVMs that need to recover from transient DNS issues. AWS SDK (guide), Elasticsearch operations docs, and the Oracle networking-properties reference all recommend this shape.

Why programmatic, not -D flags or install4j config

These are security properties, not system properties, so -Dnetworkaddress.cache.ttl=30 has no effect. (Confirmed via Oracle JDK docs and OpenJDK source.) The legacy fallback -Dsun.net.inetaddr.ttl=30 is an implementation-specific system property that the JDK documentation warns may not be supported in future releases, so we don't rely on it.
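
The separation between the two property namespaces can be checked with a small sketch like the one below. It is illustrative only: the class name PropertyNamespaces and the sentinel value are hypothetical. It sets the key as a system property (which is what a -D flag does) and then checks whether Security.getProperty() sees that value.

```java
import java.security.Security;

public class PropertyNamespaces {
    static final String KEY = "networkaddress.cache.ttl";

    // Sets KEY as a *system* property and reports whether the *security*
    // property of the same name picked up that value.
    static boolean systemPropertyLeaksIntoSecurity(String sentinel) {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, sentinel);
        try {
            return sentinel.equals(Security.getProperty(KEY));
        } finally {
            // Restore the system property so the check has no side effects.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("system property visible as security property: "
                + systemPropertyLeaksIntoSecurity("12345"));
    }
}
```

On a standard JDK the check should report that the value is not visible, which is why the PR sets the security properties programmatically.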

Security.setProperty() in main() is the only reliable mechanism that:

  • Works identically across Java 8/11/17/21.
  • Is visible to anyone reading the code.
  • Doesn't require adding a custom java.security overlay file to the install4j build.

Critical timing constraint: the cache initializes lazily and reads Security properties on first DNS lookup. Verified that VCellClientMain.main() (lines 82-91) does no networking before commandLine.execute(args), so setting at the top of main() is safe with margin.
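
The ordering requirement can be sketched as follows. This is a simplified stand-in, not the actual VCellClientMain: the class ClientMainSketch and the methods configureDnsCache() and runClient() are hypothetical names, with runClient() playing the role of commandLine.execute(args).

```java
import java.security.Security;

public class ClientMainSketch {
    // Applies the asymmetric TTLs. Must run before the first DNS lookup,
    // because the JVM reads these properties when the cache policy is
    // first initialized.
    static void configureDnsCache() {
        Security.setProperty("networkaddress.cache.ttl", "30");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    // Hypothetical placeholder for the real client startup, which is where
    // networking (and therefore the first DNS lookup) would begin.
    static int runClient(String[] args) {
        return 0;
    }

    public static void main(String[] args) {
        configureDnsCache();                 // first statement: no networking yet
        int exitCode = runClient(args);      // networking starts only after this point
        System.out.println("exit code: " + exitCode);
    }
}
```

Keeping the property setup in its own method at the very top of main() makes it harder for a later change to slip a network call in ahead of it.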

Test plan

  • mvn compile -pl vcell-client -am passes
  • Reproduce the original failure pattern (optional, manual):
    • Run the client from IDE.
    • In another terminal, sudo killall -HUP mDNSResponder while the client is polling — simulates a brief DNS hiccup.
    • Confirm the client recovers within ~1 polling interval after DNS comes back, instead of getting stuck.
  • Ship in next desktop release. No server change needed; the fix is purely client-side.
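
For the manual reproduction step, a standalone probe along these lines can make the cache behaviour visible without running the full client. It is a sketch under assumptions: the class name DnsProbe, the 2-second interval, the attempt count, and the optional third argument are all hypothetical and chosen only for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.time.LocalTime;

public class DnsProbe {
    // Resolves the host once through the JVM's InetAddress cache and returns
    // a one-line result instead of throwing.
    static String probeOnce(String host) {
        try {
            InetAddress address = InetAddress.getByName(host);
            return "OK " + host + " -> " + address.getHostAddress();
        } catch (UnknownHostException e) {
            return "FAIL " + host + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String host = args.length > 0 ? args[0] : "vcell-dev.cam.uchc.edu";
        int attempts = args.length > 1 ? Integer.parseInt(args[1]) : 30;
        boolean applyFix = args.length > 2 && Boolean.parseBoolean(args[2]);

        if (applyFix) {
            // Same settings as the PR; must be set before the first lookup.
            Security.setProperty("networkaddress.cache.ttl", "30");
            Security.setProperty("networkaddress.cache.negative.ttl", "0");
        }

        for (int i = 0; i < attempts; i++) {
            System.out.println(LocalTime.now() + " " + probeOnce(host));
            Thread.sleep(2000); // assumed interval, shorter than the 10s default negative TTL
        }
    }
}
```

Running it once with the third argument set to false and once with true, while triggering the DNS hiccup in another terminal, gives a side-by-side view of how quickly each configuration recovers.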

Out of scope (filed mentally as future work)

  • Java 21+ networkaddress.cache.stale.ttl (JDK-8306653, "stale-data reuse") — different problem; revisit when desktop moves to a newer JVM.
  • HTTP-client connection pooling that may also hold stale state across sleep — orthogonal symptom requiring different mitigation.
  • Server-side variants (networkaddress.cache.* settings for vcell-data, vcell-rest, etc.). The server pods don't sleep, so they're less exposed; the OOM-driven failure modes are already addressed by "Exit JVM and dump heap on OutOfMemoryError for server containers" (#1683).

🤖 Generated with Claude Code

When a laptop wakes from sleep, transient DNS resolution failures get
cached by Java's InetAddress cache. Our continuous polling loop then
hits the cached failure faster than the 10s default negative TTL can
expire, wedging the client at "connecting…" until the user restarts
it. The OS resolver is fine — `host vcell-dev.cam.uchc.edu` returns
the correct IP — but the JVM never re-queries it.

Set Security properties at the top of main(), before any DNS lookup:
- networkaddress.cache.ttl=30 (modest positive cache, matches the
  non-SecurityManager default but makes it explicit and stable)
- networkaddress.cache.negative.ttl=0 (never cache failures; recover
  immediately when DNS comes back)

Asymmetric TTL is the established pattern for long-running JVMs that
need to recover from transient DNS issues — see AWS SDK and Oracle
networking-properties guidance. Programmatic Security.setProperty is
the only reliable mechanism: -D JVM args don't bridge to security
properties, and -Dsun.net.inetaddr.ttl is undocumented in modern Java.

Observed in user log:
  >> polling failure << vcell-dev.cam.uchc.edu: nodename nor servname
  provided, or not known
  (repeating, after laptop sleep/wake)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff merged commit ffb6f99 into master on May 7, 2026
13 checks passed
jcschaff deleted the client-dns-negative-cache branch on May 7, 2026 15:10