Don't cache failed DNS lookups in the desktop client #1684
Merged
When a laptop wakes from sleep, transient DNS resolution failures get cached by Java's InetAddress cache. Our continuous polling loop then hits the cached failure faster than the 10s default negative TTL can expire, wedging the client at "connecting…" until the user restarts it. The OS resolver is fine (`host vcell-dev.cam.uchc.edu` returns the correct IP), but the JVM never re-queries it.

Set Security properties at the top of main(), before any DNS lookup:

- networkaddress.cache.ttl=30 (modest positive cache, matches the non-SecurityManager default but makes it explicit and stable)
- networkaddress.cache.negative.ttl=0 (never cache failures; recover immediately when DNS comes back)

Asymmetric TTL is the established pattern for long-running JVMs that need to recover from transient DNS issues; see AWS SDK and Oracle networking-properties guidance.

Programmatic Security.setProperty is the only reliable mechanism: -D JVM args don't bridge to security properties, and -Dsun.net.inetaddr.ttl is undocumented in modern Java.

Observed in user log:

    >> polling failure << vcell-dev.cam.uchc.edu: nodename nor servname provided, or not known
    (repeating, after laptop sleep/wake)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
When a laptop wakes from sleep, the macOS resolver briefly returns `EAI_NONAME` for VCell internal hostnames while network state catches up. Java's `InetAddress` cache picks up the negative result. Our `AsynchMessageManager` polling loop hits the cache faster than the 10-second default negative TTL can expire, so the cached failure never clears and the client wedges at "connecting…" indefinitely. Restarting the client is the only workaround.

The failure was observed in `~/.vcell/logs/vcellrun_alpha.log`.

The OS resolver itself is healthy at the time: `host vcell-dev.cam.uchc.edu` returns the correct IP. The cache is purely inside the JVM.

Fix
Set asymmetric DNS-cache TTLs via `Security.setProperty()` at the top of `VCellClientMain.main()`, before any networking happens in the JVM:

- `networkaddress.cache.ttl=30`: modest positive cache so the polling loop doesn't hammer DNS, but stale entries don't linger past 30s. Matches the default for non-SecurityManager JVMs (we don't install one, but stating it explicitly is robust against future config changes).
- `networkaddress.cache.negative.ttl=0`: never cache failures. The cost of one extra DNS query per failed poll is trivial; the cost of the wedge is several minutes of user confusion plus a restart.

Asymmetric TTL is the established pattern for long-running JVMs that need to recover from transient DNS issues. AWS SDK (guide), Elasticsearch operations docs, and the Oracle networking-properties reference all recommend this shape.
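For reviewers who want to see the shape of the change without opening the diff, here is a minimal standalone sketch of the two property settings. The class name `DnsCacheTtlSketch` and the helper `configureDnsCache()` are hypothetical and exist only for illustration; the actual change lives in `VCellClientMain.main()`, whose body is not reproduced here.

```java
import java.security.Security;

// Hypothetical standalone sketch; not the actual VCellClientMain code.
class DnsCacheTtlSketch {

    // Applies the asymmetric TTLs described above.
    // Assumption: this runs before the JVM performs its first DNS lookup,
    // because the InetAddress cache policy is read lazily at that point.
    static void configureDnsCache() {
        // Cache successful lookups for 30 seconds.
        Security.setProperty("networkaddress.cache.ttl", "30");
        // Never cache failed lookups, so recovery is immediate once DNS is back.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    public static void main(String[] args) {
        configureDnsCache(); // first statement, before any networking

        // Read the values back to confirm they are set as security properties.
        System.out.println("networkaddress.cache.ttl="
                + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("networkaddress.cache.negative.ttl="
                + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

Keeping both calls in one helper that is invoked as the first statement of `main()` makes the ordering requirement easy to spot in review.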
Why programmatic, not `-D` flags or install4j config

These are security properties, not system properties: `-Dnetworkaddress.cache.ttl=30` does not work. (Confirmed via Oracle JDK docs and OpenJDK source.) The legacy fallback `-Dsun.net.inetaddr.ttl=30` works on older JDKs but is undocumented/unstable in modern Java. `Security.setProperty()` in `main()` is the only reliable mechanism that:

- `java.security` overlay file to the install4j build.

Critical timing constraint: the cache initializes lazily and reads `Security` properties on first DNS lookup. Verified that `VCellClientMain.main()` (lines 82-91) does no networking before `commandLine.execute(args)`, so setting at the top of `main()` is safe with margin.

Test plan
- `mvn compile -pl vcell-client -am` passes
- `sudo killall -HUP mDNSResponder` while the client is polling: simulates a brief DNS hiccup.

Out of scope (filed mentally as future work)
- `networkaddress.cache.stale.ttl` (JDK-8306653, "stale-data reuse"): different problem; revisit when desktop moves to a newer JVM.
- `-Dnetworkaddress.cache.*` for vcell-data, vcell-rest, etc.). The server pods don't sleep so they're less exposed; the OOM-driven failure modes are already addressed by Exit JVM and dump heap on OutOfMemoryError for server containers #1683.

🤖 Generated with Claude Code