Skip to content

fix(localdns): reduce startup polling interval#8534

Open
jingwenw15 wants to merge 10 commits into
mainfrom
jingwenwu/localdns-sleep-interval-only
Open

fix(localdns): reduce startup polling interval#8534
jingwenw15 wants to merge 10 commits into
mainfrom
jingwenwu/localdns-sleep-interval-only

Conversation

@jingwenw15
Copy link
Copy Markdown
Member

@jingwenw15 jingwenw15 commented May 19, 2026

Summary

  • reduce localdns PID-file polling from 1s to 0.1s while preserving the existing timeout semantics
  • reduce localdns readiness polling from 1s to 0.1s and scale the production call-site attempts accordingly
  • add focused ShellSpec coverage for the 0.1-second polling interval in both helper loops

Testing

  • shellspec --shell bash spec/parts/linux/cloud-init/artifacts/localdns_spec.sh:668
  • shellspec --shell bash spec/parts/linux/cloud-init/artifacts/localdns_spec.sh:733
Screenshot 2026-05-20 at 10 17 15 AM Reduces startup time by around 0.6s on average.

Copilot AI review requested due to automatic review settings May 19, 2026 01:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces localdns startup polling intervals from 1s to 0.1s so the node bootstrap path waits less when CoreDNS comes up quickly, while preserving the existing wall-clock timeout semantics. The PID-file wait now counts in tenths of a second (10× the timeout), and the readiness call-site bumps maxattempts from 60 to 600 to match the 60s timeout at the finer interval.

Changes:

  • Introduce LOCALDNS_POLL_INTERVAL_SECONDS=0.1 and rework start_localdns to use tenth-second counters so START_LOCALDNS_TIMEOUT=10 still yields a ~10s ceiling.
  • Switch wait_for_localdns_ready's sleep to 0.1s and update the production call-site to wait_for_localdns_ready 600 60.
  • Add ShellSpec tests asserting both loops sleep with 0.1.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
parts/linux/cloud-init/artifacts/localdns.sh Faster polling in start_localdns/wait_for_localdns_ready plus updated max-attempt arg at the call-site.
spec/parts/linux/cloud-init/artifacts/localdns_spec.sh New ShellSpec cases asserting the 0.1s sleep interval in both helper loops.

@jingwenw15 jingwenw15 force-pushed the jingwenwu/localdns-sleep-interval-only branch from 2a76993 to ca945a9 Compare May 19, 2026 23:57
@yewmsft
Copy link
Copy Markdown
Member

yewmsft commented May 20, 2026

Two things on the polling-interval drop:

  1. Hardcoded constant in a shared file. LOCALDNS_POLL_INTERVAL_SECONDS=0.1 sits at the top of localdns.sh and is used by both start_localdns and wait_for_localdns_ready. If anyone later tunes one site, they'll silently retune the other. Consider giving each call site its own constant (LOCALDNS_PID_POLL_INTERVAL, LOCALDNS_READY_POLL_INTERVAL) so future changes are explicit about scope.

  2. wait_for_localdns_ready 600 60 is now coupled to the interval. The first arg is MAX_ATTEMPTS, not seconds — at 0.1s/attempt × 600 attempts you get the same 60s wall clock the old 60 60 invocation produced, but only by coincidence. If LOCALDNS_POLL_INTERVAL_SECONDS changes again, this call site silently breaks the timeout. Either compute attempts from a wall-clock budget (attempts = budget_seconds / interval) or rename the arg / refactor to take a duration.

Spec tests look good for verifying the interval value gets passed; consider adding one that asserts the resulting wall-clock timeout (mocked clock) so a future regression of the interval/attempts coupling fails the test rather than slipping through.

Copilot AI review requested due to automatic review settings May 20, 2026 23:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread parts/linux/cloud-init/artifacts/localdns.sh
@jingwenw15
Copy link
Copy Markdown
Member Author

@yewmsft Addressed in the follow-up commits on this branch. The polling constants are now split by purpose (LOCALDNS_PID_POLL_INTERVAL_SECONDS and LOCALDNS_READY_POLL_INTERVAL_SECONDS), readiness now takes a real duration via LOCALDNS_READY_TIMEOUT_SECONDS instead of a hand-derived 600 attempt count, and the ShellSpec coverage now includes mocked-clock timeout behavior. I also added a derived max-attempt bound for readiness so the loop still terminates if the system clock does not advance as expected during early boot.

Copilot AI review requested due to automatic review settings May 21, 2026 02:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/localdns.sh:479

  • This error message says the readiness poll interval is invalid, but calculate_max_poll_attempts can also fail due to an invalid timeout_duration value (or missing awk). Consider broadening the wording and/or logging the timeout + interval values to aid diagnosis.
    max_attempts=$(calculate_max_poll_attempts "${timeout_duration}" "${LOCALDNS_READY_POLL_INTERVAL_SECONDS}") || {
        echo "Invalid localdns readiness poll interval configuration."
        return 1

Comment thread parts/linux/cloud-init/artifacts/localdns.sh
Comment thread parts/linux/cloud-init/artifacts/localdns.sh Outdated
Copilot AI review requested due to automatic review settings May 21, 2026 17:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread parts/linux/cloud-init/artifacts/localdns.sh
Copilot AI review requested due to automatic review settings May 21, 2026 23:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@yewmsft
Copy link
Copy Markdown
Member

yewmsft commented May 26, 2026

All three asks addressed — confirmed on ac56c7ad:

  1. Per-site interval constants splitLOCALDNS_PID_POLL_INTERVAL_SECONDS (start-up) and LOCALDNS_READY_POLL_INTERVAL_SECONDS (readiness) at L64-65 are now independent. ✓
  2. calculate_max_poll_attempts derives attempts from a wall-clock budget (L72-95) — call sites pass the duration, not the count. Also nice: explicit non-numeric validation in awk (timeout !~ /^[0-9]+$/, interval !~ /^[0-9]+([.][0-9]+)?$/) so a bad config fails loudly instead of silently coercing to 0. ✓
  3. Caller error path in start_localdns and wait_for_localdns_ready now reports both the duration and interval on failure (L448-465, L473-508). ✓

One nit (non-blocking): wait_for_localdns_ready now has both an elapsedtime check and a max_attempts guard inside the until loop. Belt-and-suspenders is fine — the elapsedtime path is the wall-clock invariant, max_attempts catches a stuck date +%s during early boot — but a one-line comment near L490-ish explaining why both are intentional would save the next person from "simplifying" by deleting one of them.

No vote; comments only.

Copilot AI review requested due to automatic review settings May 27, 2026 00:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@yewmsft
Copy link
Copy Markdown
Member

yewmsft commented May 27, 2026

nit addressed on 40bacba2wait_for_localdns_ready now carries an inline # Keep both guards: elapsed time is the real wall-clock timeout, while max_attempts guarantees termination if date +%s stalls or does not advance as expected. right above the dual-guard block. that's the right framing — it makes the elapsedtime guard the primary invariant and max_attempts the failsafe for a stuck monotonic clock, so the next person reading this won't "simplify" by deleting one. no further comments from me on this PR.

Copy link
Copy Markdown
Member

@yewmsft yewmsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — logic matches the description; previous feedback (per-site interval constants, budget-derived attempts, dual-guard comment) all addressed.

@yewmsft
Copy link
Copy Markdown
Member

yewmsft commented May 28, 2026

stop re-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants