Skip to content

fix(nico-dhcp): dual-name kea hook params + require operator IPs#2270

Open
shayan1995 wants to merge 2 commits into
NVIDIA:mainfrom
shayan1995:fix/kea-hook-dual-naming
Open

fix(nico-dhcp): dual-name kea hook params + require operator IPs#2270
shayan1995 wants to merge 2 commits into
NVIDIA:mainfrom
shayan1995:fix/kea-hook-dual-naming

Conversation

@shayan1995
Copy link
Copy Markdown
Contributor

Description

The kea DHCP hook library (C++ in crates/dhcp/src/kea/loader.cc) reads parameters by their pre-rename names — carbide-api-url, carbide-nameservers, carbide-ntpserver, carbide-provisioning-server-ipv4, carbide-metrics-endpoint.

Commit a91aea0 renamed every key the chart writes to the nico-* form without rebuilding the binary. The result on any cluster running the modern chart with a pre-rename DHCP hook binary: getParameter returns null, the binary falls back to its hardcoded localhost defaults (api at [::1]:1079, nameservers/ntpserver at 127.0.0.1, etc.), every hook callout fails with "tcp connect error: ConnectionRefused".

Symptom in the nico-dhcp pod log:
ERROR rpc::forge_tls_client - error connecting client to forge api
(url: https://[::1]:1079)
INFO forge_http_connector::connector - connect error for [::1]:1079:
ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused })

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)
  • With site values set (helm-prereqs/values/nico-core.yaml) — both nico-* and carbide-* keys render with the real VIPs, identical values.
  • Without site values — helm template fails with the operator-facing error message; helm lint flags the same warnings.

Additional Notes

@shayan1995 shayan1995 requested a review from a team as a code owner June 5, 2026 21:27
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bbad3135-8395-43a3-8d20-c1e6abee0dba

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

The kea DHCP hook library (C++ in crates/dhcp/src/kea/loader.cc) reads
parameters by their pre-rename names — carbide-api-url, carbide-nameservers,
carbide-ntpserver, carbide-provisioning-server-ipv4, carbide-metrics-endpoint.

Commit a91aea0 renamed every key the chart writes to the nico-* form
without rebuilding the binary. The result on any cluster running the
modern chart with a pre-rename DHCP hook binary: getParameter returns
null, the binary falls back to its hardcoded localhost defaults (api at
[::1]:1079, nameservers/ntpserver at 127.0.0.1, etc.), every hook callout
fails with "tcp connect error: ConnectionRefused".

Symptom in the nico-dhcp pod log:
  ERROR rpc::forge_tls_client - error connecting client to forge api
    (url: https://[::1]:1079)
  INFO  forge_http_connector::connector - connect error for [::1]:1079:
    ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused })

Fix mirrors the dual-name pattern from PR NVIDIA#2249 (cert SANs and
ExternalName Service aliases): the chart writes both spellings of every
hook parameter with identical values. Old binaries find their carbide-*
keys; rebuilt binaries find their nico-* keys; both pick up the
operator-supplied value rather than localhost defaults.

Drop the carbide-* mirror once every consuming binary has been rebuilt
to read nico-*. A template-side Go comment documents this.

The chart's `nameservers`, `ntpServer`, `provisioningServer` defaults are
also changed from "127.0.0.1" placeholders to obvious
"REPLACE_WITH_..." strings. The old 127.0.0.1 fallback was syntactically
valid IP that produced a silently-non-functional cluster when operators
forgot to override; the new placeholders are loud at kea startup so a
missed override is obvious. The helm-prereqs example site values file
already overrides all three with real example VIPs, so a vanilla
setup.sh run is unaffected.
@shayan1995 shayan1995 force-pushed the fix/kea-hook-dual-naming branch from ade8660 to 3bd84c3 Compare June 5, 2026 22:07
…_mode

The `model::site_explorer::EndpointExplorationReport` struct gained a
required `remediation_error` field upstream, and every other test
construction site in the repo (host_bmc_firmware_test.rs,
site_explorer.rs at 9 sites, etc.) was updated to set it. The
`host_bmc_report()` helper in preingestion_dpu_nic_mode.rs was missed,
leaving every CI build that includes this test failing with:

  error[E0063]: missing field `remediation_error` in initializer of
                `model::site_explorer::EndpointExplorationReport`
    --> crates/api-core/src/tests/preingestion_dpu_nic_mode.rs:53:5

This is a pre-existing upstream issue (still present on main as of the
rebase that produced this branch). Fixing it here as a drive-by so the
chart change in the preceding commit can land — CI is otherwise blocked
on the same break for any branch built from current main.

Sets remediation_error to None to match the convention used by every
other call site.
@shayan1995 shayan1995 requested a review from a team as a code owner June 5, 2026 22:52
tarequeh

This comment was marked as duplicate.

Copy link
Copy Markdown
Contributor

@thossain-nv thossain-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @shayan1995

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants