Skip to content

controller: post-reboot handling and deployedDigest#58

Merged
alicefr merged 4 commits into
mainfrom
milestone-3b-c7
Jun 12, 2026
Merged

controller: post-reboot handling and deployedDigest#58
alicefr merged 4 commits into
mainfrom
milestone-3b-c7

Conversation

@jlebon

@jlebon jlebon commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Add the key missing part in the rollout logic: we now actually free reboot slots once nodes reboot into the desired image and are Ready. This allows the state machine to continue rolling out to other nodes. We also keep the deployedDigest and updateAvailable fields up to date in the pool status.

With this I think we've more or less reached MVP level for the controller part at least. 🎉 Most of the remaining work is more about richer reporting back to the pool status and strengthening error-handling.

jlebon added 2 commits June 10, 2026 11:13
Currently, nodeStateIdle represents three things:
  1. the node is running the desired image and there's nothing to do
  2. the daemon hasn't checked in at all yet (new node)
  3. the daemon hasn't reacted to a desiredImage change yet

The "nothing to do" case is clearly important to distinguish from the
others. The name 'nodeStateIdle' also is confusing.

With this change, we replace nodeStateIdle with two states:
  1. A new nodeStateUpToDate for case 1 above
  2. A new nodeStatePending for cases 2 and 3 above

Assisted-by: Pi (Claude Opus 4.6)
Might be just me, but I find it confusing that we're testing that nodes
start on B and go to A. It's more natural to expect it the other way
around.

Also the 'booted' and 'target' names weren't clear enough. Switch to
explicit 'old' and 'new' wording.
Comment thread internal/controller/rollout_envtest_test.go
Comment thread internal/controller/rollout_envtest_test.go Outdated
@alicefr

alicefr commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

One question: deployedDigest is never set if any node is degraded or unclassified, even if the rollout itself completed successfully. Is that what we want? Is it a stuck node avoid the rollout to be completed? I guess this follows the same principle as a stuck node holds its reboot slot

jlebon added 2 commits June 11, 2026 15:32
This is the crucial next step for the controller: we actually handle
nodes rebooting into the desired image and free the reboot slot so
rollout can continue on to other nodes.

Assisted-by: Pi (Claude Opus 4.6)
Ouch, this is a pretty obvious bug. The effect wasn't disastrous (we
just waste one reconcile), but still...

Thankfully new tests I'm working on caught it.

Assisted-by: Pi (Claude Opus 4.6)
@jlebon jlebon force-pushed the milestone-3b-c7 branch from d5dc006 to 68bfd5b Compare June 11, 2026 20:54
@jlebon

jlebon commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

One question: deployedDigest is never set if any node is degraded or unclassified, even if the rollout itself completed successfully. Is that what we want? Is it a stuck node avoid the rollout to be completed? I guess this follows the same principle as a stuck node holds its reboot slot

Yeah, thanks for raising this. deployedDigest in general I'm not quite satisfied with to be frank, and I think we should probably drop it. I filed #62 for that and for now dropped that commit from this PR.

@alicefr alicefr merged commit 55db7a1 into main Jun 12, 2026
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants