controller: post-reboot handling and deployedDigest#58
Conversation
Currently, nodeStateIdle represents three things: 1. the node is running the desired image and there's nothing to do 2. the daemon hasn't checked in at all yet (new node) 3. the daemon hasn't reacted to a desiredImage change yet The "nothing to do" case is clearly important to distinguish from the others. The name 'nodeStateIdle' also is confusing. With this change, we replace nodeStateIdle with two states: 1. A new nodeStateUpToDate for case 1 above 2. A new nodeStatePending for cases 2 and 3 above Assisted-by: Pi (Claude Opus 4.6)
Might be just me, but I find it confusing that we're testing that nodes start on B and go to A. It's more natural to expect it the other way around. Also the 'booted' and 'target' names weren't clear enough. Switch to explicit 'old' and 'new' wording.
|
One question: deployedDigest is never set if any node is degraded or unclassified, even if the rollout itself completed successfully. Is that what we want? Is it a stuck node avoid the rollout to be completed? I guess this follows the same principle as a stuck node holds its reboot slot |
This is the crucial next step for the controller: we actually handle nodes rebooting into the desired image and free the reboot slot so rollout can continue on to other nodes. Assisted-by: Pi (Claude Opus 4.6)
Ouch, this is a pretty obvious bug. The effect wasn't disastrous (we just waste one reconcile), but still... Thankfully new tests I'm working on caught it. Assisted-by: Pi (Claude Opus 4.6)
Yeah, thanks for raising this. deployedDigest in general I'm not quite satisfied with to be frank, and I think we should probably drop it. I filed #62 for that and for now dropped that commit from this PR. |
Add the key missing part in the rollout logic: we now actually free reboot slots once nodes reboot into the desired image and are Ready. This allows the state machine to continue rolling out to other nodes. We also keep the
deployedDigestandupdateAvailablefields up to date in the pool status.With this I think we've more or less reached MVP level for the controller part at least. 🎉 Most of the remaining work is more about richer reporting back to the pool status and strengthening error-handling.