@whitslack
Collaborator

BOLT 2 says:

A node, […] upon reconnection, if a channel is [not] in an error state, […] MUST wait to receive the other node's channel_reestablish message before sending any other messages for that channel.

We can abuse this requirement to implement a graceful shutdown procedure:

  1. Set a flag that precludes lightningd from sending channel_reestablish messages for any channels that have exactly zero outstanding HTLCs.
  2. Disconnect all peers that have exactly zero outstanding HTLCs in all of their channels with this node.
  3. If no channels are "reestablished" and no channels have any outstanding HTLCs, then it is now impossible for any peers to add any new HTLCs to our channels, so we can safely shut down.
  4. Otherwise, wait for some outstanding HTLC to settle, and then return to step 2.
  5. If graceful shutdown is taking too long, then report to the user the approximate time until an outstanding HTLC next expires, and abort.
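
The control flow of the procedure above can be sketched as a small simulation. This is only an illustrative model, not the code in this PR: the `Channel` and `Peer` classes, the `settle_one` callback, and the `max_rounds` limit are all hypothetical stand-ins for lightningd state, for waiting on an HTLC to settle, and for the user-specified timeout.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    htlcs: int                  # number of outstanding HTLCs (hypothetical model)
    reestablished: bool = True  # whether the channel is currently reestablished


@dataclass
class Peer:
    channels: list


def graceful_stop(peers, settle_one, max_rounds):
    """Return True once it is safe to stop, False if we give up (step 5).

    Step 1 (the snub flag) is assumed to be in effect already: once an idle
    channel is disconnected here, it is never reestablished again.
    """
    for _ in range(max_rounds):
        # Step 2: disconnect every peer whose channels are all idle.
        for peer in peers:
            if all(ch.htlcs == 0 for ch in peer.channels):
                for ch in peer.channels:
                    ch.reestablished = False

        # Step 3: nothing reestablished and nothing outstanding means
        # no peer can add a new HTLC, so shutting down is safe.
        channels = [ch for peer in peers for ch in peer.channels]
        if not any(ch.reestablished for ch in channels) and not any(
            ch.htlcs for ch in channels
        ):
            return True

        # Step 4: wait for some HTLC to settle; the callback returns False
        # if nothing settled, which we treat as a reason to abort.
        if not settle_one(peers):
            break

    # Step 5: taking too long; the caller should report the next expiry.
    return False
```

In this model a peer with one idle channel and one busy channel stays connected until the busy channel drains, which matches step 2's requirement that all of a peer's channels be idle before it is disconnected.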

This PR has two objectives:

  • Add a snub-idle-channels dynamic config variable that, when set to true, makes lightningd:
    • no longer spawn channeld subdaemons for channels that have no outstanding HTLCs;
    • as an optimization, no longer attempt to auto-reconnect to peers with whom we have no outstanding HTLCs;
    • ignore all received channel_reestablish messages for channels that have no outstanding HTLCs,
      • …but send a warning to the peer informing them that their channel reestablishment is being ignored because this node will be shutting down imminently.
  • Add a contrib/lightning-graceful-stop.sh script that utilizes snub-idle-channels to implement the graceful shutdown procedure outlined above.
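
The per-channel and per-peer decisions that snub-idle-channels introduces can be summarized as two small predicates. This is a hedged sketch of the policy as described in the list above, not lightningd's actual C code; the function names and the `Action` enum are invented for illustration.

```python
from enum import Enum


class Action(Enum):
    REESTABLISH = "reestablish"              # spawn channeld and proceed as usual
    SNUB_WITH_WARNING = "snub_with_warning"  # ignore, but send the peer a warning


def on_channel_reestablish(snub_idle_channels: bool, outstanding_htlcs: int) -> Action:
    """Decide how to treat a received channel_reestablish for one channel."""
    if snub_idle_channels and outstanding_htlcs == 0:
        return Action.SNUB_WITH_WARNING
    return Action.REESTABLISH


def should_auto_reconnect(snub_idle_channels: bool, htlcs_per_channel) -> bool:
    """Decide whether to auto-reconnect to a peer.

    htlcs_per_channel holds the outstanding HTLC count of each channel
    we have with that peer.
    """
    if not snub_idle_channels:
        return True
    return any(count > 0 for count in htlcs_per_channel)
```

With the flag off, both predicates fall back to the existing behavior, so the option only changes how idle channels are handled.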

I have tested this graceful shutdown procedure on my own production node with great success. In under a minute my node dropped from over 30 outstanding HTLCs to 14, all of which were "stuck." The shutdown script reported that the next expiration was 140 blocks away, giving me plenty of time to power off my node and perform a hardware upgrade. If I had been willing to wait for all of my outstanding HTLCs to be resolved, then I could have stopped my node indefinitely with no danger of any forced unilateral closures. (Of course, my peers could still voluntarily choose to unilaterally close my channels with them if they grew tired of waiting for my node to reappear in the network, but that's not the concern that graceful shutdown is attempting to address.)
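
To put a figure like "140 blocks away" into wall-clock terms, a rough estimate can be derived from Bitcoin's 10-minute target block interval. The helper below is a hypothetical sketch, not taken from the script in this PR, and the block heights in the usage lines are made up. Real block times vary, so the result is only approximate.

```python
from datetime import timedelta

# Bitcoin's target block interval; actual intervals vary considerably.
AVG_BLOCK_INTERVAL = timedelta(minutes=10)


def time_until_next_expiry(current_height, htlc_expiries):
    """Return (blocks_left, approximate_duration) for the soonest-expiring HTLC.

    Returns None when there are no outstanding HTLCs.
    """
    if not htlc_expiries:
        return None
    blocks_left = max(min(htlc_expiries) - current_height, 0)
    return blocks_left, blocks_left * AVG_BLOCK_INTERVAL


# Hypothetical heights: the soonest HTLC expires 140 blocks from now.
blocks, approx = time_until_next_expiry(800_000, [800_140, 800_500])
print(blocks, approx)  # 140 blocks, roughly 23 hours 20 minutes
```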

Note that there is still one edge case that this graceful shutdown strategy doesn't solve. If a peer has transmitted a new commitment containing a new HTLC, but we never transmitted our own new commitment containing that same new HTLC (either because we never received the peer's new commitment or because we restarted before we could send our own new commitment), then we will not know about (or will have forgotten) the new HTLC, and we will believe that the channel is safe to snub even though the peer would retransmit their new commitment containing the new HTLC if we allowed them to reestablish the channel. I am not certain, but it may be possible to use the fields in the channel_reestablish message received from the peer to ascertain whether the peer has new HTLCs that they need to retransmit to us, and if they do, then we shouldn't snub the channel even if we are currently aware of no outstanding HTLCs in it.

Checklist

Before submitting the PR, ensure the following tasks are completed. If an item is not applicable to your PR, please mark it as checked:

  • The changelog has been updated in the relevant commit(s) according to the guidelines.
  • Tests have been added or modified to reflect the changes.
  • Documentation has been reviewed and updated as needed.
  • Related issues have been listed and linked, including any that this PR closes.
  • Important: All PRs must consider how to reverse any persistent changes for tools/lightning-downgrade.

When "snub-idle-channels" is set to true, lightningd will no longer
spawn channeld subdaemons for channels that have no outstanding HTLCs,
and it will cease trying to auto-reconnect to peers with whom we have no
outstanding HTLCs. Incoming channel_reestablish messages for these idle
channels will cause lightningd to reply to the peer with a warning
explaining that we are temporarily declining to reestablish the channel.
Since we do not send our own channel_reestablish, the peer is unable to
add any HTLCs to the channel (or make any other updates to the channel).

The reason we might want to do this is so we can halt a node gracefully
by progressively snubbing more and more channels as they become idle
until eventually we have no outstanding HTLCs whatsoever and also no
possibility of any new HTLCs being added. At that point, we can safely
take our node offline for an extended duration with no possibility that
any of our channels will be unilaterally closed due to HTLC deadlines
while we are offline.

Changelog-Added: New `snub-idle-channels` dynamic config variable makes CLN temporarily stop spawning channeld subdaemons for channels with no HTLCs, as a means to achieve a safe node shutdown.
Issue: ElementsProject#4842

This script utilizes the new "snub-idle-channels" knob to attempt to
stop a CLN node gracefully. The script sets the snub flag and then
starts forcibly disconnecting peers that have one or more reestablished
channels but no outstanding HTLCs. When both the number of reestablished
channels and the number of outstanding HTLCs reach zero, the script
stops the node. If this does not occur before a user-specified timeout,
then the script exits with an error and reports the block height and
approximate time until the next outstanding HTLC expires.

Changelog-Added: `contrib/lightning-graceful-stop.sh` attempts to stop a node without leaving any outstanding HTLCs.
Closes: ElementsProject#4842