resource manager trait and impl #4409
Conversation
👋 Thanks for assigning @carlaKC as a reviewer!
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #4409      +/-   ##
==========================================
+ Coverage   86.03%   86.29%   +0.26%
==========================================
  Files         156      161       +5
  Lines      103091   109078    +5987
  Branches   103091   109078    +5987
==========================================
+ Hits        88690    94134    +5444
- Misses      11891    12286     +395
- Partials     2510     2658     +148
```
carlaKC
left a comment
Really great job on this! I've done an overly-specific first review round for something that's still in draft because I've looked at previous versions of this code before, when we wrote simulations. I also haven't looked at the tests in detail yet, but coverage is looking ✨ great ✨.
I think tracking slot usage in GeneralBucket with a single source of truth is worth exploring; it seems like it could clean up a few places where we currently need two hashmap lookups one after the other.
In the interest of one day fuzzing this, I think it could also use some validation that enforces our protocol assumptions (e.g., number of slots <= 483).
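A minimal sketch of such a validation check (the 483 cap comes from BOLT 2's max_accepted_htlcs limit; the constant and function names here are illustrative, not from the PR):

```rust
// Hypothetical config check, not the PR's code: BOLT 2 limits
// max_accepted_htlcs to 483, so a slot allocation of zero or beyond that
// violates protocol assumptions and should be rejected before anything else.
const PROTOCOL_MAX_HTLC_SLOTS: u16 = 483;

fn validate_slot_config(slots_allocated: u16) -> Result<(), ()> {
    if slots_allocated == 0 || slots_allocated > PROTOCOL_MAX_HTLC_SLOTS {
        return Err(());
    }
    Ok(())
}
```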
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer?
I think I have addressed most of the comments code-wise. Still need to add some requested comments/docs changes.
pushed more fixups addressing requests for adding docs/comments, lmk if those look good |
lightning/src/ln/resource_manager.rs (Outdated)

```rust
/// Tracks the occupancy of HTLC slots in the bucket.
slots_occupied: Vec<bool>,

/// SCID -> (slots assigned, salt)
/// Maps short channel IDs to an array of tuples with the slots that the channel is allowed
/// to use and the current usage state for each slot. It also stores the salt used to
/// generate the slots for the channel. This is used to deterministically generate the
/// slots for each channel on restarts.
channels_slots: HashMap<u64, (Vec<(u16, bool)>, [u8; 32])>,
```
> this shouldn't accidentally double-assign them.
Yeah it shouldn't (provided we don't have bugs), but tracking the same information (whether a slot is occupied) in multiple places is a design that allows for inconsistency / the possibility of bugs. If we have a single source of truth, we move from "shouldn't double assign" to "can't double assign".
Gave it a shot here, lmk what you think!
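A rough sketch of what a single source of truth could look like (names and layout assumed, not the PR's final code): occupancy lives only in one vector, so a slot can never be marked free in one place and busy in another.

```rust
// Sketch: `slots` is the only record of which slot is in use; channels_slots
// stores only which slot indices a channel may use, with no duplicate
// usage bits to drift out of sync.
use std::collections::HashMap;

struct GeneralBucketSketch {
    // index = slot number, value = SCID currently occupying it (if any).
    slots: Vec<Option<u64>>,
    // SCID -> (candidate slot indices, salt).
    channels_slots: HashMap<u64, (Vec<u16>, [u8; 32])>,
}

impl GeneralBucketSketch {
    fn occupy_slot(&mut self, scid: u64) -> Option<u16> {
        let candidates = self.channels_slots.get(&scid)?.0.clone();
        for slot in candidates {
            let entry = &mut self.slots[usize::from(slot)];
            if entry.is_none() {
                *entry = Some(scid);
                return Some(slot);
            }
        }
        None // all of this channel's candidate slots are occupied
    }
}
```

With this layout, "can't double assign" falls out of the data structure rather than relying on the two maps staying consistent.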
First of all, not sure why all your commit messages are line-wrapped at 40 chars, but you can use like 60 or 70 lol.
TheBlueMatt
left a comment
A few comments, I think the design is fine, but startup resync may be annoying.
lightning/src/ln/resource_manager.rs (Outdated)

```rust
    }
}

/// Tracks an average value over multiple rolling windows to smooth out volatility.
```
I'm kinda confused by this struct. First of all, the docs here are wrong - we aren't tracking "multiple windows" we're tracking a rolling average over one window of window * window_count. The only difference between this and DecayingAverage is it tries to compensate for if we don't have enough data to actually go back window_count * window. Why shouldn't we just have DecayingAverage do that instead of having a separate struct here?
I think it makes sense to keep separate because the use of DecayingAverage for reputation differs from AggregatedWindowAverage when tracking revenue. For reputation, we want the DecayingAverage over the full window (24 weeks). For revenue, using AggregatedWindowAverage, we track the decaying average over the same window (24 weeks) but divide by window_count because we want the revenue for 2 weeks.
I agree that we want to track two different things here:
- Reputation (as DecayingAverage): we want shocks to reflect, so that we can quickly react to a change in attacker behavior.
- Revenue (as AggregatedWindowAverage): we want to smooth shocks to track our peer's average revenue in two weeks over window_count periods.
But ran some numbers and it does look like we're penalizing old data a bit too much with this approach, as mentioned below.
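To make the distinction concrete, a rough numeric sketch (assuming the value decays as 0.5^(elapsed / half_life) with the half-life equal to the tracked window, and the 24-week window / 2-week reporting period mentioned above, i.e. window_count = 12; all names here are illustrative):

```rust
// Reputation reacts to shocks over its full window; revenue is the same kind
// of decayed total, divided by window_count to report a per-window figure.
fn decayed(value: f64, elapsed_secs: f64, half_life_secs: f64) -> f64 {
    value * 0.5_f64.powf(elapsed_secs / half_life_secs)
}

fn main() {
    const WEEK: f64 = 604_800.0;
    // Reputation with a 2-week half-life, one week after the last update:
    let reputation = decayed(1_000.0, WEEK, 2.0 * WEEK);
    // Revenue decayed over the full 24-week window, then averaged down to a
    // single 2-week window (window_count = 12):
    let revenue_per_window = decayed(120_000.0, WEEK, 24.0 * WEEK) / 12.0;
    println!("reputation {reputation:.1}, per-window revenue {revenue_per_window:.1}");
}
```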
```rust
struct DecayingAverage {
    value: i64,
    last_updated_unix_secs: u64,
    window: Duration,
```
You don't actually use window (only decay_rate) so we can drop it here.
yeah, I was keeping window for the decay_rate when reading back here https://github.com/elnosh/rust-lightning/blob/90943195bee498f34247f65a68a8511d57997aae/lightning/src/ln/resource_manager.rs#L1042-L1045
Seems okay to me to just write the decay_rate directly. We'd only need the window if we wanted to change the way that we calculate it, and that seems unlikely?
lightning/src/ln/resource_manager.rs (Outdated)

```rust
// We are not concerned with the rounding precision loss for this value because it is
// negligible when dealing with a long rolling average.
Ok((self.aggregated_revenue_decaying.value_at_timestamp(timestamp_unix_secs)? as f64
    / window_divisor)
```
I don't buy this? Let's say our windows_tracked is 4 and we have some data for the last 3 windows. On average, those 3 windows' worth of data will have been multiplied by 0.62175 (https://www.wolframalpha.com/input?i=%28integral+from+0+to+3+%280.5+%5E+0.5%29+%5E+x%29+%2F+3) but then we divide it by three. Whereas if we only have data for a single window, that data will be multiplied by, on average, 0.845111 (https://www.wolframalpha.com/input?i=%28integral+from+0+to+1+%280.5+%5E+0.5%29+%5E+x%29+%2F+1), and then we'll divide by one. We have to factor in the decrease in the data from the decay as well as just the increased amount of data here.
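The two WolframAlpha factors above can be reproduced in closed form. With a per-window decay rate r = 0.5^0.5, the average factor applied to data spread uniformly over the last n windows is (1/n) · ∫₀ⁿ rˣ dx = (1 − rⁿ) / (n · ln(1/r)); a quick sketch checking both numbers:

```rust
// Reproduces the decay factors quoted in the review: data uniformly spread
// over the last n windows is, on average, multiplied by this factor before
// the divide-by-n happens.
fn avg_decay_factor(n: f64) -> f64 {
    let r: f64 = 0.5_f64.sqrt(); // per-window decay rate, 0.5^0.5
    (1.0 - r.powf(n)) / (n * (1.0 / r).ln())
}

fn main() {
    println!("3 windows: {:.5}", avg_decay_factor(3.0)); // matches 0.62175
    println!("1 window:  {:.5}", avg_decay_factor(1.0)); // matches 0.845111
}
```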
lightning/src/ln/resource_manager.rs (Outdated)

```rust
/// Tracks the occupancy of HTLC slots in the bucket.
slots_occupied: Vec<bool>,

/// SCID -> (slots assigned, salt)
/// Maps short channel IDs to an array of tuples with the slots that the channel is allowed
/// to use and the current usage state for each slot. It also stores the salt used to
/// generate the slots for the channel. This is used to deterministically generate the
/// slots for each channel on restarts.
channels_slots: HashMap<u64, (Vec<(u16, bool)>, [u8; 32])>,
```
Does the protection algorithm break if slots are allocated probabilistically? We could reduce implementation complexity a good bit if we just drop channel_slots entirely and generate the list of slots the channel can occupy any time we need it and allow two channels to occupy the same slot (presumably leading to some extra HTLC failures in that case?). This feels very much like a bloom filter problem where we should be able to reduce FPs somehow, though maybe it isn't quite the same because we actually do want conflicts to be "common".
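The stateless alternative could look roughly like this (an illustrative sketch: DefaultHasher stands in for whatever keyed hash the real implementation would use, and the function name is assumed): candidate slots are derived from (scid, salt, i) on demand, so nothing needs to be stored, at the cost of possible slot collisions between channels.

```rust
// Derive a channel's candidate slots on the fly instead of persisting them.
// Deterministic for a given (scid, salt), so it survives restarts; two
// channels may hash to the same slot, which is the trade-off discussed above.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn slots_for_channel(
    scid: u64, salt: &[u8; 32], per_channel_slots: u16, total_slots: u16,
) -> Vec<u16> {
    (0..per_channel_slots)
        .map(|i| {
            let mut h = DefaultHasher::new();
            scid.hash(&mut h);
            salt.hash(&mut h);
            i.hash(&mut h);
            (h.finish() % u64::from(total_slots)) as u16
        })
        .collect()
}
```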
lightning/src/ln/resource_manager.rs (Outdated)

```rust
    }
}

impl Readable for DefaultResourceManager {
```
Hmmmmmmmmmmmmmmmmmmm. Reconciliation on startup is gonna be tricky here. What happens if we accept an HTLC then restart and actually it never made it to disk in the ChannelMonitor? Theoretically this can be persisted as a part of ChannelManager and it should be consistent-ish, but Val is hard at work making it so that we don't have to persist ChannelManager at all.
Instead, I wonder how easy we can make it to rebuild this from HTLC information. It would require some additional integration into "LDK core" but hopefully not much. If we have some HTLCSlotUsage struct that we return from add_htlc in the ForwardingOutcome::Forward case, we could presumably shove that into the HTLCSource (as the slots are "on" the inbound channel) and rebuild the resource manager very cheaply.
> What happens if we accept an HTLC then restart and actually it never made it to disk in the ChannelMonitor? Theoretically this can be persisted as a part of ChannelManager and it should be consistent-ish, but Val is hard at work making it so that we don't have to persist ChannelManager at all.
hmmmm yeah I thought about that but was operating under the assumption that by persisting along with the ChannelManager it should stay consistent.
In a world where we don't persist the ChannelManager, I was exploring your suggestion to rebuild the resource manager from HTLC data we have on startup and came up with the approach here: elnosh@cdd0bf8. With some caveats, I think we can replay HTLCs by calling add_htlc on the ResourceManager, so we would only need general HTLC information and no need to shove bucket/resource-manager-specific information into HTLCSource. We would basically need this HTLC info on startup. I added 2 helper methods in channel.rs, and the replay on the ChannelManager could look like this: https://github.com/elnosh/rust-lightning/blob/cdd0bf80cb200d370995c4f859645c0a54b3a798/lightning/src/ln/channelmanager.rs#L19303-L19366
With this, I was able to restart a node with pending HTLCs and replay them fine in the resource manager using Channel data. The only field I would need to add to HTLCSource is incoming_accountable.
The caveat is that reputation and in-flight risk when replaying the HTLCs might be somewhat (slightly) different if the shutdown time was long, because the current timestamp is different.
Another approach would be to store the specific bucket usage in the HTLCSource so we replay HTLCs and add them directly to the bucket they were in before shutdown. I went with the previous approach since I think it will be less intrusive in the channel manager and require less resource-manager-specific information to leak into the channel manager. Let me know what you think!
My only question there is what the performance cost is. If we have 500 channels and have to replay a hundred HTLCs per channel how bad does it get?
I'd have to run it but, indeed, it is not optimal because for each outbound HTLC in each channel it needs to look up the inbound HTLC on the incoming channel. It could store the missing fields in the HTLCSource as well to avoid the inbound HTLC lookup.
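The replay-on-startup idea boils down to something like this sketch (all types and names here are hypothetical stand-ins, not LDK's API): instead of deserializing bucket state, each pending forwarded HTLC read from channel data is fed back through add_htlc so usage is rebuilt from scratch.

```rust
// Minimal replay sketch: usage is reconstructed purely from the pending
// HTLCs we can read out of channel state on startup.
use std::collections::HashMap;

#[derive(Default)]
struct BucketUsage {
    slots_used: u16,
    liquidity_used_msat: u64,
}

#[derive(Default)]
struct ResourceManagerSketch {
    usage: HashMap<u64, BucketUsage>, // keyed by outgoing SCID
}

struct ReplayedHtlc {
    outgoing_scid: u64,
    amount_msat: u64,
}

impl ResourceManagerSketch {
    fn add_htlc(&mut self, outgoing_scid: u64, amount_msat: u64) {
        let entry = self.usage.entry(outgoing_scid).or_default();
        entry.slots_used += 1;
        entry.liquidity_used_msat += amount_msat;
    }

    // Replay pending HTLCs read from channel state on startup.
    fn replay(&mut self, pending: &[ReplayedHtlc]) {
        for htlc in pending {
            self.add_htlc(htlc.outgoing_scid, htlc.amount_msat);
        }
    }
}
```

The performance question raised below is about how this loop scales when the per-HTLC work includes cross-channel lookups.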
I have pushed changes for the majority of comments from last round - diff here. The most notable things are:
🔔 1st Reminder Hey @carlaKC! This PR has been waiting for your review.
carlaKC
left a comment
Didn't review tests yet, main comment is about how we handle replays on restart (+ saving needing to persist a few things).
lightning/src/ln/resource_manager.rs (Outdated)

```rust
// TODO: could return the slots already assigned instead of erroring.
Entry::Occupied(_) => Err(()),
```
Meant that assign_slots_for_channel doesn't need &self at all - we can just pass in our_scid + per_channel_slots, return the slots/salt we're adding and then have the caller be responsible for adding these values to self.channel_slots.
Saves us a double lookup because we're looking up the entry in the caller (to see if we need to assign_slots_for_channel and looking up again here).
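The refactor being suggested could be sketched like this (names taken from the thread; the placeholder slot derivation is purely illustrative): slot assignment becomes a free function with no &self, and the caller owns the single channel_slots lookup via the entry API.

```rust
use std::collections::HashMap;

// No &self needed: given the inputs, produce the slots for the channel.
fn assign_slots_for_channel(our_scid: u64, per_channel_slots: u16) -> Vec<u16> {
    // Placeholder derivation; the real version would hash (scid, salt).
    (0..per_channel_slots).map(|i| (our_scid as u16).wrapping_add(i)).collect()
}

fn slots_for(
    channel_slots: &mut HashMap<u64, Vec<u16>>, scid: u64, per_channel_slots: u16,
) -> &Vec<u16> {
    // One lookup total: inserts on first use, returns the existing entry
    // otherwise, so we never look up the same key twice in a row.
    channel_slots
        .entry(scid)
        .or_insert_with(|| assign_slots_for_channel(scid, per_channel_slots))
}
```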
```rust
self.slots_used += 1;
self.liquidity_used += htlc_amount_msat;
```
nit: debug_assert that we never go over our _allocated values?
isn't that caught by the resources_available check above this?
Fair - it's just very cheap to do and a pretty fundamental part of the impl so figured why not?
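The nit amounts to something like this sketch (field names assumed to mirror the snippet above): the asserts cost nothing in release builds and catch accounting bugs in debug builds.

```rust
struct BucketCounters {
    slots_used: u16,
    slots_allocated: u16,
    liquidity_used: u64,
    liquidity_allocated: u64,
}

impl BucketCounters {
    fn occupy(&mut self, htlc_amount_msat: u64) {
        self.slots_used += 1;
        self.liquidity_used += htlc_amount_msat;
        // Availability was already checked upstream; these only document and
        // enforce the invariant in debug builds.
        debug_assert!(self.slots_used <= self.slots_allocated);
        debug_assert!(self.liquidity_used <= self.liquidity_allocated);
    }
}
```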
Pushed changes addressing comments from last review. Changes to point out are to
```rust
fn value_at_timestamp(&mut self, timestamp_unix_secs: u64) -> Result<i64, ()> {
    if timestamp_unix_secs < self.last_updated_unix_secs {
        return Err(());
    }

    let elapsed_secs = (timestamp_unix_secs - self.last_updated_unix_secs) as f64;
    let decay_rate = 0.5_f64.powf(elapsed_secs / self.half_life);
    self.value = (self.value as f64 * decay_rate).round() as i64;
    self.last_updated_unix_secs = timestamp_unix_secs;
```
Issue: value_at_timestamp mutates last_updated_unix_secs on every call, making it a one-way ratchet. If any caller passes a timestamp slightly ahead of the current time (e.g., due to clock skew, NTP adjustment, or VM migration), all subsequent calls with the "correct" time will return Err(()) until the wall clock catches up.
In resolve_htlc, this manifests as: one call with resolved_at slightly in the future poisons the channel's outgoing_reputation.last_updated_unix_secs. All subsequent add_htlc calls fail (line 768 calls value_at_timestamp(added_at) which returns Err), effectively blocking all HTLC forwarding through this outgoing channel until the system clock reaches the poisoned timestamp.
Consider clamping to max(timestamp, last_updated) instead of returning Err, or using monotonic timestamps internally. Alternatively, at minimum, document that callers must guarantee strictly non-decreasing timestamps to avoid bricking a channel's forwarding.
> Consider clamping to max(timestamp, last_updated)
This seems fine to me, it'll just mean that we think our HTLC is held for a second or two more than we expect which isn't critical.
Discussed offline: go with clamping because we need to be able to clear out HTLCs - if we happen to have clock drift on a HTLC remove, it'll get stuck so we can't actually error on the remove path at all or we risk "stuck resources".
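The clamping agreed on above could look roughly like this (a sketch with fields mirroring the quoted snippet; half_life in seconds, struct name illustrative): a stale timestamp is treated as "no time elapsed" instead of an error, so resolves can always clear resources.

```rust
struct DecayingAverageSketch {
    value: i64,
    last_updated_unix_secs: u64,
    half_life: f64, // seconds
}

impl DecayingAverageSketch {
    fn value_at_timestamp(&mut self, timestamp_unix_secs: u64) -> i64 {
        // Clamp: a timestamp behind last_updated decays by zero elapsed time
        // rather than returning Err and wedging the caller.
        let timestamp = timestamp_unix_secs.max(self.last_updated_unix_secs);
        let elapsed_secs = (timestamp - self.last_updated_unix_secs) as f64;
        let decay_rate = 0.5_f64.powf(elapsed_secs / self.half_life);
        self.value = (self.value as f64 * decay_rate).round() as i64;
        self.last_updated_unix_secs = timestamp;
        self.value
    }
}
```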
Implements a decaying average over a rolling window. It will be used in upcoming commits by the resource manager to track reputation and revenue of channels.
The AggregatedWindowAverage implemented here will be used in upcoming commits to track the incoming revenue that channels have generated through HTLC forwards.
Resources available in the channel will be divided into general, congestion and protected resources. Here we implement the general bucket with basic denial of service protections. Co-authored-by: Carla Kirk-Cohen <kirkcohenc@gmail.com>
Resources available in the channel will be divided into general, congestion and protected resources. Here we implement the bucket resources that will be used for congestion and protected.
The Channel struct introduced here has the core information that will be used by the resource manager to make forwarding decisions on HTLCs:
- Reputation that this channel has accrued as an outgoing link in HTLC forwards.
- Revenue (forwarding fees) that the channel has earned us as an incoming link.
- Pending HTLCs this channel is currently holding as an outgoing link.
- Bucket resources that are currently in use in general, congestion and protected.
Introduces the DefaultResourceManager struct. The core methods that will be used to inform HTLC forward decisions are add/resolve_htlc:
- add_htlc: Based on resource availability and reputation, it evaluates whether to forward or fail the HTLC.
- resolve_htlc: Releases the bucket resources used by an HTLC previously added and updates the channel's reputation based on HTLC fees and resolution times.
Adds write and read implementations to persist the DefaultResourceManager.
carlaKC
left a comment
Discussed error handling offline in a bit more detail:
- Error if we create a resource manager that has an invalid configuration (we won't right now, because it's hardcoded, but in general we shouldn't start if we're given a bad config).
- debug_assert that we don't fail on add/remove_htlc in channelmanager.
- Failures on adding: Channel::new failing means that the channel is too small to meaningfully protect; we return an error from add_channel and log in channelmanager.
```rust
}

impl AggregatedWindowAverage {
    fn new(avg_weeks: u8, window_multiplier: u8, start_timestamp_unix_secs: u64) -> Self {
```
We should fail if avg_weeks/window_multiplier are 0?
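A sketch of that validation (signature taken from the quoted snippet; the fallible constructor shape is an assumption): reject zero parameters at construction rather than dividing by a zero-length window later.

```rust
struct AggregatedWindowAverageSketch {
    avg_weeks: u8,
    window_multiplier: u8,
    start_timestamp_unix_secs: u64,
}

impl AggregatedWindowAverageSketch {
    fn new(avg_weeks: u8, window_multiplier: u8, start_timestamp_unix_secs: u64) -> Result<Self, ()> {
        // Either value being 0 makes the window degenerate, so fail fast.
        if avg_weeks == 0 || window_multiplier == 0 {
            return Err(());
        }
        Ok(Self { avg_weeks, window_multiplier, start_timestamp_unix_secs })
    }
}
```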
```rust
let general_slot_allocation =
    u8::max(5, u8::try_from((slots_allocated * 5).div_ceil(100)).unwrap());

let general_liquidity_allocation =
```
nit: can be per_slot_msat so that we can use shorthand field init below
```rust
// General bucket will assign 5 slots of 500 per channel. Max 5 * 500 = 2500
// Adding an HTLC over the amount should return error.
let add_htlc_res = general_bucket.add_htlc(scid, htlc_amount_over_max, &entropy_source);
```
nit: in these tests, add asserts on the expected per_channel_slots/per_slot_msat amounts so that we're certain we're testing the right thing
lightning/src/ln/resource_manager.rs (Outdated)

```rust
debug_assert!(
    general_bucket_slots_allocated >= 5,
    "5 is the minimum we need for general bucket"
);
```
Eurgh, sorry to do this, but I think this should live in GeneralBucket - since the 5 constant isn't something that Channel really knows about?
Likewise for liquidity, put the validation specific to the general bucket in its constructor and leave the rest out here.
```rust
fn read<R: Read>(
    reader: &mut R, args: (u64, u16, &ResourceManagerConfig, &ES),
) -> Result<Self, DecodeError> {
    let (max_htlc_value_in_flight_msat, max_accepted_htlcs, config, entropy_source) = args;
```
```rust
avg_weeks: u8,
window_weeks: u8,
```
- avg_weeks -> average_duration
- window_secs -> tracked_duration
- Use Duration (don't restrict to weeks granularity) + comments on vars
```rust
impl GeneralBucket {
    fn new(scid: u64, slots_allocated: u16, liquidity_allocated: u64) -> Self {
        let general_slot_allocation =
            u8::max(5, u8::try_from((slots_allocated * 5).div_ceil(100)).unwrap());
```
Pull 5 out into a constant, and validate slots_allocated >= 5, error if not.
Also have some sane check on liquidity_allocated to make sure it doesn't round to 0.
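Both suggestions together could look like this sketch (constant names and the standalone-function shape are assumptions; the 5%/minimum-5 arithmetic comes from the quoted snippet):

```rust
// Named constants instead of magic numbers, plus input validation: reject
// channels below the slot minimum and allocations whose per-slot liquidity
// would round down to zero.
const MIN_GENERAL_SLOTS: u16 = 5;
const GENERAL_ALLOCATION_PCT: u64 = 5;

fn general_allocations(slots_allocated: u16, liquidity_allocated: u64) -> Result<(u16, u64), ()> {
    if slots_allocated < MIN_GENERAL_SLOTS {
        return Err(());
    }
    let slots = u16::max(
        MIN_GENERAL_SLOTS,
        (u64::from(slots_allocated) * GENERAL_ALLOCATION_PCT).div_ceil(100) as u16,
    );
    let liquidity = (liquidity_allocated * GENERAL_ALLOCATION_PCT).div_ceil(100);
    // Every slot needs a nonzero share of liquidity to be usable.
    if liquidity / u64::from(slots) == 0 {
        return Err(());
    }
    Ok((slots, liquidity))
}
```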
```rust
impl Channel {
    fn new(
        scid: u64, max_htlc_value_in_flight_msat: u64, max_accepted_htlcs: u16,
        general_bucket_pct: u8, congestion_bucket_pct: u8, reputation_window: Duration,
```
Take BucketAllocations in rather than percentages and totals.
```rust
if resolved_at < pending_htlc.added_at_unix_seconds {
    return Err(());
}
```
To look into: we don't want to get stuck in a scenario where we fail to remove a HTLC and it gets stuck using up resources even through it's been resolved.
We should optimistically remove the HTLC from the outgoing channel so that even if we fail we've still cleaned up our state.
If we know we're going to have several network roundtrips in here, I think it's safe to assume that our clock goes forward for an individual HTLC and keep this error. This is different for decaying averages (/clamping agreed on earlier) because they are dealing with different HTLCs which may happen in close proximity.
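The "optimistically remove" idea can be sketched like this (types and the pending-map shape are hypothetical stand-ins): the HTLC is taken out of pending state first, so resources are freed even when the timestamp check fails.

```rust
use std::collections::HashMap;

struct PendingHtlc {
    added_at_unix_seconds: u64,
    amount_msat: u64,
}

fn resolve_htlc(
    pending: &mut HashMap<u64, PendingHtlc>, htlc_id: u64, resolved_at: u64,
) -> Result<PendingHtlc, ()> {
    // Remove unconditionally: resources are cleaned up regardless of the
    // outcome, so a bad timestamp can never leave a "stuck" HTLC behind.
    let htlc = pending.remove(&htlc_id).ok_or(())?;
    if resolved_at < htlc.added_at_unix_seconds {
        // Clock went backwards; state is already cleaned up, we only skip
        // the reputation update.
        return Err(());
    }
    Ok(htlc)
}
```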
Part of #4384

This PR introduces a ResourceManager trait and a DefaultResourceManager implementation of that trait which is based on the proposed mitigation in lightning/bolts#1280. It only covers the standalone implementation of the mitigation. I have done some testing with integrating it into the ChannelManager but that can be done separately. As mentioned in the issue, the resource manager trait defines these 4 methods to be called from the channel manager:
- add_channel
- remove_channel
- add_htlc
- resolve_htlc

Integrating into the ChannelManager

The ResourceManager is intended to be internal to the ChannelManager rather than users instantiating their own and passing it to a ChannelManager constructor.
- add/remove_channel should be called when channels are opened/closed.
- add_htlc: When processing HTLCs, the channel manager would call add_htlc, which returns a ForwardingOutcome telling it whether to forward or fail the HTLC, along with the accountable signal to use in case it should be forwarded. For the initial "read-only" mode, the channel manager would log the results but not actually fail the HTLC if it was told to do so. A bit more specific on where it would be called: I think it will be when processing the forward_htlcs before we queue the add_htlc to the outgoing channel (rust-lightning/lightning/src/ln/channelmanager.rs, Line 7650 in caf0aac).
- resolve_htlc: Used to tell the ResourceManager the resolution of an HTLC. It will be used to release bucket resources and update reputation/revenue values internally.

This could have more tests but opening early to get thoughts on design if possible.

cc @carlaKC