resource manager trait and impl #4409
Conversation
👋 Thanks for assigning @carlaKC as a reviewer!
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #4409      +/-   ##
==========================================
+ Coverage   86.03%   86.29%   +0.26%
==========================================
  Files         156      161       +5
  Lines      103091   109078    +5987
  Branches   103091   109078    +5987
==========================================
+ Hits        88690    94134    +5444
- Misses      11891    12286     +395
- Partials     2510     2658     +148
```
carlaKC
left a comment
Really great job on this! I've done an overly-specific first review round for something that's still in draft because I've looked at previous versions of this code before, when we wrote simulations. I also haven't looked at the tests in detail yet, but coverage is looking ✨ great ✨.
I think tracking slot usage in GeneralBucket with a single source of truth is worth exploring; it seems like it could clean up a few places where we currently need two hashmap lookups one after the other.
In the interest of one day fuzzing this, I think it could also use some validation that enforces our protocol assumptions (e.g., number of slots <= 483).
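A minimal sketch of such a validation check (the 483 cap comes from BOLT 2's max_accepted_htlcs limit; the constant and function names here are illustrative, not from the PR):

```rust
// Hypothetical config check, not the PR's code: BOLT 2 limits
// max_accepted_htlcs to 483, so a slot allocation of zero or beyond that
// violates protocol assumptions and should be rejected before anything else.
const PROTOCOL_MAX_HTLC_SLOTS: u16 = 483;

fn validate_slot_config(slots_allocated: u16) -> Result<(), ()> {
    if slots_allocated == 0 || slots_allocated > PROTOCOL_MAX_HTLC_SLOTS {
        return Err(());
    }
    Ok(())
}
```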
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer?
I think I have addressed most of the comments code-wise. Still need to add some requested comments/docs changes.
pushed more fixups addressing requests for adding docs/comments, lmk if those look good |
lightning/src/ln/resource_manager.rs (Outdated)

```rust
/// Tracks the occupancy of HTLC slots in the bucket.
slots_occupied: Vec<bool>,

/// SCID -> (slots assigned, salt)
/// Maps short channel IDs to an array of tuples with the slots that the channel is allowed
/// to use and the current usage state for each slot. It also stores the salt used to
/// generate the slots for the channel. This is used to deterministically generate the
/// slots for each channel on restarts.
channels_slots: HashMap<u64, (Vec<(u16, bool)>, [u8; 32])>,
```
> this shouldn't accidentally double-assign them.
Yeah it shouldn't (provided we don't have bugs), but tracking the same information (whether a slot is occupied) in multiple places is a design that allows for inconsistency / the possibility of bugs. If we have a single source of truth, we move from "shouldn't double assign" to "can't double assign".
Gave it a shot here, lmk what you think!
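A rough sketch of what a single source of truth could look like (names and layout assumed, not the PR's final code): occupancy lives only in one vector, so a slot can never be marked free in one place and busy in another.

```rust
// Sketch: `slots` is the only record of which slot is in use; channels_slots
// stores only which slot indices a channel may use, with no duplicate
// usage bits to drift out of sync.
use std::collections::HashMap;

struct GeneralBucketSketch {
    // index = slot number, value = SCID currently occupying it (if any).
    slots: Vec<Option<u64>>,
    // SCID -> (candidate slot indices, salt).
    channels_slots: HashMap<u64, (Vec<u16>, [u8; 32])>,
}

impl GeneralBucketSketch {
    fn occupy_slot(&mut self, scid: u64) -> Option<u16> {
        let candidates = self.channels_slots.get(&scid)?.0.clone();
        for slot in candidates {
            let entry = &mut self.slots[usize::from(slot)];
            if entry.is_none() {
                *entry = Some(scid);
                return Some(slot);
            }
        }
        None // all of this channel's candidate slots are occupied
    }
}
```

With this layout, "can't double assign" falls out of the data structure rather than relying on the two maps staying consistent.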
First of all, not sure why all your commit messages are line-wrapped at 40 chars, but you can use like 60 or 70 lol.
TheBlueMatt
left a comment
A few comments, I think the design is fine, but startup resync may be annoying.
lightning/src/ln/resource_manager.rs (Outdated)

```rust
    }
}

/// Tracks an average value over multiple rolling windows to smooth out volatility.
```
I'm kinda confused by this struct. First of all, the docs here are wrong - we aren't tracking "multiple windows" we're tracking a rolling average over one window of window * window_count. The only difference between this and DecayingAverage is it tries to compensate for if we don't have enough data to actually go back window_count * window. Why shouldn't we just have DecayingAverage do that instead of having a separate struct here?
I think it makes sense to keep separate because the use of DecayingAverage for reputation differs from AggregatedWindowAverage when tracking revenue. For reputation, we want the DecayingAverage over the full window (24 weeks). For revenue, using AggregatedWindowAverage, we track the decaying average over the same window (24 weeks) but divide by window_count because we want the revenue for 2 weeks.
I agree that we want to track two different things here:
- Reputation (as DecayingAverage): we want shocks to reflect, so that we can quickly react to a change in attacker behavior.
- Revenue (as AggregatedWindowAverage): we want to smooth shocks to track our peer's average revenue in two weeks over window_count periods.
But ran some numbers and it does look like we're penalizing old data a bit too much with this approach, as mentioned below.
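To make the distinction concrete, a rough numeric sketch (assuming the value decays as 0.5^(elapsed / half_life) with the half-life equal to the tracked window, and the 24-week window / 2-week reporting period mentioned above, i.e. window_count = 12; all names here are illustrative):

```rust
// Reputation reacts to shocks over its full window; revenue is the same kind
// of decayed total, divided by window_count to report a per-window figure.
fn decayed(value: f64, elapsed_secs: f64, half_life_secs: f64) -> f64 {
    value * 0.5_f64.powf(elapsed_secs / half_life_secs)
}

fn main() {
    const WEEK: f64 = 604_800.0;
    // Reputation with a 2-week half-life, one week after the last update:
    let reputation = decayed(1_000.0, WEEK, 2.0 * WEEK);
    // Revenue decayed over the full 24-week window, then averaged down to a
    // single 2-week window (window_count = 12):
    let revenue_per_window = decayed(120_000.0, WEEK, 24.0 * WEEK) / 12.0;
    println!("reputation {reputation:.1}, per-window revenue {revenue_per_window:.1}");
}
```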
```rust
struct DecayingAverage {
    value: i64,
    last_updated_unix_secs: u64,
    window: Duration,
```
You don't actually use window (only decay_rate) so we can drop it here.
yeah, I was keeping window for the decay_rate when reading back here https://github.com/elnosh/rust-lightning/blob/90943195bee498f34247f65a68a8511d57997aae/lightning/src/ln/resource_manager.rs#L1042-L1045
Seems okay to me to just write the decay_rate directly. We'd only need the window if we wanted to change the way that we calculate it, and that seems unlikely?
lightning/src/ln/resource_manager.rs (Outdated)

```rust
// We are not concerned with the rounding precision loss for this value because it is
// negligible when dealing with a long rolling average.
Ok((self.aggregated_revenue_decaying.value_at_timestamp(timestamp_unix_secs)? as f64
    / window_divisor)
```
I don't buy this? Let's say our windows_tracked is 4 and we have some data for the last 3 windows. On average, those 3 windows' worth of data will have been multiplied by 0.62175 (https://www.wolframalpha.com/input?i=%28integral+from+0+to+3+%280.5+%5E+0.5%29+%5E+x%29+%2F+3) but then we divide it by three. Whereas if we only have data for a single window, that data will be multiplied by, on average, 0.845111 (https://www.wolframalpha.com/input?i=%28integral+from+0+to+1+%280.5+%5E+0.5%29+%5E+x%29+%2F+1), and then we'll divide by one. We have to factor in the decrease in the data from the decay as well as just the increased amount of data here.
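The two WolframAlpha factors above can be reproduced in closed form. With a per-window decay rate r = 0.5^0.5, the average factor applied to data spread uniformly over the last n windows is (1/n) · ∫₀ⁿ rˣ dx = (1 − rⁿ) / (n · ln(1/r)); a quick sketch checking both numbers:

```rust
// Reproduces the decay factors quoted in the review: data uniformly spread
// over the last n windows is, on average, multiplied by this factor before
// the divide-by-n happens.
fn avg_decay_factor(n: f64) -> f64 {
    let r: f64 = 0.5_f64.sqrt(); // per-window decay rate, 0.5^0.5
    (1.0 - r.powf(n)) / (n * (1.0 / r).ln())
}

fn main() {
    println!("3 windows: {:.5}", avg_decay_factor(3.0)); // matches 0.62175
    println!("1 window:  {:.5}", avg_decay_factor(1.0)); // matches 0.845111
}
```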
lightning/src/ln/resource_manager.rs (Outdated)

```rust
/// Tracks the occupancy of HTLC slots in the bucket.
slots_occupied: Vec<bool>,

/// SCID -> (slots assigned, salt)
/// Maps short channel IDs to an array of tuples with the slots that the channel is allowed
/// to use and the current usage state for each slot. It also stores the salt used to
/// generate the slots for the channel. This is used to deterministically generate the
/// slots for each channel on restarts.
channels_slots: HashMap<u64, (Vec<(u16, bool)>, [u8; 32])>,
```
Does the protection algorithm break if slots are allocated probabilistically? We could reduce implementation complexity a good bit if we just drop channel_slots entirely and generate the list of slots the channel can occupy any time we need it and allow two channels to occupy the same slot (presumably leading to some extra HTLC failures in that case?). This feels very much like a bloom filter problem where we should be able to reduce FPs somehow, though maybe it isn't quite the same because we actually do want conflicts to be "common".
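The stateless alternative could look roughly like this (an illustrative sketch: DefaultHasher stands in for whatever keyed hash the real implementation would use, and the function name is assumed): candidate slots are derived from (scid, salt, i) on demand, so nothing needs to be stored, at the cost of possible slot collisions between channels.

```rust
// Derive a channel's candidate slots on the fly instead of persisting them.
// Deterministic for a given (scid, salt), so it survives restarts; two
// channels may hash to the same slot, which is the trade-off discussed above.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn slots_for_channel(
    scid: u64, salt: &[u8; 32], per_channel_slots: u16, total_slots: u16,
) -> Vec<u16> {
    (0..per_channel_slots)
        .map(|i| {
            let mut h = DefaultHasher::new();
            scid.hash(&mut h);
            salt.hash(&mut h);
            i.hash(&mut h);
            (h.finish() % u64::from(total_slots)) as u16
        })
        .collect()
}
```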
lightning/src/ln/resource_manager.rs (Outdated)

```rust
    }
}

impl Readable for DefaultResourceManager {
```
Hmmmmmmmmmmmmmmmmmmm. Reconciliation on startup is gonna be tricky here. What happens if we accept an HTLC then restart and actually it never made it to disk in the ChannelMonitor? Theoretically this can be persisted as a part of ChannelManager and it should be consistent-ish, but Val is hard at work making it so that we don't have to persist ChannelManager at all.
Instead, I wonder how easy we can make it to rebuild this from HTLC information. It would require some additional integration into "LDK core" but hopefully not much. If we have some HTLCSlotUsage struct that we return from add_htlc in the ForwardingOutcome::Forward case, we could presumably shove that into the HTLCSource (as the slots are "on" the inbound channel) and rebuild the resource manager very cheaply.
> What happens if we accept an HTLC then restart and actually it never made it to disk in the ChannelMonitor? Theoretically this can be persisted as a part of ChannelManager and it should be consistent-ish, but Val is hard at work making it so that we don't have to persist ChannelManager at all.
hmmmm yeah I thought about that but was operating under the assumption that by persisting along with the ChannelManager it should stay consistent.
In a world where we don't persist the ChannelManager, I was exploring your suggestion to rebuild the resource manager from HTLC data we have on startup and came up with the approach here: elnosh@cdd0bf8. With some caveats, I think we can replay HTLCs by calling add_htlc on the ResourceManager, so we would only need general HTLC information and no need to shove bucket/resource-manager-specific information into HTLCSource. We would basically need this HTLC info on startup. I added 2 helper methods in channel.rs, and the replay on the ChannelManager could look like this: https://github.com/elnosh/rust-lightning/blob/cdd0bf80cb200d370995c4f859645c0a54b3a798/lightning/src/ln/channelmanager.rs#L19303-L19366
With this, I was able to restart a node with pending HTLCs and replay them fine in the resource manager using Channel data. The only field I would need to add to HTLCSource is incoming_accountable.
The caveat is that reputation and in-flight risk when replaying the HTLCs might be somewhat (slightly) different if the shutdown time was long, because the current timestamp is different.
Another approach would be to store the specific bucket usage in the HTLCSource so we replay HTLCs and add them directly to the bucket they were in before shutdown. I went with the previous approach since I think it will be less intrusive in the channel manager and require less resource-manager-specific information to leak into the channel manager. Let me know what you think!
My only question there is what the performance cost is. If we have 500 channels and have to replay a hundred HTLCs per channel how bad does it get?
I'd have to run it but, indeed, it is not optimal because for each outbound HTLC in each channel it needs to look up the inbound HTLC on the incoming channel. It could store the missing fields in the HTLCSource as well to avoid the inbound HTLC lookup.
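The replay-on-startup idea boils down to something like this sketch (all types and names here are hypothetical stand-ins, not LDK's API): instead of deserializing bucket state, each pending forwarded HTLC read from channel data is fed back through add_htlc so usage is rebuilt from scratch.

```rust
// Minimal replay sketch: usage is reconstructed purely from the pending
// HTLCs we can read out of channel state on startup.
use std::collections::HashMap;

#[derive(Default)]
struct BucketUsage {
    slots_used: u16,
    liquidity_used_msat: u64,
}

#[derive(Default)]
struct ResourceManagerSketch {
    usage: HashMap<u64, BucketUsage>, // keyed by outgoing SCID
}

struct ReplayedHtlc {
    outgoing_scid: u64,
    amount_msat: u64,
}

impl ResourceManagerSketch {
    fn add_htlc(&mut self, outgoing_scid: u64, amount_msat: u64) {
        let entry = self.usage.entry(outgoing_scid).or_default();
        entry.slots_used += 1;
        entry.liquidity_used_msat += amount_msat;
    }

    // Replay pending HTLCs read from channel state on startup.
    fn replay(&mut self, pending: &[ReplayedHtlc]) {
        for htlc in pending {
            self.add_htlc(htlc.outgoing_scid, htlc.amount_msat);
        }
    }
}
```

The performance question raised below is about how this loop scales when the per-HTLC work includes cross-channel lookups.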
I have pushed changes for the majority of comments from last round - diff here. The most notable things are:
🔔 1st Reminder Hey @carlaKC! This PR has been waiting for your review.
carlaKC
left a comment
Didn't review tests yet, main comment is about how we handle replays on restart (+ saving needing to persist a few things).
lightning/src/ln/resource_manager.rs (Outdated)

```rust
// TODO: could return the slots already assigned instead of erroring.
Entry::Occupied(_) => Err(()),
```
Meant that assign_slots_for_channel doesn't need &self at all - we can just pass in our_scid + per_channel_slots, return the slots/salt we're adding and then have the caller be responsible for adding these values to self.channel_slots.
Saves us a double lookup because we're looking up the entry in the caller (to see if we need to assign_slots_for_channel and looking up again here).
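The refactor being suggested could be sketched like this (names taken from the thread; the placeholder slot derivation is purely illustrative): slot assignment becomes a free function with no &self, and the caller owns the single channel_slots lookup via the entry API.

```rust
use std::collections::HashMap;

// No &self needed: given the inputs, produce the slots for the channel.
fn assign_slots_for_channel(our_scid: u64, per_channel_slots: u16) -> Vec<u16> {
    // Placeholder derivation; the real version would hash (scid, salt).
    (0..per_channel_slots).map(|i| (our_scid as u16).wrapping_add(i)).collect()
}

fn slots_for(
    channel_slots: &mut HashMap<u64, Vec<u16>>, scid: u64, per_channel_slots: u16,
) -> &Vec<u16> {
    // One lookup total: inserts on first use, returns the existing entry
    // otherwise, so we never look up the same key twice in a row.
    channel_slots
        .entry(scid)
        .or_insert_with(|| assign_slots_for_channel(scid, per_channel_slots))
}
```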
```rust
self.slots_used += 1;
self.liquidity_used += htlc_amount_msat;
```
nit: debug_assert that we never go over our _allocated values?
isn't that caught by the resources_available check above this?
Fair - it's just very cheap to do and a pretty fundamental part of the impl so figured why not?
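The nit amounts to something like this sketch (field names assumed to mirror the snippet above): the asserts cost nothing in release builds and catch accounting bugs in debug builds.

```rust
struct BucketCounters {
    slots_used: u16,
    slots_allocated: u16,
    liquidity_used: u64,
    liquidity_allocated: u64,
}

impl BucketCounters {
    fn occupy(&mut self, htlc_amount_msat: u64) {
        self.slots_used += 1;
        self.liquidity_used += htlc_amount_msat;
        // Availability was already checked upstream; these only document and
        // enforce the invariant in debug builds.
        debug_assert!(self.slots_used <= self.slots_allocated);
        debug_assert!(self.liquidity_used <= self.liquidity_allocated);
    }
}
```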
Pushed changes addressing comments from last review. Changes to point out are to
```rust
fn value_at_timestamp(&mut self, timestamp_unix_secs: u64) -> Result<i64, ()> {
    if timestamp_unix_secs < self.last_updated_unix_secs {
        return Err(());
    }

    let elapsed_secs = (timestamp_unix_secs - self.last_updated_unix_secs) as f64;
    let decay_rate = 0.5_f64.powf(elapsed_secs / self.half_life);
    self.value = (self.value as f64 * decay_rate).round() as i64;
    self.last_updated_unix_secs = timestamp_unix_secs;
```
Issue: value_at_timestamp mutates last_updated_unix_secs on every call, making it a one-way ratchet. If any caller passes a timestamp slightly ahead of the current time (e.g., due to clock skew, NTP adjustment, or VM migration), all subsequent calls with the "correct" time will return Err(()) until the wall clock catches up.
In resolve_htlc, this manifests as: one call with resolved_at slightly in the future poisons the channel's outgoing_reputation.last_updated_unix_secs. All subsequent add_htlc calls fail (line 768 calls value_at_timestamp(added_at) which returns Err), effectively blocking all HTLC forwarding through this outgoing channel until the system clock reaches the poisoned timestamp.
Consider clamping to max(timestamp, last_updated) instead of returning Err, or using monotonic timestamps internally. Alternatively, at minimum, document that callers must guarantee strictly non-decreasing timestamps to avoid bricking a channel's forwarding.
> Consider clamping to max(timestamp, last_updated)
This seems fine to me, it'll just mean that we think our HTLC is held for a second or two more than we expect which isn't critical.
Discussed offline: go with clamping because we need to be able to clear out HTLCs - if we happen to have clock drift on a HTLC remove, it'll get stuck so we can't actually error on the remove path at all or we risk "stuck resources".
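The clamping agreed on above could look roughly like this (a sketch with fields mirroring the quoted snippet; half_life in seconds, struct name illustrative): a stale timestamp is treated as "no time elapsed" instead of an error, so resolves can always clear resources.

```rust
struct DecayingAverageSketch {
    value: i64,
    last_updated_unix_secs: u64,
    half_life: f64, // seconds
}

impl DecayingAverageSketch {
    fn value_at_timestamp(&mut self, timestamp_unix_secs: u64) -> i64 {
        // Clamp: a timestamp behind last_updated decays by zero elapsed time
        // rather than returning Err and wedging the caller.
        let timestamp = timestamp_unix_secs.max(self.last_updated_unix_secs);
        let elapsed_secs = (timestamp - self.last_updated_unix_secs) as f64;
        let decay_rate = 0.5_f64.powf(elapsed_secs / self.half_life);
        self.value = (self.value as f64 * decay_rate).round() as i64;
        self.last_updated_unix_secs = timestamp;
        self.value
    }
}
```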
Implements a decaying average over a rolling window. It will be used in upcoming commits by the resource manager to track reputation and revenue of channels.
The AggregatedWindowAverage implemented here will be used in upcoming commits to track the incoming revenue that channels have generated through HTLC forwards.
Resources available in the channel will be divided into general, congestion and protected resources. Here we implement the general bucket with basic denial of service protections. Co-authored-by: Carla Kirk-Cohen <kirkcohenc@gmail.com>
Resources available in the channel will be divided into general, congestion and protected resources. Here we implement the bucket resources that will be used for congestion and protected.
The Channel struct introduced here has the core information that will be used by the resource manager to make forwarding decisions on HTLCs:
- Reputation that this channel has accrued as an outgoing link in HTLC forwards.
- Revenue (forwarding fees) that the channel has earned us as an incoming link.
- Pending HTLCs this channel is currently holding as an outgoing link.
- Bucket resources that are currently in use in general, congestion and protected.
Introduces the DefaultResourceManager struct. The core methods that will be used to inform HTLC forward decisions are add/resolve_htlc:
- add_htlc: Based on resource availability and reputation, it evaluates whether to forward or fail the HTLC.
- resolve_htlc: Releases the bucket resources used by an HTLC previously added and updates the channel's reputation based on HTLC fees and resolution times.
Adds write and read implementations to persist the DefaultResourceManager.
carlaKC
left a comment
Discussed error handling offline in a bit more detail:
- Error if we create a resource manager that has an invalid configuration (we won't right now, because it's hardcoded, but in general we shouldn't start if we're given a bad config).
- debug_assert that we don't fail on add/remove_htlc in channelmanager.
- Failures on adding: Channel::new failing means that the channel is too small to meaningfully protect; we return an error from add_channel and log in channelmanager.
```rust
}

impl AggregatedWindowAverage {
    fn new(avg_weeks: u8, window_multiplier: u8, start_timestamp_unix_secs: u64) -> Self {
```
We should fail if avg_weeks/window_multiplier are 0?
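A sketch of that validation (signature taken from the quoted snippet; the fallible constructor shape is an assumption): reject zero parameters at construction rather than dividing by a zero-length window later.

```rust
struct AggregatedWindowAverageSketch {
    avg_weeks: u8,
    window_multiplier: u8,
    start_timestamp_unix_secs: u64,
}

impl AggregatedWindowAverageSketch {
    fn new(avg_weeks: u8, window_multiplier: u8, start_timestamp_unix_secs: u64) -> Result<Self, ()> {
        // Either value being 0 makes the window degenerate, so fail fast.
        if avg_weeks == 0 || window_multiplier == 0 {
            return Err(());
        }
        Ok(Self { avg_weeks, window_multiplier, start_timestamp_unix_secs })
    }
}
```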
```rust
let general_slot_allocation =
    u8::max(5, u8::try_from((slots_allocated * 5).div_ceil(100)).unwrap());

let general_liquidity_allocation =
```
nit: can be per_slot_msat so that we can use shorthand field init below
```rust
// General bucket will assign 5 slots of 500 per channel. Max 5 * 500 = 2500
// Adding an HTLC over the amount should return error.
let add_htlc_res = general_bucket.add_htlc(scid, htlc_amount_over_max, &entropy_source);
```
nit: in these tests, add asserts on the expected per_channel_slots/per_slot_msat amounts so that we're certain we're testing the right thing
lightning/src/ln/resource_manager.rs (Outdated)

```rust
debug_assert!(
    general_bucket_slots_allocated >= 5,
    "5 is the minimum we need for general bucket"
);
```
Eurgh, sorry to do this, but I think this should live in GeneralBucket - since the 5 constant isn't something that Channel really knows about?
Likewise for liquidity, put the validation specific to the general bucket in its constructor and leave the rest out here.
```rust
fn read<R: Read>(
    reader: &mut R, args: (u64, u16, &ResourceManagerConfig, &ES),
) -> Result<Self, DecodeError> {
    let (max_htlc_value_in_flight_msat, max_accepted_htlcs, config, entropy_source) = args;
```
```rust
avg_weeks: u8,
window_weeks: u8,
```
- avg_weeks -> average_duration
- window_secs -> tracked_duration
- Use Duration (don't restrict to weeks granularity) + comments on vars
```rust
impl GeneralBucket {
    fn new(scid: u64, slots_allocated: u16, liquidity_allocated: u64) -> Self {
        let general_slot_allocation =
            u8::max(5, u8::try_from((slots_allocated * 5).div_ceil(100)).unwrap());
```
Pull 5 out into a constant, and validate slots_allocated >= 5, error if not.
Also have some sane check on liquidity_allocated to make sure it doesn't round to 0.
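Both suggestions together could look like this sketch (constant names and the standalone-function shape are assumptions; the 5%/minimum-5 arithmetic comes from the quoted snippet):

```rust
// Named constants instead of magic numbers, plus input validation: reject
// channels below the slot minimum and allocations whose per-slot liquidity
// would round down to zero.
const MIN_GENERAL_SLOTS: u16 = 5;
const GENERAL_ALLOCATION_PCT: u64 = 5;

fn general_allocations(slots_allocated: u16, liquidity_allocated: u64) -> Result<(u16, u64), ()> {
    if slots_allocated < MIN_GENERAL_SLOTS {
        return Err(());
    }
    let slots = u16::max(
        MIN_GENERAL_SLOTS,
        (u64::from(slots_allocated) * GENERAL_ALLOCATION_PCT).div_ceil(100) as u16,
    );
    let liquidity = (liquidity_allocated * GENERAL_ALLOCATION_PCT).div_ceil(100);
    // Every slot needs a nonzero share of liquidity to be usable.
    if liquidity / u64::from(slots) == 0 {
        return Err(());
    }
    Ok((slots, liquidity))
}
```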
```rust
impl Channel {
    fn new(
        scid: u64, max_htlc_value_in_flight_msat: u64, max_accepted_htlcs: u16,
        general_bucket_pct: u8, congestion_bucket_pct: u8, reputation_window: Duration,
```
Take BucketAllocations in rather than percentages and totals.
```rust
if resolved_at < pending_htlc.added_at_unix_seconds {
    return Err(());
}
```
To look into: we don't want to get stuck in a scenario where we fail to remove a HTLC and it gets stuck using up resources even through it's been resolved.
We should optimistically remove the HTLC from the outgoing channel so that even if we fail we've still cleaned up our state.
If we know we're going to have several network roundtrips in here, I think it's safe to assume that our clock goes forward for an individual HTLC and keep this error. This is different for decaying averages (/clamping agreed on earlier) because they are dealing with different HTLCs which may happen in close proximity.
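The "optimistically remove" idea can be sketched like this (types and the pending-map shape are hypothetical stand-ins): the HTLC is taken out of pending state first, so resources are freed even when the timestamp check fails.

```rust
use std::collections::HashMap;

struct PendingHtlc {
    added_at_unix_seconds: u64,
    amount_msat: u64,
}

fn resolve_htlc(
    pending: &mut HashMap<u64, PendingHtlc>, htlc_id: u64, resolved_at: u64,
) -> Result<PendingHtlc, ()> {
    // Remove unconditionally: resources are cleaned up regardless of the
    // outcome, so a bad timestamp can never leave a "stuck" HTLC behind.
    let htlc = pending.remove(&htlc_id).ok_or(())?;
    if resolved_at < htlc.added_at_unix_seconds {
        // Clock went backwards; state is already cleaned up, we only skip
        // the reputation update.
        return Err(());
    }
    Ok(htlc)
}
```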
Part of #4384

This PR introduces a ResourceManager trait and a DefaultResourceManager implementation of that trait which is based on the proposed mitigation in lightning/bolts#1280. It only covers the standalone implementation of the mitigation. I have done some testing with integrating it into the ChannelManager but that can be done separately. As mentioned in the issue, the resource manager trait defines these 4 methods to be called from the channel manager:
- add_channel
- remove_channel
- add_htlc
- resolve_htlc

Integrating into the ChannelManager

The ResourceManager is intended to be internal to the ChannelManager rather than users instantiating their own and passing it to a ChannelManager constructor.
- add/remove_channel should be called when channels are opened/closed.
- add_htlc: When processing HTLCs, the channel manager would call add_htlc, which returns a ForwardingOutcome telling it whether to forward or fail the HTLC, along with the accountable signal to use in case it should be forwarded. For the initial "read-only" mode, the channel manager would log the results but not actually fail the HTLC if it was told to do so. A bit more specific on where it would be called: I think it will be when processing the forward_htlcs before we queue the add_htlc to the outgoing channel (rust-lightning/lightning/src/ln/channelmanager.rs, Line 7650 in caf0aac).
- resolve_htlc: Used to tell the ResourceManager the resolution of an HTLC. It will be used to release bucket resources and update reputation/revenue values internally.

This could have more tests but opening early to get thoughts on design if possible.

cc @carlaKC