fix(transport): drop rate limited events #953
Conversation
Per the DangerJS check, should I be adding a line to the changelog? I assumed that should be left to maintainers, but happy to do that if it would be helpful.
whatyouhide
left a comment
Super solid work here. I loved reading through this PR! I left some comments, but the only kind-of-major one is the one about testing. Let me know if you want to talk about it!
lib/sentry/client_report/sender.ex
Outdated
| Sentry.CheckIn.t()
| ClientReport.t()
| Sentry.Event.t()
| Sentry.Transaction.t()
This fix is unrelated right?
I think this was because of a dialyzer complaint, but now when I remove it I don't seem to be getting it anymore. But I noticed that the "item" type in that function doesn't match the type provided for "items" in Sentry.Envelope.t().
I thought about centralizing that into a Sentry.Envelope.item() type that could be reused, but decided to just do this for now to keep the update smaller. Since dialyzer doesn't seem to care now, I'm happy to remove it if you'd prefer not to have it in this PR.
Yes let's remove it and possibly go with the Sentry.Envelope.item() type in a separate PR 👍
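For reference, a centralized type could look something like this. This is purely a sketch of the idea (the module name here is hypothetical, and the exact union would need to mirror the items field of Sentry.Envelope.t()):

```elixir
defmodule EnvelopeItemSketch do
  # Hypothetical: one shared item type, replacing the union that currently
  # appears both in the sender spec and in Sentry.Envelope.t()'s items field.
  # Remote types are resolved lazily, so this compiles without those modules.
  @type item ::
          Sentry.CheckIn.t()
          | Sentry.ClientReport.t()
          | Sentry.Event.t()
          | Sentry.Transaction.t()

  # Call sites could then write [EnvelopeItemSketch.item()] in their @spec
  # instead of repeating the four-way union.
end
```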
Done
lib/sentry/transport/rate_limiter.ex
Outdated
"""
@spec rate_limited?(String.t()) :: boolean()
def rate_limited?(category) do
Categories here are a fixed set, right? Maybe we could be more specific on the strings we expect here?
According to the Sentry docs, this is the list of possible categories:
- default
- error
- transaction
- monitor
- span
- log_item
- security
- attachment
- session
- profile
- profile_chunk
- replay
- feedback
- trace_metric
- internal
It looks like right now the library only supports the following categories:
- error
- transaction
- monitor
- attachment
- internal
We could enforce that on this function, but the downside I see to that is that it adds one more place where changes have to be made if new categories are added in the future. I think if we wanted to do that, I would probably recommend that we modify Sentry.Envelope.get_data_category/1 to return an atom instead of a string, and then we could create a Sentry.Envelope.data_category() type that would represent the set of possible category atoms like :error | :transaction | :monitor | :attachment | :internal just to make it more clear what is being supported.
That seems like it would probably be outside the scope of this PR, but I'm open to making that change if you think it would be best to do it now.
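To illustrate the idea, a sketch of the atom-returning variant might look like this. Note this is an assumption, not the library's code: get_data_category/1 currently returns strings, and the struct-to-category mapping below is inferred from the category list above.

```elixir
defmodule DataCategorySketch do
  # Hypothetical Sentry.Envelope.data_category() type covering the
  # categories the library currently supports.
  @type data_category :: :error | :transaction | :monitor | :attachment | :internal

  # Hypothetical atom-returning version of Sentry.Envelope.get_data_category/1.
  # Matching on the __struct__ key (rather than %Sentry.Event{}) keeps this
  # sketch compilable without the real struct modules.
  @spec get_data_category(struct()) :: data_category()
  def get_data_category(%{__struct__: Sentry.Event}), do: :error
  def get_data_category(%{__struct__: Sentry.Transaction}), do: :transaction
  def get_data_category(%{__struct__: Sentry.CheckIn}), do: :monitor
  def get_data_category(%{__struct__: Sentry.Attachment}), do: :attachment
  def get_data_category(%{__struct__: Sentry.ClientReport}), do: :internal
end
```

With this in place, rate_limited?/1 could take the data_category() type instead of an open-ended String.t().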
lib/sentry/transport/rate_limiter.ex
Outdated
@spec rate_limited?(String.t()) :: boolean()
def rate_limited?(category) do
  now = System.system_time(:second)
  check_rate_limited(category, now) or check_rate_limited(:global, now)
I know this is really fast, but given we're doing this for every event, I wonder if a single ets:select pass to check both categories in one go would perform better. Maybe not, but it could be interesting to benchmark.
I just set this up with Benchee. Please let me know if you see any issues with the logic of the benchmark:
Benchmark code
```elixir
Mix.install([{:benchee, "~> 1.0"}])

# Create a table that will have a rate limit for errors, but no global limit
:ets.new(NoGlobalLimit, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(NoGlobalLimit, {"error", 1764026065})

# Create a second table with a global rate limit and category rate limits
:ets.new(GlobalLimit, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(GlobalLimit, {:global, 1764028122})
:ets.insert(GlobalLimit, {"error", 1764026065})

Benchee.run(%{
  "select (no results)" => fn ->
    :ets.select(NoGlobalLimit, [
      {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
      {{"transaction", :"$1"}, [], [{{"transaction", :"$1"}}]}
    ])
  end,
  "select (category limit only)" => fn ->
    :ets.select(NoGlobalLimit, [
      {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
      {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
    ])
  end,
  "select (global limit only)" => fn ->
    :ets.select(GlobalLimit, [
      {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
      {{"transaction", :"$1"}, [], [{{"transaction", :"$1"}}]}
    ])
  end,
  "select (global and category limits)" => fn ->
    :ets.select(GlobalLimit, [
      {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
      {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
    ])
  end,
  "lookup (no results)" => fn ->
    :ets.lookup(NoGlobalLimit, "transaction")
    :ets.lookup(NoGlobalLimit, :global)
  end,
  "lookup (category limit only)" => fn ->
    :ets.lookup(NoGlobalLimit, "error")
    # won't perform the global lookup since the category rate limit was found
  end,
  "lookup (global limit only)" => fn ->
    :ets.lookup(GlobalLimit, "transaction")
    :ets.lookup(GlobalLimit, :global)
  end,
  "lookup (global and category limits)" => fn ->
    :ets.lookup(GlobalLimit, "error")
    # won't perform the global lookup since the category rate limit was found
  end
})
```

And these were the results:
Full Benchee output
```
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.18.3
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s

Excluding outliers: false

Benchmarking lookup (category limit only) ...
Benchmarking lookup (global and category limits) ...
Benchmarking lookup (global limit only) ...
Benchmarking lookup (no results) ...
Benchmarking select (category limit only) ...
Benchmarking select (global and category limits) ...
Benchmarking select (global limit only) ...
Benchmarking select (no results) ...
Calculating statistics...
Formatting results...

Name                                      ips   average  deviation   median    99th %
lookup (category limit only)           1.67 M   0.60 μs   ±847.42%  0.38 μs  11.92 μs
lookup (global and category limits)    1.64 M   0.61 μs   ±935.57%  0.42 μs  11.96 μs
lookup (no results)                    1.50 M   0.67 μs   ±846.72%  0.58 μs   1.63 μs
lookup (global limit only)             1.42 M   0.71 μs   ±985.98%  0.58 μs   1.67 μs
select (category limit only)           0.50 M   1.98 μs   ±317.25%  1.79 μs   3.58 μs
select (no results)                    0.50 M   1.99 μs   ±340.78%  1.83 μs   3.58 μs
select (global and category limits)    0.50 M   2.01 μs   ±297.20%  1.88 μs   3.63 μs
select (global limit only)             0.49 M   2.03 μs   ±313.39%  1.83 μs   3.67 μs

Comparison:
lookup (category limit only)           1.67 M
lookup (global and category limits)    1.64 M - 1.02x slower +0.0109 μs
lookup (no results)                    1.50 M - 1.11x slower +0.0684 μs
lookup (global limit only)             1.42 M - 1.18x slower +0.109 μs
select (category limit only)           0.50 M - 3.32x slower +1.38 μs
select (no results)                    0.50 M - 3.33x slower +1.39 μs
select (global and category limits)    0.50 M - 3.37x slower +1.41 μs
select (global limit only)             0.49 M - 3.39x slower +1.43 μs
```
Based on that, it looks like it's going to be more performant to stick with the two lookups.
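For context, the two-lookup approach being benchmarked corresponds to roughly this shape. This is a sketch assembled from the snippets quoted in this thread; the table name and the {key, expiry} tuple layout are assumptions:

```elixir
defmodule RateLimiterSketch do
  # Hypothetical table name for this sketch; the real module uses its own.
  @table :sentry_rate_limits_sketch

  def create_table do
    :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
  end

  # An event is dropped if either its category limit or the global limit
  # is still active (the "two lookups" path from the benchmark).
  def rate_limited?(category) do
    now = System.system_time(:second)
    check_rate_limited(category, now) or check_rate_limited(:global, now)
  end

  # A key is rate limited while its stored expiry timestamp is in the future.
  defp check_rate_limited(key, now) do
    case :ets.lookup(@table, key) do
      [{^key, expires_at}] -> expires_at > now
      [] -> false
    end
  end
end
```

Note that `or` short-circuits, so when the category limit is found, the :global lookup is skipped entirely, which is what the "category limit only" cases above measure.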
I just now (sorry!) realized that the size of this ETS table is extremely contained: we'll never really go over one key per category, so even being very cautious we're not going over a few dozen rows in the ETS table. I benched with 20 rows in that table and, predictably, ets:select is even slower in comparison since it has to scan more rows. Well, better to be informed!
lib/sentry/transport/rate_limiter.ex
Outdated
@impl true
def init(nil) do
  _table = :ets.new(@table, [:named_table, :public, :set])
Would we gain anything in perf by using an ordered set here? You'd have slower inserts, but this table is 99% read-heavy, I think. Might also be good to consider read_concurrency.
I definitely should have enabled read_concurrency, and I've fixed that now. What's the thinking behind ordered set instead of just set though? Wouldn't ordered set have worse performance for lookups (log n instead of constant)?
Since I was already doing the benchmarking on lookup vs select, I also went ahead and did a quick benchmark on set vs ordered_set:
Benchmark code
```elixir
# Create a table using a set with no global limit
:ets.new(SetNoGlobal, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(SetNoGlobal, {"error", 1764026065})

# Create a table using a set with a global limit
:ets.new(SetWithGlobal, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(SetWithGlobal, {"error", 1764026065})
:ets.insert(SetWithGlobal, {:global, 1764026065})

# Create a table using an ordered set with no global limit
:ets.new(OrderedSetNoGlobal, [:named_table, :public, :ordered_set, read_concurrency: true])
:ets.insert(OrderedSetNoGlobal, {"error", 1764026065})

# Create a table using an ordered set with a global limit
:ets.new(OrderedSetWithGlobal, [:named_table, :public, :ordered_set, read_concurrency: true])
:ets.insert(OrderedSetWithGlobal, {"error", 1764026065})
:ets.insert(OrderedSetWithGlobal, {:global, 1764026065})

Benchee.run(%{
  "set (no results)" => fn ->
    :ets.lookup(SetNoGlobal, "transaction")
    :ets.lookup(SetNoGlobal, :global)
  end,
  "set (category limit only)" => fn ->
    :ets.lookup(SetNoGlobal, "error")
    # won't perform the global lookup since the category rate limit was found
  end,
  "set (global limit only)" => fn ->
    :ets.lookup(SetWithGlobal, "transaction")
    :ets.lookup(SetWithGlobal, :global)
  end,
  "set (global and category limits)" => fn ->
    :ets.lookup(SetWithGlobal, "error")
    # won't perform the global lookup since the category rate limit was found
  end,
  "ordered set (no results)" => fn ->
    :ets.lookup(OrderedSetNoGlobal, "transaction")
    :ets.lookup(OrderedSetNoGlobal, :global)
  end,
  "ordered set (category limit only)" => fn ->
    :ets.lookup(OrderedSetNoGlobal, "error")
    # won't perform the global lookup since the category rate limit was found
  end,
  "ordered set (global limit only)" => fn ->
    :ets.lookup(OrderedSetWithGlobal, "transaction")
    :ets.lookup(OrderedSetWithGlobal, :global)
  end,
  "ordered set (global and category limits)" => fn ->
    :ets.lookup(OrderedSetWithGlobal, "error")
    # won't perform the global lookup since the category rate limit was found
  end
})
```

And here are the results:
Full Benchee output
```
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.18.3
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s

Excluding outliers: false

Benchmarking ordered set (category limit only) ...
Benchmarking ordered set (global and category limits) ...
Benchmarking ordered set (global limit only) ...
Benchmarking ordered set (no results) ...
Benchmarking set (category limit only) ...
Benchmarking set (global and category limits) ...
Benchmarking set (global limit only) ...
Benchmarking set (no results) ...
Calculating statistics...
Formatting results...

Name                                           ips     average  deviation  median   99th %
set (category limit only)                   2.25 M   445.06 ns  ±1116.49%  375 ns  1250 ns
ordered set (global and category limits)    2.24 M   445.96 ns  ±1107.44%  375 ns  1250 ns
set (global and category limits)            2.23 M   449.07 ns  ±1133.50%  375 ns  1250 ns
ordered set (category limit only)           2.08 M   480.31 ns  ±1036.76%  375 ns  1417 ns
ordered set (no results)                    1.52 M   656.08 ns   ±876.48%  583 ns  1500 ns
set (no results)                            1.52 M   659.69 ns   ±962.66%  583 ns  1500 ns
ordered set (global limit only)             1.49 M   671.89 ns   ±987.77%  583 ns  1542 ns
set (global limit only)                     1.48 M   674.39 ns   ±878.87%  583 ns  1542 ns

Comparison:
set (category limit only)                   2.25 M
ordered set (global and category limits)    2.24 M - 1.00x slower +0.90 ns
set (global and category limits)            2.23 M - 1.01x slower +4.01 ns
ordered set (category limit only)           2.08 M - 1.08x slower +35.25 ns
ordered set (no results)                    1.52 M - 1.47x slower +211.02 ns
set (no results)                            1.52 M - 1.48x slower +214.64 ns
ordered set (global limit only)             1.49 M - 1.51x slower +226.83 ns
set (global limit only)                     1.48 M - 1.52x slower +229.33 ns
```
Based on that, it looks like set is maybe a tiny bit faster. It doesn't seem to be a significant difference, but I'll just stick with that default.
test/support/case.ex
Outdated
config_before = all_config()

# Clear rate limiter state before each test
if Process.whereis(Sentry.Transport.RateLimiter) do
I know this comment is going to be possibly frustrating because it will lead to a bunch of changes, but I think this will necessarily make tests (async tests in particular) more brittle. We tried really hard to keep the suite fast so I’m very partial to this 😬
Basically, I'd make these changes:
- Do not hardcode a name when starting the rate limiter GenServer. Default to __MODULE__ but make it optional.
- Do not start the rate limiter GenServer in the application supervisor if Mix.env() == :test.
- Start a fresh GenServer for every test in the rate limiter unit tests.
- In the code that calls the rate limiter, potentially use mocks or similar for testing the correct behavior.

This is "dirtier" than what you have in this PR but it's likely going to be more robust and more scalable in the future. How does this sound?
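A minimal sketch of the first suggestion (module and option names here are illustrative, not the actual Sentry code):

```elixir
defmodule RateLimiterStartSketch do
  use GenServer

  # Default the GenServer name to __MODULE__, but let callers (e.g. tests)
  # pass their own so each async test can start an isolated instance.
  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, name, name: name)
  end

  @impl true
  def init(name) do
    # One unnamed ETS table per instance (no :named_table), so parallel
    # instances started by different tests can't collide.
    table = :ets.new(name, [:public, :set, read_concurrency: true])
    {:ok, %{table: table}}
  end
end
```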
I'd love to get your thoughts on the latest version of this. I started out by exactly following what you said, but the mock part is where I wasn't as sure on direction. This is what I ended up going with:
- Don't hardcode a name. If provided, use the name from the options keyword list. If none provided, check for and use the name in the process dictionary. If still none provided, use __MODULE__.
- Don't start the rate limiter GenServer in the application supervisor if Mix.env() == :test (although I added a new config option since Mix isn't available in releases).
- Added code in Sentry.Case to start a new rate limiter GenServer and ETS table with a custom name for each test and store the table name in the process dictionary. Then I modified the RateLimiter functions to get the table name in the same order as the initializing process (keyword list, then process dictionary, and finally __MODULE__).

Like you said, this feels "dirtier", but it was the best solution I could come up with short of pulling in a mocking library. I'm very open to other ideas or suggestions.
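The resolution order described above can be sketched like this (the function name and process-dictionary key are assumptions for illustration):

```elixir
defmodule TableNameSketch do
  @default_name __MODULE__

  # Resolution order for the rate limiter name: explicit option first,
  # then the per-test name stashed in the process dictionary, then the
  # module default used in production.
  def resolve_name(opts \\ []) do
    Keyword.get(opts, :name) ||
      Process.get(:sentry_rate_limiter_name) ||
      @default_name
  end
end
```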
Description
We've been seeing OOMs in our service pods when we receive larger than normal bursts of traffic:
We tracked the increased memory usage to the Sentry.Transport.Sender processes whose mailboxes are getting backed up when the sender gets rate limited by Sentry:
The current implementation of the sender ignores Sentry's rate limit headers until a 429 status is received. At that point, each sender process sleeps until the "Retry-After" period has ended. While the process is sleeping, its mailbox continues to fill up with more events to send.
This pull request creates a new Sentry.Transport.RateLimiter backed by ETS, which stores the current rate limits for the categories specified in Sentry's rate limit response header. When the category of a message to be sent is under a rate limit, the client now drops the message, as recommended by Sentry's rate limiting docs. This should prevent pileup in the process mailboxes. And because rate limits are now tracked per message category, other categories can still get through while one is rate limited, so less data is lost going forward.

AI Disclosure: I used Claude Code for an initial pass on this work, before thoroughly reviewing and refining it manually.
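The resulting send path can be pictured like this. This is a sketch of the behavior described above, not the actual Sender code; the function names and the injected callbacks are hypothetical:

```elixir
defmodule SenderSketch do
  # Instead of sleeping on a 429 and letting the mailbox grow, the sender
  # consults the rate limiter first and drops the event when its category
  # (or the global limit) is currently rate limited.
  #
  # rate_limited? and send_fun are passed in as functions here purely to
  # keep the sketch self-contained and testable.
  def maybe_send(category, payload, rate_limited?, send_fun) do
    if rate_limited?.(category) do
      # Dropped immediately: no mailbox pileup while a limit is active.
      {:dropped, :rate_limited}
    else
      send_fun.(payload)
    end
  end
end
```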
I'm happy to make or receive any changes requested to help get this through.
Issues