
Conversation

@dnsbty (Contributor) commented Nov 13, 2025

Description

We've been seeing OOMs in our service pods when we receive larger-than-normal bursts of traffic:

[screenshot: pod memory usage spiking during traffic bursts]

We tracked the increased memory usage to the Sentry.Transport.Sender processes, whose mailboxes are getting backed up when the sender gets rate limited by Sentry:

[screenshot: memory usage attributed to Sentry.Transport.Sender process mailboxes]

The current implementation of the sender ignores Sentry's rate-limit headers until a 429 status is received. At that point, each sender process sleeps until the "Retry-After" period has ended, and while it sleeps, its mailbox continues to fill up with more events to send.

This pull request adds a new Sentry.Transport.RateLimiter backed by ETS, which stores the current rate limits for the categories specified in Sentry's rate-limit response header. While the category of a message to be sent is under a rate limit, the client now drops the message, as recommended by Sentry's rate limiting docs. This should prevent pileup in the process mailboxes. And because rate limits are now tracked per message category, other categories still get through while one is rate limited, so less data is lost going forward.
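
At a high level, the send path now does something like the following (a simplified sketch: rate_limited?/1 and get_data_category/1 are the names used in this PR, while envelope_item, envelope, and send_envelope/1 are illustrative placeholders):

# Sketch of the drop-on-rate-limit behavior described above; not the exact code.
category = Sentry.Envelope.get_data_category(envelope_item)

if Sentry.Transport.RateLimiter.rate_limited?(category) do
  # Drop the item instead of queueing it, per Sentry's rate limiting docs.
  {:error, :rate_limited}
else
  send_envelope(envelope)
end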

AI Disclosure: I used Claude Code for an initial pass on this work, before thoroughly reviewing and refining it manually.

I'm happy to make or receive any changes requested to help get this through.

@dnsbty (Contributor, Author) commented Nov 14, 2025

Per the DangerJS check, should I add a line to the changelog? I assumed that should be left to the maintainers, but I'm happy to do it if that would be helpful.

@dnsbty (Contributor, Author) commented Nov 15, 2025

We shipped this in our codebase today and you can see the results here (the change was deployed at 18:10, where you can see memory usage drop sharply):

[screenshot: memory usage dropping sharply after the 18:10 deploy]

Compare this to the same period one week earlier and you can see that the memory usage is much more stable now:

[screenshot: memory usage during the same period one week earlier]

And to make sure everything is still getting through to Sentry, these are our stats for spans from today:

[screenshot: Sentry span stats for today]

And this is for the same period a week ago:

[screenshot: Sentry span stats for the same period a week ago]

A week ago, 32.4K spans were rate limited out of 2.1M total, so ~1.54% rate limited. Today, 35.9K spans were rate limited out of 3M total, so ~1.20% rate limited. I don't think this comparison means much, though. A week ago we would have retried a lot of the events multiple times before dropping them, so fewer would have been dropped because more would have been kept in memory. And because we deployed today while memory was high, we would have lost a lot of events that were in memory but hadn't been sent yet, and the Sentry client wouldn't have reported on those. I hadn't thought about this before building the change, but it should also prevent data loss during deploys, since events won't be trapped in the process mailboxes.

@dnsbty (Contributor, Author) commented Nov 18, 2025

Now, with a couple more days' traffic, you can better see the longer-term results of the change (deployed at the point marked by the cursor/line on the memory graph):

[screenshot: multi-day memory usage graph with deploy point marked]

Everything appears to be working as expected in our application at least.

@whatyouhide (Collaborator) left a comment:

Super solid work here. I loved reading through this PR! I left some comments, but the only kind-of-major one is the one about testing. Let me know if you want to talk about it!

| Sentry.CheckIn.t()
| ClientReport.t()
| Sentry.Event.t()
| Sentry.Transaction.t()
@whatyouhide (Collaborator):

This fix is unrelated, right?

@dnsbty (Contributor, Author):

I think this was here because of a Dialyzer complaint, but now when I remove it, the warning doesn't come back. But I noticed that the "item" type in that function doesn't match the type given for "items" in Sentry.Envelope.t().

I thought about centralizing that into a Sentry.Envelope.item() type that could be reused, but decided to just do this for now to keep the update smaller. Since Dialyzer no longer seems to care, I'm happy to remove it if you'd prefer not to have it in this PR.
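
For reference, the centralized type I had in mind would be something like this (a sketch, not part of this PR; only the union members visible in the diff above are shown):

# Hypothetical shared type in Sentry.Envelope; other item types may belong here too.
@type item ::
        Sentry.CheckIn.t()
        | ClientReport.t()
        | Sentry.Event.t()
        | Sentry.Transaction.t()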

@whatyouhide (Collaborator):

Yes, let's remove it and possibly go with the Sentry.Envelope.item() type in a separate PR 👍

@dnsbty (Contributor, Author):

Done

"""
@spec rate_limited?(String.t()) :: boolean()
def rate_limited?(category) do
@whatyouhide (Collaborator):

Categories here are a fixed set, right? Maybe we could be more specific on the strings we expect here?

@dnsbty (Contributor, Author):

According to the Sentry docs, this is the list of possible categories:

  • default
  • error
  • transaction
  • monitor
  • span
  • log_item
  • security
  • attachment
  • session
  • profile
  • profile_chunk
  • replay
  • feedback
  • trace_metric
  • internal

It looks like right now the library only supports the following categories:

  • error
  • transaction
  • monitor
  • attachment
  • internal

We could enforce that on this function, but the downside I see is that it adds one more place that has to change if new categories are added in the future. If we wanted to do that, I would probably recommend modifying Sentry.Envelope.get_data_category/1 to return an atom instead of a string, and then creating a Sentry.Envelope.data_category() type representing the set of possible category atoms, like :error | :transaction | :monitor | :attachment | :internal, just to make it clearer what is supported.

That seems like it would probably be outside the scope of this PR, but I'm open to making that change if you think it would be best to do it now.
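
For illustration, that refactor might look roughly like this (a sketch only; the struct-to-category mapping below is my assumption of the natural pairing, not something taken from the library):

# Sketch of the proposed refactor; not part of this PR.
@type data_category :: :error | :transaction | :monitor | :attachment | :internal

@spec get_data_category(struct()) :: data_category()
def get_data_category(%Sentry.Event{}), do: :error
def get_data_category(%Sentry.Transaction{}), do: :transaction
def get_data_category(%Sentry.CheckIn{}), do: :monitor
def get_data_category(%Sentry.Attachment{}), do: :attachment
def get_data_category(%ClientReport{}), do: :internal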

@spec rate_limited?(String.t()) :: boolean()
def rate_limited?(category) do
  now = System.system_time(:second)
  check_rate_limited(category, now) or check_rate_limited(:global, now)
@whatyouhide (Collaborator):

I know this is really fast, but given we're doing this for every event, I wonder if a single :ets.select pass checking both categories in one go would perform better. Maybe not, but it could be interesting to benchmark.

@dnsbty (Contributor, Author) commented Nov 24, 2025:

I just set this up with Benchee. Please let me know if you see any issues with the logic of the benchmark:

Benchmark code
Mix.install([{:benchee, "~> 1.0"}])

# Create a table that will have a rate limit for errors, but no global limit
:ets.new(NoGlobalLimit, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(NoGlobalLimit, {"error", 1764026065})

# Create a second table with a global rate limit and category rate limits
:ets.new(GlobalLimit, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(GlobalLimit, {:global, 1764028122})
:ets.insert(GlobalLimit, {"error", 1764026065})

Benchee.run(
  %{
    "select (no results)" => fn ->
      :ets.select(NoGlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
      ])
    end,
    "select (category limit only)" => fn ->
      :ets.select(NoGlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
      ])
    end,
    "select (global limit only)" => fn ->
      :ets.select(GlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"transaction", :"$1"}, [], [{{"transaction", :"$1"}}]}
      ])
    end,
    "select (global and category limits)" => fn ->
      :ets.select(GlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
      ])
    end,
    "lookup (no results)" => fn ->
      :ets.lookup(NoGlobalLimit, "transaction")
      :ets.lookup(NoGlobalLimit, :global)
    end,
    "lookup (category limit only)" => fn ->
      :ets.lookup(NoGlobalLimit, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "lookup (global limit only)" => fn ->
      :ets.lookup(GlobalLimit, "error")
      :ets.lookup(GlobalLimit, :global)
    end,
    "lookup (global and category limits)" => fn ->
      :ets.lookup(GlobalLimit, "error")
      # won't perform the global lookup since the category rate limit was found
    end
  }
)

And these were the results:

Full Benchee output
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.18.3
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s
Excluding outliers: false

Benchmarking lookup (category limit only) ...
Benchmarking lookup (global and category limits) ...
Benchmarking lookup (global limit only) ...
Benchmarking lookup (no results) ...
Benchmarking select (category limit only) ...
Benchmarking select (global and category limits) ...
Benchmarking select (global limit only) ...
Benchmarking select (no results) ...
Calculating statistics...
Formatting results...

Name                                          ips        average  deviation         median         99th %
lookup (category limit only)               1.67 M        0.60 μs   ±847.42%        0.38 μs       11.92 μs
lookup (global and category limits)        1.64 M        0.61 μs   ±935.57%        0.42 μs       11.96 μs
lookup (no results)                        1.50 M        0.67 μs   ±846.72%        0.58 μs        1.63 μs
lookup (global limit only)                 1.42 M        0.71 μs   ±985.98%        0.58 μs        1.67 μs
select (category limit only)               0.50 M        1.98 μs   ±317.25%        1.79 μs        3.58 μs
select (no results)                        0.50 M        1.99 μs   ±340.78%        1.83 μs        3.58 μs
select (global and category limits)        0.50 M        2.01 μs   ±297.20%        1.88 μs        3.63 μs
select (global limit only)                 0.49 M        2.03 μs   ±313.39%        1.83 μs        3.67 μs

Comparison: 
lookup (category limit only)               1.67 M
lookup (global and category limits)        1.64 M - 1.02x slower +0.0109 μs
lookup (no results)                        1.50 M - 1.11x slower +0.0684 μs
lookup (global limit only)                 1.42 M - 1.18x slower +0.109 μs
select (category limit only)               0.50 M - 3.32x slower +1.38 μs
select (no results)                        0.50 M - 3.33x slower +1.39 μs
select (global and category limits)        0.50 M - 3.37x slower +1.41 μs
select (global limit only)                 0.49 M - 3.39x slower +1.43 μs

Based on that, it looks like it will be more performant to stick with the two lookups.
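
For context, each check_rate_limited/2 call is just one lookup plus an expiry comparison. The private helper isn't shown in this thread, but presumably it's along these lines (sketch):

defp check_rate_limited(key, now) do
  # A key is rate limited if an entry exists and its expiry is in the future.
  case :ets.lookup(@table, key) do
    [{^key, expires_at}] -> expires_at > now
    [] -> false
  end
end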

@whatyouhide (Collaborator):

I just now (sorry!) realized that the size of this ETS table is extremely contained: we'll never really go over one key per category, so even being very cautious we're not going over a few dozen rows in the table. I benched with 20 rows in that table and, predictably, :ets.select is even slower in comparison since it has to scan more rows. Well, better to be informed!


@impl true
def init(nil) do
  _table = :ets.new(@table, [:named_table, :public, :set])
@whatyouhide (Collaborator):

Would we gain anything in perf by using an ordered set here? You'd have slower inserts, but this table is 99% read-heavy, I think. It might also be good to consider read_concurrency.

@dnsbty (Contributor, Author):

I definitely should have enabled read_concurrency, and I've fixed that now. What's the thinking behind an ordered set instead of just a set, though? Wouldn't an ordered set have worse lookup performance (log n instead of constant)?

@dnsbty (Contributor, Author) commented Nov 24, 2025:

Since I was already doing the benchmarking on lookup vs select, I also went ahead and did a quick benchmark on set vs ordered_set:

Benchmark code
# Create a table using a set with no global limit
:ets.new(SetNoGlobal, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(SetNoGlobal, {"error", 1764026065})

# Create a table using a set with global limit
:ets.new(SetWithGlobal, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(SetWithGlobal, {"error", 1764026065})
:ets.insert(SetWithGlobal, {:global, 1764026065})

# Create a table using an ordered set with no global limit
:ets.new(OrderedSetNoGlobal, [:named_table, :public, :ordered_set, read_concurrency: true])
:ets.insert(OrderedSetNoGlobal, {"error", 1764026065})

# Create a table using an ordered set with a global limit
:ets.new(OrderedSetWithGlobal, [:named_table, :public, :ordered_set, read_concurrency: true])
:ets.insert(OrderedSetWithGlobal, {"error", 1764026065})
:ets.insert(OrderedSetWithGlobal, {:global, 1764026065})

Benchee.run(
  %{
    "set (no results)" => fn ->
      :ets.lookup(SetNoGlobal, "transaction")
      :ets.lookup(SetNoGlobal, :global)
    end,
    "set (category limit only)" => fn ->
      :ets.lookup(SetNoGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "set (global limit only)" => fn ->
      :ets.lookup(SetWithGlobal, "transaction")
      :ets.lookup(SetWithGlobal, :global)
    end,
    "set (global and category limits)" => fn ->
      :ets.lookup(SetWithGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "ordered set (no results)" => fn ->
      :ets.lookup(OrderedSetNoGlobal, "transaction")
      :ets.lookup(OrderedSetNoGlobal, :global)
    end,
    "ordered set (category limit only)" => fn ->
      :ets.lookup(OrderedSetNoGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "ordered set (global limit only)" => fn ->
      :ets.lookup(OrderedSetWithGlobal, "error")
      :ets.lookup(OrderedSetWithGlobal, :global)
    end,
    "ordered set (global and category limits)" => fn ->
      :ets.lookup(OrderedSetWithGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end
  }
)

And here are the results:

Full Benchee output
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.18.3
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s
Excluding outliers: false

Benchmarking ordered set (category limit only) ...
Benchmarking ordered set (global and category limits) ...
Benchmarking ordered set (global limit only) ...
Benchmarking ordered set (no results) ...
Benchmarking set (category limit only) ...
Benchmarking set (global and category limits) ...
Benchmarking set (global limit only) ...
Benchmarking set (no results) ...
Calculating statistics...
Formatting results...

Name                                               ips        average  deviation         median         99th %
set (category limit only)                       2.25 M      445.06 ns  ±1116.49%         375 ns        1250 ns
ordered set (global and category limits)        2.24 M      445.96 ns  ±1107.44%         375 ns        1250 ns
set (global and category limits)                2.23 M      449.07 ns  ±1133.50%         375 ns        1250 ns
ordered set (category limit only)               2.08 M      480.31 ns  ±1036.76%         375 ns        1417 ns
ordered set (no results)                        1.52 M      656.08 ns   ±876.48%         583 ns        1500 ns
set (no results)                                1.52 M      659.69 ns   ±962.66%         583 ns        1500 ns
ordered set (global limit only)                 1.49 M      671.89 ns   ±987.77%         583 ns        1542 ns
set (global limit only)                         1.48 M      674.39 ns   ±878.87%         583 ns        1542 ns

Comparison: 
set (category limit only)                       2.25 M
ordered set (global and category limits)        2.24 M - 1.00x slower +0.90 ns
set (global and category limits)                2.23 M - 1.01x slower +4.01 ns
ordered set (category limit only)               2.08 M - 1.08x slower +35.25 ns
ordered set (no results)                        1.52 M - 1.47x slower +211.02 ns
set (no results)                                1.52 M - 1.48x slower +214.64 ns
ordered set (global limit only)                 1.49 M - 1.51x slower +226.83 ns
set (global limit only)                         1.48 M - 1.52x slower +229.33 ns

Based on that, it looks like set is maybe a tiny bit faster. It doesn't seem to be a significant difference, but I'll just stick with that default.

config_before = all_config()

# Clear rate limiter state before each test
if Process.whereis(Sentry.Transport.RateLimiter) do
@whatyouhide (Collaborator):

I know this comment is possibly going to be frustrating because it will lead to a bunch of changes, but I think this approach will necessarily make tests (async tests in particular) more brittle. We tried really hard to keep the suite fast, so I'm very partial to this 😬

Basically, I'd make these changes:

  • Do not hardcode a name when starting the rate limiter GenServer. Default to __MODULE__ but make it optional.
  • Do not start the rate limiter GenServer in the application supervisor if Mix.env() == :test
  • Start a fresh GenServer for every test in the rate limiter unit tests.
  • In the code that calls the rate limiter, potentially use mocks or similar for testing the correct behavior.

This is "dirtier" than what you have in this PR but it's likely going to be more robust and more scalable in the future. How does this sound?

@dnsbty (Contributor, Author):

I'd love to get your thoughts on the latest version of this. I started out by exactly following what you said, but the mock part is where I wasn't as sure on direction. This is what I ended up going with:

  • Don't hardcode a name. If provided, use the name from the options keyword list. If none is provided, check for and use the name in the process dictionary. If there's still none, use __MODULE__.
  • Don't start the rate limiter GenServer in the application supervisor if Mix.env() == :test (although I added a new config option since Mix isn't available in releases)
  • Added code in Sentry.Case to start a new rate limiter GenServer and ETS table with a custom name for each test and store the table name in the process dictionary. Then I modified the RateLimiter functions to resolve the table name in the same order as the initializing process (keyword list, then process dictionary, and finally __MODULE__); see the sketch below.

Like you said, this feels "dirtier", but it was the best solution I could come up with short of pulling in a mocking library. I'm very open to other ideas or suggestions.
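
In code, the resolution order described above ends up looking roughly like this (simplified sketch; the process-dictionary key name here is illustrative):

defp table_name(opts) do
  # Explicit option first, then the per-test process dictionary entry,
  # then the module-name default used in production.
  Keyword.get(opts, :name) ||
    Process.get(:sentry_rate_limiter_table) ||
    __MODULE__
end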



Development

Successfully merging this pull request may close these issues.

More gracefully handle situations with many events & running into rate limits
