
Conversation

@dnsbty (Contributor) commented Nov 13, 2025

Description

We've been seeing OOMs in our service pods when we receive larger-than-normal bursts of traffic:

[screenshot: pod memory usage spiking during traffic bursts]

We tracked the increased memory usage to the Sentry.Transport.Sender processes, whose mailboxes are getting backed up when the sender gets rate limited by Sentry:

[screenshot: memory usage attributed to Sentry.Transport.Sender process mailboxes]

The current implementation of the sender ignores Sentry's rate-limit headers until a 429 status is received. At that point, each sender process sleeps until the "Retry-After" period has ended, and while it sleeps, its mailbox continues to fill up with more events to send.

This pull request adds a new Sentry.Transport.RateLimiter backed by ETS, which stores the current rate limits for the categories specified in Sentry's rate-limit response header. While the category of a message to be sent is under a rate limit, the client now drops the message, as recommended by Sentry's rate limiting docs. This should prevent pileup in the process mailboxes. And because rate limits are now tracked per message category, other categories still get through while one is rate limited, so less data is lost going forward.
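
At a high level, the send path now does something like the following (a simplified sketch: rate_limited?/1 and get_data_category/1 are the names used in this PR, while envelope_item, envelope, and send_envelope/1 are illustrative placeholders):

# Sketch of the drop-on-rate-limit behavior described above; not the exact code.
category = Sentry.Envelope.get_data_category(envelope_item)

if Sentry.Transport.RateLimiter.rate_limited?(category) do
  # Drop the item instead of queueing it, per Sentry's rate limiting docs.
  {:error, :rate_limited}
else
  send_envelope(envelope)
end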

AI Disclosure: I used Claude Code for an initial pass on this work, before thoroughly reviewing and refining it manually.

I'm happy to make or receive any changes requested to help get this through.

@dnsbty (Contributor, Author) commented Nov 14, 2025

Per the DangerJS check, should I add a line to the changelog? I assumed that should be left to the maintainers, but I'm happy to do it if that would be helpful.

@dnsbty (Contributor, Author) commented Nov 15, 2025

We shipped this in our codebase today and you can see the results here (the change was deployed at 18:10, where you can see memory usage drop sharply):

[screenshot: memory usage dropping sharply after the 18:10 deploy]

Compare this to the same period one week earlier and you can see that the memory usage is much more stable now:

[screenshot: memory usage during the same period one week earlier]

And to make sure everything is still getting through to Sentry, these are our stats for spans from today:

[screenshot: Sentry span stats for today]

And this is for the same period a week ago:

[screenshot: Sentry span stats for the same period a week ago]

A week ago, 32.4K spans were rate limited out of 2.1M total, so ~1.54% rate limited. Today, 35.9K spans were rate limited out of 3M total, so ~1.20% rate limited. I don't think this comparison means much, though. A week ago we would have retried a lot of the events multiple times before dropping them, so fewer would have been dropped because more would have been kept in memory. And because we deployed today while memory was high, we would have lost a lot of events that were in memory but hadn't been sent yet, and the Sentry client wouldn't have reported on those. I hadn't thought about this before building the change, but it should also prevent data loss during deploys, since events won't be trapped in the process mailboxes.

@dnsbty (Contributor, Author) commented Nov 18, 2025

Now, with a couple more days' traffic, you can better see the longer-term results of the change (deployed at the point marked by the cursor/line on the memory graph):

[screenshot: multi-day memory usage graph with deploy point marked]

Everything appears to be working as expected in our application at least.

@whatyouhide (Collaborator) left a comment:

Super solid work here. I loved reading through this PR! I left some comments, but the only kind-of-major one is the one about testing. Let me know if you want to talk about it!

| Sentry.CheckIn.t()
| ClientReport.t()
| Sentry.Event.t()
| Sentry.Transaction.t()
@whatyouhide (Collaborator):

This fix is unrelated, right?

@dnsbty (Contributor, Author):

I think this was here because of a Dialyzer complaint, but now when I remove it, the warning doesn't come back. But I noticed that the "item" type in that function doesn't match the type given for "items" in Sentry.Envelope.t().

I thought about centralizing that into a Sentry.Envelope.item() type that could be reused, but decided to just do this for now to keep the update smaller. Since Dialyzer no longer seems to care, I'm happy to remove it if you'd prefer not to have it in this PR.
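
For reference, the centralized type I had in mind would be something like this (a sketch, not part of this PR; only the union members visible in the diff above are shown):

# Hypothetical shared type in Sentry.Envelope; other item types may belong here too.
@type item ::
        Sentry.CheckIn.t()
        | ClientReport.t()
        | Sentry.Event.t()
        | Sentry.Transaction.t()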

@whatyouhide (Collaborator):

Yes, let's remove it and possibly go with the Sentry.Envelope.item() type in a separate PR 👍

@dnsbty (Contributor, Author):

Done

"""
@spec rate_limited?(String.t()) :: boolean()
def rate_limited?(category) do
@whatyouhide (Collaborator):

Categories here are a fixed set, right? Maybe we could be more specific on the strings we expect here?

@dnsbty (Contributor, Author):

According to the Sentry docs, this is the list of possible categories:

  • default
  • error
  • transaction
  • monitor
  • span
  • log_item
  • security
  • attachment
  • session
  • profile
  • profile_chunk
  • replay
  • feedback
  • trace_metric
  • internal

It looks like right now the library only supports the following categories:

  • error
  • transaction
  • monitor
  • attachment
  • internal

We could enforce that on this function, but the downside I see is that it adds one more place that has to change if new categories are added in the future. If we wanted to do that, I would probably recommend modifying Sentry.Envelope.get_data_category/1 to return an atom instead of a string, and then creating a Sentry.Envelope.data_category() type representing the set of possible category atoms, like :error | :transaction | :monitor | :attachment | :internal, just to make it clearer what is supported.

That seems like it would probably be outside the scope of this PR, but I'm open to making that change if you think it would be best to do it now.
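
For illustration, that refactor might look roughly like this (a sketch only; the struct-to-category mapping below is my assumption of the natural pairing, not something taken from the library):

# Sketch of the proposed refactor; not part of this PR.
@type data_category :: :error | :transaction | :monitor | :attachment | :internal

@spec get_data_category(struct()) :: data_category()
def get_data_category(%Sentry.Event{}), do: :error
def get_data_category(%Sentry.Transaction{}), do: :transaction
def get_data_category(%Sentry.CheckIn{}), do: :monitor
def get_data_category(%Sentry.Attachment{}), do: :attachment
def get_data_category(%ClientReport{}), do: :internal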

@spec rate_limited?(String.t()) :: boolean()
def rate_limited?(category) do
  now = System.system_time(:second)
  check_rate_limited(category, now) or check_rate_limited(:global, now)
@whatyouhide (Collaborator):

I know this is really fast, but given we're doing this for every event, I wonder if a single :ets.select pass checking both categories in one go would perform better. Maybe not, but it could be interesting to benchmark.

@dnsbty (Contributor, Author) commented Nov 24, 2025:

I just set this up with Benchee. Please let me know if you see any issues with the logic of the benchmark:

Benchmark code
Mix.install([{:benchee, "~> 1.0"}])

# Create a table that will have a rate limit for errors, but no global limit
:ets.new(NoGlobalLimit, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(NoGlobalLimit, {"error", 1764026065})

# Create a second table with a global rate limit and category rate limits
:ets.new(GlobalLimit, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(GlobalLimit, {:global, 1764028122})
:ets.insert(GlobalLimit, {"error", 1764026065})

Benchee.run(
  %{
    "select (no results)" => fn ->
      :ets.select(NoGlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
      ])
    end,
    "select (category limit only)" => fn ->
      :ets.select(NoGlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
      ])
    end,
    "select (global limit only)" => fn ->
      :ets.select(GlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"transaction", :"$1"}, [], [{{"transaction", :"$1"}}]}
      ])
    end,
    "select (global and category limits)" => fn ->
      :ets.select(GlobalLimit, [
        {{:global, :"$1"}, [], [{{:global, :"$1"}}]},
        {{"error", :"$1"}, [], [{{"error", :"$1"}}]}
      ])
    end,
    "lookup (no results)" => fn ->
      :ets.lookup(NoGlobalLimit, "transaction")
      :ets.lookup(NoGlobalLimit, :global)
    end,
    "lookup (category limit only)" => fn ->
      :ets.lookup(NoGlobalLimit, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "lookup (global limit only)" => fn ->
      :ets.lookup(GlobalLimit, "error")
      :ets.lookup(GlobalLimit, :global)
    end,
    "lookup (global and category limits)" => fn ->
      :ets.lookup(GlobalLimit, "error")
      # won't perform the global lookup since the category rate limit was found
    end
  }
)

And these were the results:

Full Benchee output
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.18.3
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s
Excluding outliers: false

Benchmarking lookup (category limit only) ...
Benchmarking lookup (global and category limits) ...
Benchmarking lookup (global limit only) ...
Benchmarking lookup (no results) ...
Benchmarking select (category limit only) ...
Benchmarking select (global and category limits) ...
Benchmarking select (global limit only) ...
Benchmarking select (no results) ...
Calculating statistics...
Formatting results...

Name                                          ips        average  deviation         median         99th %
lookup (category limit only)               1.67 M        0.60 μs   ±847.42%        0.38 μs       11.92 μs
lookup (global and category limits)        1.64 M        0.61 μs   ±935.57%        0.42 μs       11.96 μs
lookup (no results)                        1.50 M        0.67 μs   ±846.72%        0.58 μs        1.63 μs
lookup (global limit only)                 1.42 M        0.71 μs   ±985.98%        0.58 μs        1.67 μs
select (category limit only)               0.50 M        1.98 μs   ±317.25%        1.79 μs        3.58 μs
select (no results)                        0.50 M        1.99 μs   ±340.78%        1.83 μs        3.58 μs
select (global and category limits)        0.50 M        2.01 μs   ±297.20%        1.88 μs        3.63 μs
select (global limit only)                 0.49 M        2.03 μs   ±313.39%        1.83 μs        3.67 μs

Comparison: 
lookup (category limit only)               1.67 M
lookup (global and category limits)        1.64 M - 1.02x slower +0.0109 μs
lookup (no results)                        1.50 M - 1.11x slower +0.0684 μs
lookup (global limit only)                 1.42 M - 1.18x slower +0.109 μs
select (category limit only)               0.50 M - 3.32x slower +1.38 μs
select (no results)                        0.50 M - 3.33x slower +1.39 μs
select (global and category limits)        0.50 M - 3.37x slower +1.41 μs
select (global limit only)                 0.49 M - 3.39x slower +1.43 μs

Based on that, it looks like it will be more performant to stick with the two lookups.
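
For context, each check_rate_limited/2 call is just one lookup plus an expiry comparison. The private helper isn't shown in this thread, but presumably it's along these lines (sketch):

defp check_rate_limited(key, now) do
  # A key is rate limited if an entry exists and its expiry is in the future.
  case :ets.lookup(@table, key) do
    [{^key, expires_at}] -> expires_at > now
    [] -> false
  end
end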

@whatyouhide (Collaborator):

I just now (sorry!) realized that the size of this ETS table is extremely contained: we'll never really go over one key per category, so even being very cautious we're not going over a few dozen rows in the table. I benched with 20 rows in that table and, predictably, :ets.select is even slower in comparison since it has to scan more rows. Well, better to be informed!


@impl true
def init(nil) do
  _table = :ets.new(@table, [:named_table, :public, :set])
@whatyouhide (Collaborator):

Would we gain anything in perf by using an ordered set here? You'd have slower inserts, but this table is 99% read-heavy, I think. It might also be good to consider read_concurrency.

@dnsbty (Contributor, Author):

I definitely should have enabled read_concurrency, and I've fixed that now. What's the thinking behind an ordered set instead of just a set, though? Wouldn't an ordered set have worse lookup performance (log n instead of constant)?

@dnsbty (Contributor, Author) commented Nov 24, 2025:

Since I was already doing the benchmarking on lookup vs select, I also went ahead and did a quick benchmark on set vs ordered_set:

Benchmark code
# Create a table using a set with no global limit
:ets.new(SetNoGlobal, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(SetNoGlobal, {"error", 1764026065})

# Create a table using a set with global limit
:ets.new(SetWithGlobal, [:named_table, :public, :set, read_concurrency: true])
:ets.insert(SetWithGlobal, {"error", 1764026065})
:ets.insert(SetWithGlobal, {:global, 1764026065})

# Create a table using an ordered set with no global limit
:ets.new(OrderedSetNoGlobal, [:named_table, :public, :ordered_set, read_concurrency: true])
:ets.insert(OrderedSetNoGlobal, {"error", 1764026065})

# Create a table using an ordered set with a global limit
:ets.new(OrderedSetWithGlobal, [:named_table, :public, :ordered_set, read_concurrency: true])
:ets.insert(OrderedSetWithGlobal, {"error", 1764026065})
:ets.insert(OrderedSetWithGlobal, {:global, 1764026065})

Benchee.run(
  %{
    "set (no results)" => fn ->
      :ets.lookup(SetNoGlobal, "transaction")
      :ets.lookup(SetNoGlobal, :global)
    end,
    "set (category limit only)" => fn ->
      :ets.lookup(SetNoGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "set (global limit only)" => fn ->
      :ets.lookup(SetWithGlobal, "transaction")
      :ets.lookup(SetWithGlobal, :global)
    end,
    "set (global and category limits)" => fn ->
      :ets.lookup(SetWithGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "ordered set (no results)" => fn ->
      :ets.lookup(OrderedSetNoGlobal, "transaction")
      :ets.lookup(OrderedSetNoGlobal, :global)
    end,
    "ordered set (category limit only)" => fn ->
      :ets.lookup(OrderedSetNoGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end,
    "ordered set (global limit only)" => fn ->
      :ets.lookup(OrderedSetWithGlobal, "error")
      :ets.lookup(OrderedSetWithGlobal, :global)
    end,
    "ordered set (global and category limits)" => fn ->
      :ets.lookup(OrderedSetWithGlobal, "error")
      # won't perform the global lookup since the category rate limit was found
    end
  }
)

And here are the results:

Full Benchee output
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.18.3
Erlang 27
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 56 s
Excluding outliers: false

Benchmarking ordered set (category limit only) ...
Benchmarking ordered set (global and category limits) ...
Benchmarking ordered set (global limit only) ...
Benchmarking ordered set (no results) ...
Benchmarking set (category limit only) ...
Benchmarking set (global and category limits) ...
Benchmarking set (global limit only) ...
Benchmarking set (no results) ...
Calculating statistics...
Formatting results...

Name                                               ips        average  deviation         median         99th %
set (category limit only)                       2.25 M      445.06 ns  ±1116.49%         375 ns        1250 ns
ordered set (global and category limits)        2.24 M      445.96 ns  ±1107.44%         375 ns        1250 ns
set (global and category limits)                2.23 M      449.07 ns  ±1133.50%         375 ns        1250 ns
ordered set (category limit only)               2.08 M      480.31 ns  ±1036.76%         375 ns        1417 ns
ordered set (no results)                        1.52 M      656.08 ns   ±876.48%         583 ns        1500 ns
set (no results)                                1.52 M      659.69 ns   ±962.66%         583 ns        1500 ns
ordered set (global limit only)                 1.49 M      671.89 ns   ±987.77%         583 ns        1542 ns
set (global limit only)                         1.48 M      674.39 ns   ±878.87%         583 ns        1542 ns

Comparison: 
set (category limit only)                       2.25 M
ordered set (global and category limits)        2.24 M - 1.00x slower +0.90 ns
set (global and category limits)                2.23 M - 1.01x slower +4.01 ns
ordered set (category limit only)               2.08 M - 1.08x slower +35.25 ns
ordered set (no results)                        1.52 M - 1.47x slower +211.02 ns
set (no results)                                1.52 M - 1.48x slower +214.64 ns
ordered set (global limit only)                 1.49 M - 1.51x slower +226.83 ns
set (global limit only)                         1.48 M - 1.52x slower +229.33 ns

Based on that, it looks like set is maybe a tiny bit faster. It doesn't seem to be a significant difference, but I'll just stick with that default.

config_before = all_config()

# Clear rate limiter state before each test
if Process.whereis(Sentry.Transport.RateLimiter) do
@whatyouhide (Collaborator):

I know this comment is possibly going to be frustrating because it will lead to a bunch of changes, but I think this approach will necessarily make tests (async tests in particular) more brittle. We tried really hard to keep the suite fast, so I'm very partial to this 😬

Basically, I'd make these changes:

  • Do not hardcode a name when starting the rate limiter GenServer. Default to __MODULE__ but make it optional.
  • Do not start the rate limiter GenServer in the application supervisor if Mix.env() == :test
  • Start a fresh GenServer for every test in the rate limiter unit tests.
  • In the code that calls the rate limiter, potentially use mocks or similar for testing the correct behavior.

This is "dirtier" than what you have in this PR but it's likely going to be more robust and more scalable in the future. How does this sound?

@dnsbty (Contributor, Author):

I'd love to get your thoughts on the latest version of this. I started out by exactly following what you said, but the mock part is where I wasn't as sure on direction. This is what I ended up going with:

  • Don't hardcode a name. If provided, use the name from the options keyword list. If none is provided, check for and use the name in the process dictionary. If there's still none, use __MODULE__.
  • Don't start the rate limiter GenServer in the application supervisor if Mix.env() == :test (although I added a new config option since Mix isn't available in releases)
  • Added code in Sentry.Case to start a new rate limiter GenServer and ETS table with a custom name for each test and store the table name in the process dictionary. Then I modified the RateLimiter functions to resolve the table name in the same order as the initializing process (keyword list, then process dictionary, and finally __MODULE__); see the sketch below.

Like you said, this feels "dirtier", but it was the best solution I could come up with short of pulling in a mocking library. I'm very open to other ideas or suggestions.
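
In code, the resolution order described above ends up looking roughly like this (simplified sketch; the process-dictionary key name here is illustrative):

defp table_name(opts) do
  # Explicit option first, then the per-test process dictionary entry,
  # then the module-name default used in production.
  Keyword.get(opts, :name) ||
    Process.get(:sentry_rate_limiter_table) ||
    __MODULE__
end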



Development

Successfully merging this pull request may close these issues.

More gracefully handle situations with many events & running into rate limits
