
proposal: Remote Write: Restart from segment-based savepoint#72

Open
kgeckhart wants to merge 6 commits into prometheus:main from
kgeckhart:kgeckhart/remote-write-restart-from-checkpoint

Conversation

@kgeckhart

@kgeckhart kgeckhart commented Dec 18, 2025

I am proposing a solution to allow remote write to restart from a savepoint (avoiding the term "checkpoint", since remote write already involves a checkpoint, which would be confusing). This is a long-standing issue that has seen a few attempts, with none quite being accepted. It starts with a more basic solution that can be iterated on to eventually reach at-least-once delivery.

Related to: prometheus/prometheus#8809

Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
@kgeckhart kgeckhart marked this pull request as ready for review December 18, 2025 20:14
@kgeckhart kgeckhart changed the title Remote Write: Restart from checkpoint proposal: Remote Write: Restart from checkpoint Dec 18, 2025
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
@kgeckhart kgeckhart changed the title proposal: Remote Write: Restart from checkpoint proposal: Remote Write: Restart from savepoint Jan 29, 2026
…heckpoint

Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
Member

@bwplotka bwplotka left a comment


Nice!

Thanks, this is super needed. Added some questions, but we need to play with different strategies here. The WAL-segment-based approach is a cool idea!

I'd recommend we stop talking about exact code components; it might add too much detail to the discussion (subjective opinion).

Comment on lines +29 to +31
### Pitfalls of the current solution

As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still finds they have missing data on restart.
Member


Suggested change
### Pitfalls of the current solution
As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still finds they have missing data on restart.

Probably dup?

1. Support resuming from a savepoint for each configured `remote_write` destination.
2. Taking a savepoint for a remote_write destination should not incur significant overhead.
3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
* The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s
Member


Suggested change
* The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s
* The `queue_configuration` includes fields like min/max shards and other performance tuning parameters.


## Goals

1. Support resuming from a savepoint for each configured `remote_write` destination.
Member


Some questions:

  • Do you propose it to be by default or optional?
  • What to do for users who prefer fresh data over persisting old data?
  • What to do for users who want persistence, but the WAL grew so much they will 100% fail to catch up, so it's better to drop?

Author


  1. Do you propose it to be by default or optional?

Yes, at some point in the future I think this should become the default option, but I think we'll need to evolve some default config options (and potentially introduce new ones) to make it a safe transition.

  2. What to do for users who prefer fresh data over persisting old data?
  3. What to do for users who want persistence, but the WAL grew so much they will 100% fail to catch up, so it's better to drop?

Thinking about this in terms of what users can do today to prevent these things: a user can use `remote_write.queue_config.sample_age_limit` to try to keep 2 in check, and I think the same answer applies to 3. Today you cannot modify this setting "online"; changing it causes us to dump the WAL, so the answer mostly ends up being "restart it".

After we have a segment-based savepoint, I think this setting is still the most valuable tool to prevent unsustainable WAL growth and to manage the tradeoff between freshness and completeness of data. I suggested that enabling the savepoint should come with a default value of 2 hours. Coupled with the adjustment ensuring a queue config change won't trigger a replay, a user can drop this value lower to catch up and then put it back. WDYT?


1. Support resuming from a savepoint for each configured `remote_write` destination.
2. Taking a savepoint for a remote_write destination should not incur significant overhead.
3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
Member


Do you mean?

Suggested change
3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
3. Changing the `queue_configuration` for a `remote_write` destination should not result in losing a savepoint entry.

Comment on lines +41 to +42
5. Stretch: Remote write supports at-least-once delivery of samples in the WAL.
* Note: This has appeared to be the largest challenge with any existing implementation as it can cause significant overhead.
Member


Do you mind expanding? I am missing what you mean here by at-least-once delivery. I thought we already do this ;p

Author

@kgeckhart kgeckhart Feb 18, 2026


That's fair; at-least-once here is for data that is persisted in the WAL. We don't offer this guarantee today because a restart or a queue hash change causes the WAL to be thrown away.

The initial watcher-based implementation is a lot closer but has a hole between the watcher moving to the next segment and the queue sending the data from that segment. Closing that hole would allow us to say "we guarantee data which is written to the WAL will be delivered at least once".


Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking of the lowest timestamp delivered via remote write in the savepoint, it could reduce this number (the lowest timestamp is required because the WAL supports out-of-order writes). At startup the tracked lowest timestamp would be used as a marker for where to start writing data, reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment.

### Goal 5: Stretch: Remote write supports at-least-once delivery of samples in the WAL.
Member


Still unsure what at-least-once means here. Without this section and discussion, doesn't this proposal already cover "at-least-once" if we go segment by segment? What's missing?


## Alternatives

1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach).
Member


I am lost here; should we avoid diving into too much code architecture in this proposal?

e.g. remote.WriteStorage sounds fancy, but it's literally a trivial HTTP client for remote write AFAIK. Whether we do the savepoint there or on the caller side is fine to decide during implementation.

Eventually, it feels this style might be hard to review for maintainers who haven't looked at the current Prometheus RW sending code recently. Bringing it up to high-level design (shards, queues, WAL watching) might give more chances someone else will help in review (:

1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach).
* `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once.
* `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity.
2. The savepoint should be synchronously updated when segments change.
Member


Suggested change
2. The savepoint should be synchronously updated when segments change.
2. The savepoint should be synchronously updated when segments change during WAL watching.

* `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once.
* `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity.
2. The savepoint should be synchronously updated when segments change.
* Introducing a bit of time between knowing that a segment changed to persisting it gives us more time to fully deliver the batch before we persist the change.
Member


Can we split into Pros and Cons?

The pro is a simpler implementation. Plus, given that queues have a limit (they can be full), WAL-watching-based tracking could be good enough persistence. Not saying this is the best option, but something to consider.


### Code flow

1. Adding another configurable timer to [`remote.WriteStorage.run()`](https://github.com/prometheus/prometheus/blob/f50ff0a40ad4ef24d9bb8e81a6546c8c994a924a/storage/remote/write.go#L114-L125) periodically persisting the current segments for each queue.
Member


Hmmmm, if we do it periodically anyway then relying only on WAL watching tracking might be good enough? 🤔

Member


Also... why periodic? Aren't segments only rarely rotated/finished?

Author

@kgeckhart kgeckhart Feb 18, 2026


Hmmmm, if we do it periodically anyway then relying only on WAL watching tracking might be good enough? 🤔

That was my hypothesis when looking at alternatives, which included a synchronous commit:

  • If we assume a 15-second queue delay, then syncing the savepoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed.

…tion proposal, expand more on at-least-once, alternatives pros/cons

Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
@kgeckhart kgeckhart changed the title proposal: Remote Write: Restart from savepoint proposal: Remote Write: Restart from segment-based savepoint Mar 17, 2026
