
proposal: Remote Write: Restart from segment-based savepoint#72

Open
kgeckhart wants to merge 6 commits into prometheus:main from
kgeckhart:kgeckhart/remote-write-restart-from-checkpoint

Conversation

@kgeckhart

@kgeckhart kgeckhart commented Dec 18, 2025

I am proposing a solution to allow remote write to restart from a savepoint (avoiding the term "checkpoint", since remote write already involves a checkpoint, which would be confusing). This is a long-standing issue that has seen a few attempts, with none quite being accepted. It starts with a more basic solution that can be iterated on to eventually reach at-least-once delivery.

Related to: prometheus/prometheus#8809

Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
@kgeckhart kgeckhart marked this pull request as ready for review December 18, 2025 20:14
@kgeckhart kgeckhart changed the title Remote Write: Restart from checkpoint proposal: Remote Write: Restart from checkpoint Dec 18, 2025
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
@kgeckhart kgeckhart changed the title proposal: Remote Write: Restart from checkpoint proposal: Remote Write: Restart from savepoint Jan 29, 2026
…heckpoint

Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
Member

@bwplotka bwplotka left a comment


Nice!

Thanks, this is super needed. Added some questions, but we need to play with different strategies here. The WAL-segment-based approach is a cool idea!

I'd recommend we stop talking about exact code components; it might add too much detail to the discussion (subjective opinion).

Comment on lines +29 to +31
### Pitfalls of the current solution

As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still finds they have missing data on restart.
Member


Suggested change
### Pitfalls of the current solution
As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still finds they have missing data on restart.

Probably dup?

1. Support resuming from a savepoint for each configured `remote_write` destination.
2. Taking a savepoint for a remote_write destination should not incur significant overhead.
3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
* The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s
Member


Suggested change
* The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s
* The `queue_configuration` includes fields like min/max shards and other performance tuning parameters.


## Goals

1. Support resuming from a savepoint for each configured `remote_write` destination.
Member


Some questions:

  • Do you propose it to be by default or optional?
  • What to do for users who prefer fresh data over persisting old data?
  • What to do for users who want persistence, but the WAL grew so much they will 100% fail to catch up, so it's better to drop?

Author


  1. Do you propose it to be by default or optional?

Yes, at some point in the future I think this should become the default option, but I think we'll need to evolve some default config options (and potentially introduce new ones) to make it a safe transition.

  2. What to do for users who prefer fresh data over persisting old data?
  3. What to do for users who want persistence, but the WAL grew so much they will 100% fail to catch up, so it's better to drop?

Thinking about this in terms of what users can do today to prevent these things: a user can use `remote_write.queue_config.sample_age_limit` to try to keep 2 in check, and I think the same answer applies to 3. Today you cannot modify this setting "online"; changing it causes us to dump the WAL, so the answer mostly ends up being "restart it".

After we have a segment-based savepoint, I think this setting is still the most valuable tool to prevent unsustainable WAL growth and to manage the tradeoff between freshness and completeness of data. I suggested that enabling the savepoint should come with a default value of 2 hours. Coupled with the adjustment ensuring a queue config change won't trigger a replay, a user can drop this value lower to catch up and then put it back. WDYT?


1. Support resuming from a savepoint for each configured `remote_write` destination.
2. Taking a savepoint for a remote_write destination should not incur significant overhead.
3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
Member


Do you mean?

Suggested change
3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
3. Changing the `queue_configuration` for a `remote_write` destination should not result in losing a savepoint entry.

Comment on lines +41 to +42
5. Stretch: Remote write supports at-least-once delivery of samples in the WAL.
* Note: This has appeared to be the largest challenge with any existing implementation as it can cause significant overhead.
Member


Do you mind expanding? I am missing what you mean here by at-least-once delivery. I thought we already do this ;p

Author

@kgeckhart kgeckhart Feb 18, 2026


That's fair; at-least-once here is for data that is persisted in the WAL. We don't offer this guarantee today because a restart or a queue hash change causes the WAL to be thrown away.

The initial watcher-based implementation is a lot closer but has a hole between the watcher moving to the next segment and the queue sending the data from that segment. Closing that hole would allow us to say "we guarantee data which is written to the WAL will be delivered at least once".


Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking of the lowest timestamp delivered via remote write in the savepoint, it could reduce this number (the lowest timestamp is required because the WAL supports out-of-order writes). At startup the tracked lowest timestamp would be used as a marker for where to start writing data, reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment.

### Goal 5: Stretch: Remote write supports at-least-once delivery of samples in the WAL.
Member


Still unsure what at-least-once means here. Without this section and discussion, doesn't this proposal already cover "at-least-once" if we go segment by segment? What's missing?


## Alternatives

1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach).
Member


I am lost here; should we avoid diving into too much code architecture in this proposal?

e.g. remote.WriteStorage sounds fancy, but it's literally a trivial HTTP client for remote write AFAIK. Whether we do the savepoint there or on the caller side is fine to decide during implementation.

Eventually, it feels this style might be hard to review for maintainers who haven't looked at the current Prometheus RW sending code recently. Bringing it up to high-level design (shards, queues, WAL watching) might give more chances someone else will help in review (:

1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach).
* `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once.
* `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity.
2. The savepoint should be synchronously updated when segments change.
Member


Suggested change
2. The savepoint should be synchronously updated when segments change.
2. The savepoint should be synchronously updated when segments change during WAL watching.

* `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once.
* `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity.
2. The savepoint should be synchronously updated when segments change.
* Introducing a bit of time between knowing that a segment changed to persisting it gives us more time to fully deliver the batch before we persist the change.
Member


Can we split into Pros and Cons?

The pro is a simpler implementation. Plus, given that queues have a limit (they can be full), WAL-watching-based tracking could be good enough persistence. Not saying this is the best option, but something to consider.


### Code flow

1. Adding another configurable timer to [`remote.WriteStorage.run()`](https://github.com/prometheus/prometheus/blob/f50ff0a40ad4ef24d9bb8e81a6546c8c994a924a/storage/remote/write.go#L114-L125) periodically persisting the current segments for each queue.
Member


Hmmmm, if we do it periodically anyway then relying only on WAL watching tracking might be good enough? 🤔

Member


Also... why periodic? Aren't segments only rarely rotated/finished?

Author

@kgeckhart kgeckhart Feb 18, 2026


Hmmmm, if we do it periodically anyway then relying only on WAL watching tracking might be good enough? 🤔

That was my hypothesis when looking at alternatives, which included a synchronous commit:

  • If we assume a 15-second queue delay, then syncing the savepoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed.

…tion proposal, expand more on at-least-once, alternatives pros/cons

Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
@kgeckhart kgeckhart changed the title proposal: Remote Write: Restart from savepoint proposal: Remote Write: Restart from segment-based savepoint Mar 17, 2026
