proposal: Remote Write: Restart from segment-based savepoint #72

kgeckhart wants to merge 6 commits into prometheus:main
Conversation
Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
…heckpoint Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
96370f0 to cdc82e6
bwplotka left a comment:
Nice!
Thanks, this is super needed. Added some questions, but we need to play with different strategies here. A WAL-segment-based savepoint is a cool idea!
I'd recommend we stop talking about exact code components; it might add too much detail to the discussion (subjective opinion).
> ### Pitfalls of the current solution
>
> As mentioned in the why, this behavior is often confusing to users who know a WAL is in use but still find they have missing data on restart.

Probably a dup?
> 1. Support resuming from a savepoint for each configured `remote_write` destination.
> 2. Taking a savepoint for a remote_write destination should not incur significant overhead.
> 3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
>    * The `queue_configuration` includes fields like min/max shards and other performance tuning parameter.s

Suggested change: "…performance tuning parameter.s" → "…performance tuning parameters."
> ## Goals
>
> 1. Support resuming from a savepoint for each configured `remote_write` destination.
Some questions:
1. Do you propose it to be the default or optional?
2. What to do for users who prefer fresh data over persisting old data?
3. What to do for users who want persistence, but the WAL has grown so much they will 100% fail to catch up, so it's better to drop?
> Do you propose it to be by default or optional?

Yes, at some point in the future I think this should become the default, but we'll need to evolve some default config options (and potentially introduce new ones) to make it a safe transition.
> What to do for users who prefer fresh data over persisting old data?
> What to do for users who want persistence, but WAL grown so much they will 100% fail to catch up, so it's better to drop?

Thinking about this in terms of what users can do today to prevent these things: a user can use `remote_write.queue_config.sample_age_limit` to try to keep the second in check, and I think it's the same answer for the third. Today you cannot modify this setting "online"; changing it causes us to dump the WAL, so the answer mostly ends up being "restart it".
After we have a segment-based savepoint, I think this setting is still the most valuable tool to prevent unsustainable WAL growth and manage the tradeoff between freshness and completeness of data. I suggested that enabling the savepoint should come with a default value of 2 hours. Coupled with the adjustment ensuring a queue config change won't trigger a replay, a user can drop this value lower to catch up and then put it back. WDYT?
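For illustration, the existing knob looks like this in `prometheus.yml` (the `2h` value mirrors the suggested default above; the endpoint URL is a placeholder):

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write
    queue_config:
      # Samples older than this are dropped rather than sent; temporarily
      # lowering it lets a backlogged instance catch up, at the cost of
      # completeness. 0s (the current default) disables the limit.
      sample_age_limit: 2h
```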
> 1. Support resuming from a savepoint for each configured `remote_write` destination.
> 2. Taking a savepoint for a remote_write destination should not incur significant overhead.
> 3. Changing the `queue_configuration` for a `remote_write` destination should not result in a new savepoint entry.
Do you mean?

Suggested change: "should not result in a new savepoint entry" → "should not result in losing a savepoint entry"
> 5. Stretch: Remote write supports at-least-once delivery of samples in the WAL.
>    * Note: This has appeared to be the largest challenge with any existing implementation as it can cause significant overhead.
Do you mind expanding? I'm missing what you mean here by at-least-once delivery. I thought we already do this ;p
That's fair; at-least-once here is for data that is persisted in the WAL. We don't offer this guarantee today because a restart or a queue hash change causes the WAL to be thrown away.
The initial watcher-based implementation is a lot closer, but it has a hole between the watcher moving to the next segment and the queue sending the data from that segment. Closing that hole would allow us to say "we guarantee data which is written to the WAL will be delivered at least once".
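The hole can be made concrete with a toy sketch (names are illustrative, not actual Prometheus components): the watcher may have advanced to segment N+1 while a batch read from segment N is still waiting to be sent, so persisting the watcher's position alone would skip that batch on restart. One way to close the hole is to hold the savepoint back to the oldest segment with unacknowledged data:

```go
package main

import "fmt"

// batch models data read from the WAL that is queued for remote write.
type batch struct {
	segment int // WAL segment this batch was read from
}

// safeSavepoint returns the segment the savepoint may advance to: the
// watcher's position, unless an unacknowledged batch from an older
// segment is still pending delivery.
func safeSavepoint(watcherSegment int, pending []batch) int {
	sp := watcherSegment
	for _, b := range pending {
		if b.segment < sp {
			sp = b.segment // hold the savepoint back until this batch is acked
		}
	}
	return sp
}

func main() {
	// A batch from segment 7 is unacknowledged, so the savepoint must not
	// advance to the watcher's segment 8 yet.
	fmt.Println(safeSavepoint(8, []batch{{segment: 7}})) // 7
	// Once everything is acknowledged, the watcher position is safe.
	fmt.Println(safeSavepoint(8, nil)) // 8
}
```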
> Replaying a whole segment can still result in a fair amount of duplicated data on startup. If we added tracking of the lowest timestamp delivered via remote write in the savepoint, it could reduce this number (the lowest timestamp is required because the WAL supports out-of-order writes). At startup the tracked lowest timestamp would be used as a marker for where to start writing data, reducing the amount of duplicated data replayed. At worst it would start from the beginning of the segment.
>
> ### Goal 5: Stretch: Remote write supports at-least-once delivery of samples in the WAL.
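A minimal sketch of the timestamp marker idea. The `Savepoint` shape is hypothetical, not from the Prometheus code base, and it assumes the tracked lowest timestamp acts as a low-water mark: everything older is known to have been delivered, so replay can start there instead of at the segment start.

```go
package main

import "fmt"

// Savepoint is a hypothetical per-destination record.
type Savepoint struct {
	Segment         int   // WAL segment to resume reading from
	LowestTimestamp int64 // low-water mark (ms) for delivered samples in that segment
}

type Sample struct {
	Timestamp int64
	Value     float64
}

// replayFilter drops samples below the low-water mark. Because the WAL
// supports out-of-order writes, a single "last delivered" timestamp would
// not be safe; the lowest delivered timestamp is the conservative marker,
// so some duplicates above it may still be re-sent.
func replayFilter(sp Savepoint, segment []Sample) []Sample {
	var out []Sample
	for _, s := range segment {
		if s.Timestamp >= sp.LowestTimestamp {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	sp := Savepoint{Segment: 42, LowestTimestamp: 1000}
	seg := []Sample{{900, 1}, {1000, 2}, {1100, 3}}
	fmt.Println(len(replayFilter(sp, seg))) // 2: the sample at t=900 is not replayed
}
```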
Still unsure what at-least-once means. Without this section and discussion, doesn't this proposal cover "at-least-once" already, if we go segment by segment? What's missing?
> ## Alternatives
>
> 1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach).
I am lost here; should we avoid diving into too much code architecture in this proposal?
E.g. `remote.WriteStorage` sounds fancy, but it's literally a trivial HTTP client for remote write AFAIK. Whether we save the savepoint there or on the caller side is fine to decide during implementation.
Eventually it feels this style might be hard to review for maintainers who haven't looked at the current Prometheus RW sending code recently. Bringing it up to high-level design (shards, queues, WAL watching) might give more chances someone else will help in review (:
> 1. `remote.QueueManager` should own syncing its own savepoint (most early implementations took this approach).
>    * `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once.
>    * `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity.
> 2. The savepoint should be synchronously updated when segments change.
Suggested change: "synchronously updated when segments change" → "synchronously updated when segments change during WAL watching"
>    * `remote.QueueManager` already has a lot of responsibilities and will take on more for at-least-once.
>    * `remote.WriteStorage` has reasonable hook points to run this logic without adding a lot more complexity.
> 2. The savepoint should be synchronously updated when segments change.
>    * Introducing a bit of time between knowing that a segment changed and persisting it gives us more time to fully deliver the batch before we persist the change.
Can we split this into pros and cons?
The pro is a simpler implementation. Plus, given that queues have a limit (they can be full), WAL-watching-based tracking could be good enough persistence. Not saying this is the best option, but something to consider.
> ### Code flow
>
> 1. Adding another configurable timer to [`remote.WriteStorage.run()`](https://github.com/prometheus/prometheus/blob/f50ff0a40ad4ef24d9bb8e81a6546c8c994a924a/storage/remote/write.go#L114-L125) periodically persisting the current segments for each queue.
Hmmmm, if we do it periodically anyway then relying only on WAL watching tracking might be good enough? 🤔
Also... why periodic? Isn't a segment rarely rotated/finished?
> Hmmmm, if we do it periodically anyway then relying only on WAL watching tracking might be good enough? 🤔

That was my hypothesis when looking at alternatives, which included a synchronous commit.
- If we assume a 15-second queue delay, then syncing the savepoint every 30 seconds gives a lot of room for the segment to be fully processed before being committed.
…tion proposal, expand more on at-least-once, alternatives pros/cons Signed-off-by: Kyle Eckhart <kgeckhart@users.noreply.github.com>
I am proposing a solution to allow remote write to restart from a savepoint (avoiding the term "checkpoint", as there's already a checkpoint involved in remote write, which does not help). This is a long-standing issue that has seen a few attempts, with none quite being accepted. It includes starting with a more basic solution which can be iterated on to eventually get to at-least-once delivery.
Related to: prometheus/prometheus#8809