fix(sinks): correct event finalization in socket sinks#25132

Open
simdugas wants to merge 14 commits into vectordotdev:master from simdugas:fix-tcp-server-reset-issue

Conversation


@simdugas simdugas commented Apr 6, 2026

Summary

This should fix an issue with the Vector TCP sink where data was lost when the server side of the TCP connection was reset or timed out.

The original code used a peek-one/send-one/advance loop: if the connection was torn down between peek and send (or between poll_ready calls during a batch), the in-flight item was discarded and its finalizers were never marked Delivered.

This PR fixes event loss by switching to a batch-collect-then-flush model with retry on reconnect:

  1. Batch collection — available input is drained into a pending_batch (capped at 1,000 items) before any I/O happens.
  2. Atomic flush — the entire batch is fed and flushed as a unit. Finalizers are only marked Delivered after a successful flush.
  3. Retry on reconnect — if the flush fails, the connection is dropped and re-established; the same batch is retried, giving at-least-once delivery semantics.
  4. Lazy connection — the socket is not opened until there is data to send, and is closed cleanly when the input stream ends.
  5. Jittered backoff — exponential backoff (capped at 5 s, with full jitter) on send failures prevents tight reconnect loops when the remote is persistently unavailable.
  6. Typed peer-shutdown error — replaced the fragile error.kind() == Other && to_string() == "ShutdownCheck::Close" string comparison with a typed PeerShutdownError struct and is_peer_shutdown_error() helper.
  7. Shutdown check at batch boundary only — the poll_ready shutdown check now fires only when events_total == 0 (start of a new batch), avoiding mid-batch aborts that the retry loop would have recovered from anyway.
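The control flow above (points 1–4) can be sketched with a simplified, synchronous mock. The real sinks are async and use Vector's sink machinery; `MockConn`, `run_sink`, and the `fail_first` failure counter here are illustrative stand-ins, not the PR's actual types:

```rust
use std::collections::VecDeque;

const MAX_PENDING_BATCH_ITEMS: usize = 1_000;

/// Illustrative stand-in for a TCP/Unix connection; the real sink
/// wraps an async bytes sink behind a reconnecting connector.
struct MockConn;

/// Drain `input` through a batch-collect-then-flush loop. `fail_first`
/// simulates that many peer resets before flushes start succeeding.
fn run_sink(mut input: VecDeque<String>, mut fail_first: u32) -> Vec<String> {
    let mut delivered = Vec::new(); // events whose finalizers were marked Delivered
    let mut pending_batch: Vec<String> = Vec::new();
    let mut conn: Option<MockConn> = None; // lazy: opened only when data is ready

    loop {
        // 1. Batch collection: drain available input before any I/O.
        while pending_batch.len() < MAX_PENDING_BATCH_ITEMS {
            match input.pop_front() {
                Some(event) => pending_batch.push(event),
                None => break,
            }
        }
        if pending_batch.is_empty() {
            break; // input ended with nothing pending: close cleanly
        }

        // 4. Lazy connection: connect (or reconnect) only now.
        let _c = conn.get_or_insert(MockConn);

        // 2. Atomic flush: the whole batch succeeds or fails as a unit.
        let flush_ok = if fail_first > 0 {
            fail_first -= 1;
            false // simulated peer reset mid-flush
        } else {
            true
        };

        if flush_ok {
            // Finalizers are marked Delivered only after a successful flush.
            delivered.extend(pending_batch.drain(..));
        } else {
            // 3. Retry on reconnect: drop the connection, keep the batch.
            conn = None;
        }
    }
    delivered
}

fn main() {
    let events: VecDeque<String> = (0..5).map(|i| format!("event-{i}")).collect();
    // Two simulated resets: every event is still delivered (at-least-once).
    let delivered = run_sink(events, 2);
    assert_eq!(delivered.len(), 5);
    println!("{delivered:?}");
}
```

Because the failed batch is retained and resent whole after reconnecting, a reset can cause duplicates downstream but never loss — the at-least-once trade-off the summary describes.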

Vector configuration

See simdugas/vector-tcp-reset-issue for a full demonstration of the issue and fix, including Vector configurations in the vector-before and vector-after folders.

How did you test this PR?

I have detailed the full testing steps with a demonstration of the issue before and after the fix in the repository simdugas/vector-tcp-reset-issue.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Related: #9040

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

Events were advanced and finalized before delivery confirmation.
Replace send_all_peekable with a peek-then-send loop: peek the next
event, send it with empty finalizers, and only advance the stream
marking EventStatus::Delivered on success. A disconnect mid-send now
leaves the event in-flight for retry rather than silently dropping it.

Also move the shutdown_check into poll_ready (before start_send) so
a peer disconnect is detected without consuming the next stream item.
Remove the now-unused sink_ext module (VecSinkExt / SendAll).
@simdugas simdugas requested a review from a team as a code owner April 6, 2026 21:49

github-actions bot commented Apr 6, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d7f945ec8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


simdugas commented Apr 7, 2026

I have read the CLA Document and I hereby sign the CLA

simdugas added 13 commits April 9, 2026 11:26
The previous per-event peek-send-advance loop dropped in-flight events
when a peer reset occurred between items. Replace it with a
collect-then-flush approach: drain available input into a pending batch,
feed+flush the whole batch atomically, and on error reconnect and retry
the same batch — no events are lost on peer reset or TCP RST.

Applies identically to both the TCP and Unix stream sinks.

Cap pending batch at MAX_PENDING_BATCH_ITEMS (1024) to bound memory
when the peer is slow or disconnected. Bundle sink and open_token into
an Option so the connection is established lazily (only when data is
ready) and torn down cleanly via RAII when the loop exits.

Applies to both TCP and Unix stream sinks.

Bump event count from 1000 to 2000 so the stream exceeds the 1024-item
pending batch cap, exercising the split-batch retry path through a
server reset.

Cast named functions to fn(usize) and annotate the connection Option
with an explicit type so the compiler can resolve the OpenToken generic
parameter without ambiguity.

Change 1024 to 1_000 to follow Rust convention for large numeric
literals, and remove the now-redundant hard-coded value from the
test comment.

Move the pending-batch cap to socket_bytes_sink where the sink lives
and re-export it as pub(crate), removing the duplicate local constants
in tcp and unix sinks.

Mirror the existing TCP reconnect test for the Unix stream sink:
bind a first listener, drain a small number of lines, drop it hard,
then assert the sink reconnects to a second listener and delivers all
remaining events without loss.

Restrict the shutdown check in poll_ready to the moment when
events_total is zero (start of a new batch).

Replace the string comparison used to detect peer shutdown errors
(error.kind() == Other && to_string() == "ShutdownCheck::Close") with
a typed PeerShutdownError struct and is_peer_shutdown_error() helper.
This is more robust and removes the magic string dependency.

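A minimal sketch of the typed-error pattern that commit describes, assuming the names from the commit message (`PeerShutdownError`, `is_peer_shutdown_error`); the exact struct layout in the PR may differ:

```rust
use std::{error::Error, fmt, io};

/// Typed marker for "the peer closed the connection", replacing the
/// old string comparison against "ShutdownCheck::Close".
#[derive(Debug)]
struct PeerShutdownError;

impl fmt::Display for PeerShutdownError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "peer closed the connection")
    }
}

impl Error for PeerShutdownError {}

/// Build an io::Error that wraps the typed marker as its source.
fn peer_shutdown_io_error() -> io::Error {
    io::Error::new(io::ErrorKind::Other, PeerShutdownError)
}

/// Detect the marker by downcasting instead of comparing strings.
fn is_peer_shutdown_error(error: &io::Error) -> bool {
    error
        .get_ref()
        .is_some_and(|inner| inner.is::<PeerShutdownError>())
}

fn main() {
    assert!(is_peer_shutdown_error(&peer_shutdown_io_error()));
    assert!(!is_peer_shutdown_error(&io::Error::new(io::ErrorKind::Other, "boom")));
    println!("typed peer-shutdown detection works");
}
```

Downcasting via `io::Error::get_ref` survives changes to the error's display text, which is exactly what the magic-string check could not do.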
Add inline comments to the TCP and Unix stream sink batch loops and
the reconnect test to make the at-least-once delivery guarantee
(whole-batch resend on reconnect) explicit for future readers.

Verify that is_peer_shutdown_error correctly identifies errors created
by peer_shutdown_io_error and rejects unrelated io::Error values.

Without a backoff the TCP/Unix stream sinks spin tightly on
reconnect when a remote endpoint is consistently refusing connections.
Introduce an exponential backoff (capped at 5 s) with full jitter on
each send failure and reset it after a successful flush, preventing
thundering-herd reconnection storms.

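A minimal sketch of capped exponential backoff with full jitter as described above. The 5 s cap comes from the commit; the 100 ms base, the helper names, and passing the RNG sample in as a parameter are illustrative assumptions:

```rust
use std::time::Duration;

const BASE_BACKOFF: Duration = Duration::from_millis(100); // assumed base
const MAX_BACKOFF: Duration = Duration::from_secs(5);      // cap from the commit

/// Exponential ceiling: base * 2^attempt, capped at MAX_BACKOFF.
fn backoff_ceiling(attempt: u32) -> Duration {
    BASE_BACKOFF
        .checked_mul(1u32.checked_shl(attempt).unwrap_or(u32::MAX))
        .unwrap_or(MAX_BACKOFF)
        .min(MAX_BACKOFF)
}

/// Full jitter: wait a uniform random fraction of the ceiling, so many
/// reconnecting clients do not wake up in lockstep. `uniform` is a
/// sample in [0, 1) drawn from the caller's RNG.
fn jittered_delay(attempt: u32, uniform: f64) -> Duration {
    backoff_ceiling(attempt).mul_f64(uniform.clamp(0.0, 1.0))
}

fn main() {
    assert_eq!(backoff_ceiling(0), Duration::from_millis(100));
    assert_eq!(backoff_ceiling(3), Duration::from_millis(800));
    assert_eq!(backoff_ceiling(30), Duration::from_secs(5)); // capped
    assert_eq!(jittered_delay(3, 0.5), Duration::from_millis(400));
    println!("backoff schedule ok");
}
```

Resetting `attempt` to zero after a successful flush, as the commit does, means a healthy connection always pays the minimum delay on its next transient failure.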
Collapse nested if-let blocks into a single let-chain expression
in the TCP and Unix stream sink shutdown paths for clarity.

simdugas commented Apr 9, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!



simdugas commented Apr 9, 2026

Check Spelling seems to be failing in other PRs as well.


simdugas commented Apr 9, 2026

@vectordotdev/vector this PR should be ready for feedback.


Labels

domain: sinks Anything related to the Vector's sinks
