feat(connectors): retry transient Doris Stream Load failures in-request #3574

Open
ryankert01 wants to merge 1 commit into apache:master from ryankert01:feat/doris-sink-in-request-retry

Conversation

@ryankert01

Member

Which issue does this PR address?

Relates to #3215

Rationale

Follow-up to the Doris sink (#3215). The connector already classified transient Stream Load outcomes as retryable but had no retry path, so under the runtime's at-most-once delivery a transient backend blip silently dropped the batch.

What changed?

The Doris sink classified transient Stream Load outcomes (5xx/408/429, transport errors, Publish Timeout) as CannotStoreData but never retried them: the runtime commits the consumer offset at poll time before consume() runs and discards its return value, so a transient failure dropped the batch with no replay.
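The classification described above can be sketched as a small pure function. This is a minimal illustration, not the connector's actual code: the `Outcome` enum and `classify` function are hypothetical names, a missing HTTP status stands in for a transport error, and treating a `Label Already Exists` body as success is an assumption about how a deduplicated re-send would be handled.

```rust
/// Outcome of one Stream Load attempt, as this sketch models it (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Transient,
    Permanent,
}

/// Classify one attempt.
/// `http_status` is `None` when the request failed at the transport level.
/// `body_status` is the `Status` field of the Stream Load response body,
/// or `None` when the body could not be read or parsed.
pub fn classify(http_status: Option<u16>, body_status: Option<&str>) -> Outcome {
    let status = match http_status {
        None => return Outcome::Transient, // transport error
        Some(s) => s,
    };
    match status {
        408 | 429 | 500..=599 => return Outcome::Transient,
        200..=299 => {} // inspect the body below
        _ => return Outcome::Permanent, // other 4xx, redirects, anything unexpected
    }
    match body_status {
        Some("Success") => Outcome::Success,
        // Assumption: a re-send that Doris reports as already loaded counts as success.
        Some("Label Already Exists") => Outcome::Success,
        Some("Publish Timeout") => Outcome::Transient,
        // 2xx whose body could not be read: retry under the same label.
        None => Outcome::Transient,
        // "Fail" and any unrecognized status are treated as permanent.
        Some(_) => Outcome::Permanent,
    }
}
```

Keeping the classification free of I/O makes each row of the transient/permanent split easy to pin with a plain unit test, independent of any HTTP mock.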

consume() now retries a transiently-failed batch in-request via a new load_batch() that wraps send plus status classification, re-PUTing under the same deterministic label so Doris dedupes a prior attempt that actually landed (e.g. a 2xx whose body could not be read). Permanent failures (4xx, Fail, schema/redirect problems) are never retried. Backoff and jitter come from iggy_connector_sdk::retry, bounded by new max_retries/retry_delay/max_retry_delay config (defaults 3 / 200ms / 5s). This shrinks the at-most-once window within a single poll; cross-poll and crash delivery stay a runtime concern.
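The retry behavior described above can be sketched with the standard library alone. This is a hedged illustration under stated assumptions, not the PR's implementation: `RetryConfig`, `LoadError`, `backoff`, and `load_with_retry` are hypothetical names, the doubling schedule is an assumed backoff shape, jitter is omitted because the `iggy_connector_sdk::retry` API is not shown here, and the real sink is presumably async rather than closure-driven. Only the config key names and their defaults (3 / 200ms / 5s) come from the description.

```rust
use std::time::Duration;

/// Retry bounds, mirroring the new config keys and their stated defaults.
#[derive(Debug, Clone)]
pub struct RetryConfig {
    pub max_retries: u32,
    pub retry_delay: Duration,
    pub max_retry_delay: Duration,
}

impl Default for RetryConfig {
    fn default() -> Self {
        Self {
            max_retries: 3,
            retry_delay: Duration::from_millis(200),
            max_retry_delay: Duration::from_secs(5),
        }
    }
}

/// Hypothetical error type: only `Transient` is ever retried.
#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    Transient(String),
    Permanent(String),
}

/// Assumed schedule: retry_delay * 2^attempt, capped at max_retry_delay.
pub fn backoff(cfg: &RetryConfig, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    cfg.retry_delay.saturating_mul(factor).min(cfg.max_retry_delay)
}

/// Send one batch, re-sending under the SAME label after transient failures.
/// Returns the number of attempts made on success.
pub fn load_with_retry<F, S>(
    cfg: &RetryConfig,
    label: &str,
    mut send: F,
    mut pause: S,
) -> Result<u32, LoadError>
where
    F: FnMut(&str) -> Result<(), LoadError>,
    S: FnMut(Duration),
{
    let mut attempt = 0u32;
    loop {
        match send(label) {
            Ok(()) => return Ok(attempt + 1),
            // Retry only transient failures, and only while budget remains.
            Err(LoadError::Transient(_)) if attempt < cfg.max_retries => {
                pause(backoff(cfg, attempt));
                attempt += 1;
            }
            // Permanent failure, or transient failure with the budget exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

In this sketch a caller would pass `std::thread::sleep` as `pause` in production and a no-op closure in tests. With the defaults, an always-transient backend is attempted four times (one initial send plus three retries) before the error is returned, and every attempt carries the same label so a server-side dedupe can recognize a prior attempt that landed.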

Local Execution

  • Passed
  • Pre-commit hooks: checks run manually. The license-headers hook cannot execute on this machine (its script needs bash 4+ mapfile; only bash 3.2 is present), but hawkeye check passes directly and CI enforces it. markdownlint, taplo, cargo fmt, cargo clippy -p iggy_connector_doris_sink --all-targets -- -D warnings, and cargo test -p iggy_connector_doris_sink (41 tests) all pass.

AI Usage

  1. Claude Code (Anthropic).
  2. Implemented the retry loop, config plumbing, tests, and README/config updates, after verifying the runtime's at-most-once delivery semantics directly in the runtime source.
  3. Three new wiremock unit tests pin the behavior: transient-then-success (retry fires), exhausted-budget (exact attempt count via .expect), and permanent-not-retried. Full crate suite, clippy, and doc lint pass locally.
  4. Yes.

@github-actions

Thanks for the PR. It is labeled S-waiting-on-review and queued for review.

Slash commands (own line, regular comment) move it around the queue:

  • /ready - back to S-waiting-on-review after addressing feedback
  • /author - flip to S-waiting-on-author while you finish changes
  • /request-review @user-or-team - request a reviewer

See CONTRIBUTING.md for details.

@github-actions github-actions Bot added the S-waiting-on-review (PR is waiting on a reviewer) label Jun 27, 2026
@codecov

codecov Bot commented Jun 27, 2026

Codecov Report

❌ Patch coverage is 93.10345% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.38%. Comparing base (3b9ba2f) to head (c8ad134).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/connectors/sinks/doris_sink/src/lib.rs | 93.10% | 5 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3574       +/-   ##
=============================================
- Coverage     74.06%   46.38%   -27.69%     
  Complexity      937      937               
=============================================
  Files          1249     1246        -3     
  Lines        128257   111830    -16427     
  Branches     104127    87700    -16427     
=============================================
- Hits          94996    51867    -43129     
- Misses        30222    57293    +27071     
+ Partials       3039     2670      -369     

| Components | Coverage Δ |
|---|---|
| Rust Core | 39.11% <93.10%> (-35.60%) ⬇️ |
| Java SDK | 62.44% <ø> (ø) |
| C# SDK | 72.10% <ø> (ø) |
| Python SDK | 88.88% <ø> (ø) |
| PHP SDK | 84.29% <ø> (ø) |
| Node SDK | 91.35% <ø> (ø) |
| Go SDK | 40.14% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| core/connectors/sinks/doris_sink/src/lib.rs | 93.12% <93.10%> (+0.77%) ⬆️ |

... and 350 files with indirect coverage changes

The Doris sink classified transient Stream Load outcomes (5xx/408/429,
transport errors, Publish Timeout) as retryable but never acted on them: the
runtime commits the consumer offset at poll time before consume() runs and
discards its return value, so a transient backend blip silently dropped the
batch under at-most-once delivery.

consume() now retries a transiently-failed batch in-request, re-PUTing under
the same deterministic label so Doris dedupes a prior attempt that actually
landed (e.g. a 2xx whose body could not be read). Permanent failures are never
retried. Backoff and jitter come from iggy_connector_sdk::retry, bounded by new
max_retries/retry_delay/max_retry_delay config (defaults 3/200ms/5s). This
shrinks the at-most-once window within a single poll; cross-poll and crash
delivery remain a runtime concern, not something a sink can fix.

Relates to apache#3215.
@ryankert01 ryankert01 force-pushed the feat/doris-sink-in-request-retry branch from 549e171 to c8ad134 Compare June 29, 2026 14:04

Labels

S-waiting-on-review PR is waiting on a reviewer
