Skip to content

fix(driver): fall back to another EOA when a submission account is unusable#4547

Open
frdrckj wants to merge 9 commits into
cowprotocol:mainfrom
frdrckj:fix/eip7702-submitter-fallback
Open

fix(driver): fall back to another EOA when a submission account is unusable#4547
frdrckj wants to merge 9 commits into
cowprotocol:mainfrom
frdrckj:fix/eip7702-submitter-fallback

Conversation

@frdrckj

@frdrckj frdrckj commented Jun 22, 2026

Copy link
Copy Markdown

Description

EIP-7702 parallel submission selects a submission account from a FIFO pool. When the selected account cannot broadcast the transaction for an account-specific reason (no balance for gas, a stale nonce, or a pending tx that cannot be replaced), the settlement currently fails even though another funded account may be available — and the stuck account keeps getting picked by later settlements. This is the gap described in #4541.

This PR makes EIP-7702 submission health-aware: it retries a failed settlement from another account and temporarily benches the unhealthy account so subsequent settlements stop assigning to a stuck lane until it recovers.

Changes

Reactive fallback + classification

  • Add mempools::Error::SubmitterUnusable(AccountFailure) and a classify_submission_failure helper that recognises pre-broadcast node rejections (insufficient funds, nonce too low, ... underpriced) across Geth- and Nethermind-style wording. already known and nonce too high are intentionally excluded.
  • Mempool::submit returns this variant (matching the full {:#} error chain) for broadcast-time rejections instead of the generic Other.
  • Competition::process_settle_request retries the settlement from another account on SubmitterUnusable. Spent accounts are held until the request finishes so each retry uses a different one, and a non-blocking SubmitterPool::try_acquire avoids deadlocking against concurrent settlements.

Temporary quarantine — the "stop assigning to stuck lanes" half of #4541

  • On an account-specific failure the account is benched for UNHEALTHY_ACCOUNT_COOLDOWN (30s): it is withheld from the FIFO pool before being returned, so concurrent and subsequent settlements skip it until it recovers. It is retried automatically after the cooldown; a still-broken account is simply re-benched after one attempt. The direct solver EOA is never benched.

Reliable propagation in multi-mempool setups

  • Mempools::execute races mempools with select_ok, which returns the last error when all fail — so a generic error (e.g. a timeout) from one mempool could mask a SubmitterUnusable from another and silently defeat the retry/quarantine. The race now surfaces SubmitterUnusable whenever any mempool reported one, unless a settlement-specific terminal failure (revert/expired) is present, which stays authoritative.

  • Metrics and solver notifications account for the new outcome.

Retrying is safe against double-submission: a settlement that actually lands always returns Ok, so only failures (where nothing was broadcast) are ever retried. That is also why already known is excluded.

Scope

This covers the two behaviours #4541 asks for: skip-and-retry on account-specific failures, and temporarily marking unhealthy EOAs unavailable so stuck lanes stop receiving new submissions. Proactive balance/nonce RPC pre-checks before selecting an account are intentionally not added — they put RPC round-trips on the latency-sensitive settlement path and are inherently racy (balance/nonce can change between check and submit), while the node's actual rejection (which this PR classifies and benches on) is the source of truth. Quarantine already satisfies the "account is not already known to be stuck" signal by physically removing benched accounts from selection. Happy to add pre-checks in a follow-up if preferred.

How to test

RUST_MIN_STACK=3145728 cargo nextest run -p driver

Unit tests:

  • classifies_account_specific_submission_failures, classifies_non_geth_client_wording, does_not_classify_non_account_failures_as_account_specific — the classifier in both directions (guards against retrying reverts, already known, rate limits, transport errors, nonce too high).
  • account_failure_is_surfaced_over_a_masking_race_errorSubmitterUnusable wins over a masking generic error but defers to settlement-specific failures.
  • quarantined_account_is_withheld_from_the_pool, direct_slot_is_never_quarantined — benching withholds an unhealthy account while leaving healthy ones available, and never benches the solver EOA.

Fixes #4541.

@frdrckj frdrckj requested a review from a team as a code owner June 22, 2026 06:53
@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fallback mechanism to retry transaction submissions using alternative accounts when encountering account-specific failures (such as insufficient funds, nonce issues, or underpriced transactions). Feedback focuses on improving the robustness of the error classification: first, by formatting the full anyhow::Error chain using format!("{err:#}") instead of err.to_string() to prevent underlying node errors from being hidden; and second, by loosening the substring matching logic to support non-Geth clients like Nethermind which format error messages differently.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread crates/driver/src/infra/mempool/mod.rs Outdated
Comment thread crates/driver/src/domain/mempools.rs
@frdrckj

frdrckj commented Jun 22, 2026

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request Jun 22, 2026
frdrckj added 2 commits June 24, 2026 04:05
…usable

EIP-7702 parallel submission picks an account from a FIFO pool but never
falls back when the chosen account can't broadcast (no gas for the tx,
stale nonce, or a pending tx that can't be replaced). A valid settlement
then fails even when another funded account is available.

Classify pre-broadcast node rejections as account-specific
(`mempools::Error::SubmitterUnusable`) and, on such a failure, retry the
settlement from another account while the submission deadline still holds.
Spent accounts are held for the duration of the request so each retry picks
a different one, and `try_acquire` keeps the fallback non-blocking to avoid
deadlocking against concurrent settlements.

Retrying is safe against double-submission: a settlement that actually
lands always returns `Ok`, so only failures (where nothing was broadcast)
are ever retried. `already known` is deliberately not treated as
account-specific, since it means our exact tx is already in the mempool and
may still be mined.

Towards cowprotocol#4541.
Match keyword pairs (`insufficient`+`funds`, `nonce`+`low`) instead of exact
Geth phrases so non-Geth clients are also recognised (e.g. Nethermind's
`SenderInsufficientFunds` / `NonceTooLow`, which omit spaces), and format the
full error chain (`{:#}`) so a node message wrapped in extra context is still
matched.
@frdrckj frdrckj force-pushed the fix/eip7702-submitter-fallback branch from e2daf2d to 28dc8f5 Compare June 23, 2026 21:07
Issue cowprotocol#4541 asks the driver to not only retry a settlement from another
submission EOA on an account-specific failure, but to temporarily stop
assigning settlements to a stuck lane until it recovers.

- Quarantine: a delegated account that fails to broadcast for an
  account-specific reason (no gas, stuck nonce, underpriced) is now held
  out of the selection pool for UNHEALTHY_ACCOUNT_COOLDOWN before being
  returned, so concurrent and subsequent settlements skip it. The direct
  solver EOA is never benched.
- Surface SubmitterUnusable out of the mempool race: select_ok returns
  the last error, which could mask an account-specific failure reported
  by another mempool and silently defeat the retry/quarantine. Prefer
  SubmitterUnusable unless a settlement-specific terminal error
  (revert/expired) is authoritative.

Adds unit tests for the quarantine withholding and the race-error
preference.
@frdrckj frdrckj force-pushed the fix/eip7702-submitter-fallback branch from 28dc8f5 to 9950782 Compare June 23, 2026 21:12
Comment thread crates/driver/src/domain/mempools.rs Outdated
Comment on lines +706 to +707
} else if message.contains("underpriced") {
Some(AccountFailure::Underpriced)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plain transaction underpriced is not always account-specific. It can mean the fee is below the node or network minimum, so retrying from another EOA will likely use the same fee path and can bench all submitters without fixing the submission.

Can we only classify replacement-underpriced errors here, such as replacement transaction underpriced or client equivalents, where the error shows this account has a stuck nonce?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it treats anything containing underpriced as account-specific, so a plain transaction underpriced from the node's fee floor gets the same treatment as replacement transaction underpriced. In the global case the retry just recomputes the same fee on the next EOA and fails again, benching each one for the cooldown without fixing anything. Exactly what you described.

Fix: I narrowed it to require both replacement and underpriced, so the global forms fall through to None. replacement transaction underpriced still classifies, and since it's substring matching on the lowercased message, spaceless client variants carrying both tokens are still covered. Only the plain/global strings drop out.

I also added a negative test asserting transaction underpriced and max fee per gas less than block base fee return None, since nothing guarded that before.

Comment thread crates/driver/src/domain/mempools.rs Outdated
Comment on lines +730 to +731
match account_failure {
Some(reason) => Error::SubmitterUnusable(reason),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can turn a post-broadcast error into a retryable account error. For example, one mempool can reject before broadcast, while another mempool already accepted the tx and later returns Error::Other, such as if the block stream ends after send_transaction.

In that case retrying from another EOA can double-submit the settlement while the first tx is still pending or already mined. Please keep last_error for errors that may happen after broadcast, and only override it with SubmitterUnusable when we know the failure happened before broadcast.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, you're right, this is reachable. I'm sorry.

With multiple mempools, execute() races the lanes under select_ok, which only returns Err once all of them fail and then hands back whichever error finished last. So one lane can reject pre-broadcast with SubmitterUnusable while another already got past send_transaction and only later fails its wait loop with Error::Other (the block stream ending after submit, like you described). Since race_error only protects Revert/SimulationRevert/Expired, that post-broadcast Other gets overridden into SubmitterUnusable, and the retry loop resubmits from another EOA while the first tx is still in flight.

Fix: set a shared flag the moment any lane's send_transaction returns Ok, and only let execute() surface SubmitterUnusable when that flag is false, i.e. when nothing was put on the wire. If any lane broadcast, we keep the real error and never signal a retry. The pre-broadcast case still works as before (flag stays false, we surface SubmitterUnusable and retry).

I'll gate on the broadcast flag rather than on last_error's type. select_ok can return the pre-broadcast SubmitterUnusable as last_error itself if that lane finishes last, even though another lane already broadcast, so checking the error alone isn't enough. Keying off "did anything broadcast" covers that.

On tests: race_error picks up the new broadcast arg, so account_failure_is_surfaced_over_a_masking_race_error needs updating. The existing connection reset assertion stays valid as the pre-broadcast case (flag false, still surfaces SubmitterUnusable). I'll add a post-broadcast case (flag true) asserting the Other is preserved, plus one for the corner where last_error is itself a SubmitterUnusable while another lane broadcast.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the any_broadcast flag fixes the post-broadcast case from this thread.

I think we still need one more guard here: if another lane returns an ambiguous non-account error before confirmed broadcast, like timeout, connection reset, already known, or nonce too high, that should stay non-retryable. Please track an non-retryable failure from any lane and only return SubmitterUnusable when no lane broadcasted and no such error was seen.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a second shared flag in execute(), saw_nonretryable, set for any lane result that isn't SubmitterUnusable and isn't Disabled (Disabled is a clean skip that never touched the network). race_error then returns SubmitterUnusable only when nothing broadcast and that flag stays false, so every active mempool failed cleanly before sending. Otherwise it keeps the real error, and strips a SubmitterUnusable that select_ok happened to return last so it can't leak a retry.

Tests: the existing connection reset case in account_failure_is_surfaced_over_a_masking_race_error now flips so it no longer triggers a retry, since that's exactly the ambiguous sibling you describe. I'll add a case for the corner where the last error is itself SubmitterUnusable while the flag is set, and keep the case where every lane failed cleanly and we still retry.

frdrckj added 2 commits June 25, 2026 19:59
Don't retry/quarantine on a post-broadcast failure. In a multi-mempool race, select_ok can surface a pre-broadcast SubmitterUnusable from one lane while another lane already broadcast and then failed post-broadcast (e.g. the block stream ending), so retrying from a different EOA could double-submit the settlement. Track whether any lane broadcast via a per-execute flag set right after mempool.submit() succeeds, and have race_error keep last_error once a tx is on the wire, downgrading even a SubmitterUnusable it happened to return last.

Only classify replacement underpricing as account-specific. A plain 'transaction underpriced' is a node/network fee-floor rejection that every account hits the same way, so retrying just re-fails and benches each submitter without fixing anything. Require both 'replacement' and 'underpriced'; global underpriced now falls through to a non-retryable error.

Towards cowprotocol#4541.

@igorroncevic igorroncevic left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things

}

#[test]
fn account_failure_is_surfaced_over_a_masking_race_error() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback is only safe when we know no tx was sent.

If we get an unclear error like timeout, connection reset, already known, or nonce too high, we cannot be sure no tx is pending.

In that case, please keep the error non-retryable. Only return SubmitterUnusable when all active mempools failed before sending.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered by the saw_nonretryable guard.
race_error now returns SubmitterUnusable only when no lane broadcast and no lane saw an ambiguous error (timeout, connection reset, already known, nonce too high), so any of those from any lane keeps the result non-retryable.

Comment on lines +247 to +249
// The tx is now on the wire. Record it so a sibling mempool's pre-broadcast
// account failure can't trigger a retry that double-submits this settlement.
broadcasted.store(true, std::sync::atomic::Ordering::SeqCst);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the signer balance before this point.

If the balance is too low, return SubmitterUnusable(InsufficientFunds) so another account can be tried.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now fetch the signer's balance and compare it to required_balance (the gas limit times the submission max_fee_per_gas, reusing the existing settlement helper). If the account cannot cover the gas, submit() returns SubmitterUnusable(InsufficientFunds) without sending, so the fallback picks another account and benches this one. A failed balance lookup is non-authoritative: I log it and proceed, letting the node decide rather than blocking submission on an RPC hiccup. This adds one eth_getBalance per submission attempt, and the reactive path still catches the case where the balance drops between the check and the send.

let mode = guard.submission_mode();
let executed = self
.mempools
.execute(&settlement, submission_deadline, &mode)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the deadline right before each submit attempt. acquire() can wait, so the deadline may pass before this line runs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move the deadline check to the top of the loop so it runs before each execute(), including the first one. A request whose window passed while it waited for a slot fails fast without submitting, and the current post-failure check folds into the same top-of-loop check.

/// account is withheld from the pool for `cooldown` so concurrent and
/// subsequent settlements stop selecting a stuck or underfunded lane until
/// it recovers. No-op for the direct solver EOA, which is never benched.
fn quarantine(&mut self, cooldown: Duration) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pass the failure reason into quarantine. The logs should show which account was benched, why it was benched, and for how long.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. I'll pass the failure reason into quarantine() and store it on the guard, then include it in the bench log so it shows which account was benched, why (insufficient_funds, nonce, or replacement_underpriced), and for how long.

Comment on lines +1013 to +1016
tracing::warn!(
?mode,
"submission account unusable, retrying from another account"
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this retry log more useful. It should show the failed account, the next account, and the reason for retrying.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll expand the retry log to include the failed account, the next account it falls back to, and the reason.

}

#[cfg(test)]
mod submitter_pool_tests {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These pool tests are useful, but we also need tests for the full retry flow.

Please cover direct-to-delegated fallback, delegated-to-direct fallback, no account available, deadline reached, unsafe errors, and tx already sent.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll extract the retry loop into a small helper, submit_with_fallback(pool, guard, cooldown, deadline_reached, submit), that takes the submit step as a closure, so it can be unit-tested without the network. Tests will cover direct to delegated fallback, delegated to direct fallback, no account available, deadline reached (fails fast before submitting), an unsafe or ambiguous error (no retry), and success on the first attempt (no retry). These assert the helper's own contract; notify::executed stays in process_settle_request and is covered there

Comment thread crates/driver/src/domain/mempools.rs Outdated
pub enum AccountFailure {
InsufficientFunds,
Nonce,
Underpriced,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to ReplacementUnderpriced? Plain transaction underpriced should not trigger fallback. Only replacement underpriced should.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename AccountFailure::Underpriced to ReplacementUnderpriced

Comment on lines 1020 to 1025
notify::executed(
&self.solver,
settlement.auction_id,
settlement.solution(),
&executed,
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep solver notifications outside the retry loop.

Each retry attempt should be logged inside the loop, but notify::executed(...) should only run once, after the loop finishes with the final result.

So the split should be:

  • inside the loop: log each failed account and retry
  • after the loop: send one final solver notification

Please add a small test or check for this if it is easy.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notify::executed already runs once, after the loop, with the final result; only the per-attempt retries log inside the loop, so the structure already matches what you describe. When I extract the retry loop into a helper, notify::executed stays in process_settle_request and is called once on the helper's returned result, so there is no path that double-notifies or skips it. Asserting exactly one notification at the unit level would need a mock notify sink driving process_settle_request (a full Competition harness), which is not set up today. The call site is single by construction; happy to add that harness test if you want it.

frdrckj added 4 commits June 30, 2026 00:09
The any_broadcast flag only covers a confirmed send_transaction Ok. When
several mempools race the same settlement, a sibling lane can fail before
broadcast with an ambiguous error (timeout, connection reset, `already
known`, `nonce too high`) whose tx might still be on the wire, while another
lane cleanly reports an account failure. Surfacing SubmitterUnusable there
and retrying from another EOA could double-submit the settlement.

Track a saw_nonretryable flag in execute(), set for any lane error that is
not SubmitterUnusable and not Disabled (Disabled is a clean skip that never
touched the network). race_error now surfaces SubmitterUnusable only when no
lane broadcast and no such error was seen; otherwise it keeps the real error
and strips a SubmitterUnusable that select_ok happened to return last.

Towards cowprotocol#4541.
…derpriced

Only a replacement-underpriced error (a stuck nonce another account avoids)
is account-specific and should trigger fallback; a plain or global
underpriced is a node or network fee floor that every account hits the same
way. The classifier already requires both "replacement" and "underpriced",
so this is a naming change only. The metric and log label becomes
replacement_underpriced to match.

Per review feedback.
Before broadcasting, compare the signing account's balance to the gas the
settlement needs (gas limit times the submission max_fee_per_gas). If it
can't cover the cost, return SubmitterUnusable(InsufficientFunds) without
sending, so the health-aware fallback picks another account and benches this
one instead of wasting a submission. A failed balance lookup is not
authoritative, so we log it and let the node decide.

Per review feedback. Towards cowprotocol#4541.
…it better

Extract the health-aware retry loop into submit_with_fallback, which takes
the submission step as a closure so the fallback flow can be unit-tested
without a live mempool. Adds tests for direct-to-delegated and
delegated-to-direct fallback, no account available, deadline reached, an
unsafe non-account error, and an immediate success.

Also per review feedback:
- check the submission deadline before each attempt, since acquiring a slot
  can wait past it
- bench the failed account with its failure reason, and log which account
  was benched, why, and for how long
- log the failed account, the next account, and the reason on each retry

The solver notification still fires exactly once, after the loop, with the
final result.

Towards cowprotocol#4541.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: EIP-7702 settlement submission lacks fallback mechanism in case of missing funds

2 participants