fix(driver): fall back to another EOA when a submission account is unusable by frdrckj · Pull Request #4547 · cowprotocol/services

frdrckj · 2026-06-22T06:53:02Z

Description

EIP-7702 parallel submission selects a submission account from a FIFO pool. When the selected account cannot broadcast the transaction for an account-specific reason (no balance for gas, a stale nonce, or a pending tx that cannot be replaced), the settlement currently fails even though another funded account may be available — and the stuck account keeps getting picked by later settlements. This is the gap described in #4541.

This PR makes EIP-7702 submission health-aware: it retries a failed settlement from another account and temporarily benches the unhealthy account so subsequent settlements stop assigning to a stuck lane until it recovers.

Changes

Reactive fallback + classification

Add mempools::Error::SubmitterUnusable(AccountFailure) and a classify_submission_failure helper that recognises pre-broadcast node rejections (insufficient funds, nonce too low, ... underpriced) across Geth- and Nethermind-style wording. already known and nonce too high are intentionally excluded.
Mempool::submit returns this variant (matching the full {:#} error chain) for broadcast-time rejections instead of the generic Other.
Competition::process_settle_request retries the settlement from another account on SubmitterUnusable. Spent accounts are held until the request finishes so each retry uses a different one, and a non-blocking SubmitterPool::try_acquire avoids deadlocking against concurrent settlements.

Temporary quarantine — the "stop assigning to stuck lanes" half of #4541

On an account-specific failure the account is benched for UNHEALTHY_ACCOUNT_COOLDOWN (30s): it is withheld from the FIFO pool before being returned, so concurrent and subsequent settlements skip it until it recovers. It is retried automatically after the cooldown; a still-broken account is simply re-benched after one attempt. The direct solver EOA is never benched.

Reliable propagation in multi-mempool setups

Mempools::execute races mempools with select_ok, which returns the last error when all fail — so a generic error (e.g. a timeout) from one mempool could mask a SubmitterUnusable from another and silently defeat the retry/quarantine. The race now surfaces SubmitterUnusable whenever any mempool reported one, unless a settlement-specific terminal failure (revert/expired) is present, which stays authoritative.
Metrics and solver notifications account for the new outcome.

Retrying is safe against double-submission: a settlement that actually lands always returns Ok, so only failures (where nothing was broadcast) are ever retried. That is also why already known is excluded.

Scope

This covers the two behaviours #4541 asks for: skip-and-retry on account-specific failures, and temporarily marking unhealthy EOAs unavailable so stuck lanes stop receiving new submissions. Proactive balance/nonce RPC pre-checks before selecting an account are intentionally not added — they put RPC round-trips on the latency-sensitive settlement path and are inherently racy (balance/nonce can change between check and submit), while the node's actual rejection (which this PR classifies and benches on) is the source of truth. Quarantine already satisfies the "account is not already known to be stuck" signal by physically removing benched accounts from selection. Happy to add pre-checks in a follow-up if preferred.

How to test

RUST_MIN_STACK=3145728 cargo nextest run -p driver

Unit tests:

classifies_account_specific_submission_failures, classifies_non_geth_client_wording, does_not_classify_non_account_failures_as_account_specific — the classifier in both directions (guards against retrying reverts, already known, rate limits, transport errors, nonce too high).
account_failure_is_surfaced_over_a_masking_race_error — SubmitterUnusable wins over a masking generic error but defers to settlement-specific failures.
quarantined_account_is_withheld_from_the_pool, direct_slot_is_never_quarantined — benching withholds an unhealthy account while leaving healthy ones available, and never benches the solver EOA.

Fixes #4541.

github-actions · 2026-06-22T06:53:12Z

All contributors have signed the CLA ✍️ ✅
_{Posted by the CLA Assistant Lite bot.}

gemini-code-assist

Code Review

This pull request introduces a fallback mechanism to retry transaction submissions using alternative accounts when encountering account-specific failures (such as insufficient funds, nonce issues, or underpriced transactions). Feedback focuses on improving the robustness of the error classification: first, by formatting the full anyhow::Error chain using format!("{err:#}") instead of err.to_string() to prevent underlying node errors from being hidden; and second, by loosening the substring matching logic to support non-Geth clients like Nethermind which format error messages differently.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

frdrckj · 2026-06-22T06:56:38Z

I have read the CLA Document and I hereby sign the CLA

…usable EIP-7702 parallel submission picks an account from a FIFO pool but never falls back when the chosen account can't broadcast (no gas for the tx, stale nonce, or a pending tx that can't be replaced). A valid settlement then fails even when another funded account is available. Classify pre-broadcast node rejections as account-specific (`mempools::Error::SubmitterUnusable`) and, on such a failure, retry the settlement from another account while the submission deadline still holds. Spent accounts are held for the duration of the request so each retry picks a different one, and `try_acquire` keeps the fallback non-blocking to avoid deadlocking against concurrent settlements. Retrying is safe against double-submission: a settlement that actually lands always returns `Ok`, so only failures (where nothing was broadcast) are ever retried. `already known` is deliberately not treated as account-specific, since it means our exact tx is already in the mempool and may still be mined. Towards cowprotocol#4541.

Match keyword pairs (`insufficient`+`funds`, `nonce`+`low`) instead of exact Geth phrases so non-Geth clients are also recognised (e.g. Nethermind's `SenderInsufficientFunds` / `NonceTooLow`, which omit spaces), and format the full error chain (`{:#}`) so a node message wrapped in extra context is still matched.

Issue cowprotocol#4541 asks the driver to not only retry a settlement from another submission EOA on an account-specific failure, but to temporarily stop assigning settlements to a stuck lane until it recovers. - Quarantine: a delegated account that fails to broadcast for an account-specific reason (no gas, stuck nonce, underpriced) is now held out of the selection pool for UNHEALTHY_ACCOUNT_COOLDOWN before being returned, so concurrent and subsequent settlements skip it. The direct solver EOA is never benched. - Surface SubmitterUnusable out of the mempool race: select_ok returns the last error, which could mask an account-specific failure reported by another mempool and silently defeat the retry/quarantine. Prefer SubmitterUnusable unless a settlement-specific terminal error (revert/expired) is authoritative. Adds unit tests for the quarantine withholding and the race-error preference.

igorroncevic · 2026-06-25T11:08:02Z

+    } else if message.contains("underpriced") {
+        Some(AccountFailure::Underpriced)


Plain transaction underpriced is not always account-specific. It can mean the fee is below the node or network minimum, so retrying from another EOA will likely use the same fee path and can bench all submitters without fixing the submission.

Can we only classify replacement-underpriced errors here, such as replacement transaction underpriced or client equivalents, where the error shows this account has a stuck nonce?

Right now it treats anything containing underpriced as account-specific, so a plain transaction underpriced from the node's fee floor gets the same treatment as replacement transaction underpriced. In the global case the retry just recomputes the same fee on the next EOA and fails again, benching each one for the cooldown without fixing anything. Exactly what you described.

Fix: I narrowed it to require both replacement and underpriced, so the global forms fall through to None. replacement transaction underpriced still classifies, and since it's substring matching on the lowercased message, spaceless client variants carrying both tokens are still covered. Only the plain/global strings drop out.

I also added a negative test asserting transaction underpriced and max fee per gas less than block base fee return None, since nothing guarded that before.

igorroncevic · 2026-06-25T11:08:20Z

+    match account_failure {
+        Some(reason) => Error::SubmitterUnusable(reason),


This can turn a post-broadcast error into a retryable account error. For example, one mempool can reject before broadcast, while another mempool already accepted the tx and later returns Error::Other, such as if the block stream ends after send_transaction.

In that case retrying from another EOA can double-submit the settlement while the first tx is still pending or already mined. Please keep last_error for errors that may happen after broadcast, and only override it with SubmitterUnusable when we know the failure happened before broadcast.

Oh yes, you're right, this is reachable. I'm sorry.

With multiple mempools, execute() races the lanes under select_ok, which only returns Err once all of them fail and then hands back whichever error finished last. So one lane can reject pre-broadcast with SubmitterUnusable while another already got past send_transaction and only later fails its wait loop with Error::Other (the block stream ending after submit, like you described). Since race_error only protects Revert/SimulationRevert/Expired, that post-broadcast Other gets overridden into SubmitterUnusable, and the retry loop resubmits from another EOA while the first tx is still in flight.

Fix: set a shared flag the moment any lane's send_transaction returns Ok, and only let execute() surface SubmitterUnusable when that flag is false, i.e. when nothing was put on the wire. If any lane broadcast, we keep the real error and never signal a retry. The pre-broadcast case still works as before (flag stays false, we surface SubmitterUnusable and retry).

I'll gate on the broadcast flag rather than on last_error's type. select_ok can return the pre-broadcast SubmitterUnusable as last_error itself if that lane finishes last, even though another lane already broadcast, so checking the error alone isn't enough. Keying off "did anything broadcast" covers that.

On tests: race_error picks up the new broadcast arg, so account_failure_is_surfaced_over_a_masking_race_error needs updating. The existing connection reset assertion stays valid as the pre-broadcast case (flag false, still surfaces SubmitterUnusable). I'll add a post-broadcast case (flag true) asserting the Other is preserved, plus one for the corner where last_error is itself a SubmitterUnusable while another lane broadcast.

Thanks, the any_broadcast flag fixes the post-broadcast case from this thread.

I think we still need one more guard here: if another lane returns an ambiguous non-account error before confirmed broadcast, like timeout, connection reset, already known, or nonce too high, that should stay non-retryable. Please track an non-retryable failure from any lane and only return SubmitterUnusable when no lane broadcasted and no such error was seen.

I'll add a second shared flag in execute(), saw_nonretryable, set for any lane result that isn't SubmitterUnusable and isn't Disabled (Disabled is a clean skip that never touched the network). race_error then returns SubmitterUnusable only when nothing broadcast and that flag stays false, so every active mempool failed cleanly before sending. Otherwise it keeps the real error, and strips a SubmitterUnusable that select_ok happened to return last so it can't leak a retry.

Tests: the existing connection reset case in account_failure_is_surfaced_over_a_masking_race_error now flips so it no longer triggers a retry, since that's exactly the ambiguous sibling you describe. I'll add a case for the corner where the last error is itself SubmitterUnusable while the flag is set, and keep the case where every lane failed cleanly and we still retry.

Don't retry/quarantine on a post-broadcast failure. In a multi-mempool race, select_ok can surface a pre-broadcast SubmitterUnusable from one lane while another lane already broadcast and then failed post-broadcast (e.g. the block stream ending), so retrying from a different EOA could double-submit the settlement. Track whether any lane broadcast via a per-execute flag set right after mempool.submit() succeeds, and have race_error keep last_error once a tx is on the wire, downgrading even a SubmitterUnusable it happened to return last. Only classify replacement underpricing as account-specific. A plain 'transaction underpriced' is a node/network fee-floor rejection that every account hits the same way, so retrying just re-fails and benches each submitter without fixing anything. Require both 'replacement' and 'underpriced'; global underpriced now falls through to a non-retryable error. Towards cowprotocol#4541.

igorroncevic

A few minor things

igorroncevic · 2026-06-29T13:44:46Z

+    }
+
+    #[test]
+    fn account_failure_is_surfaced_over_a_masking_race_error() {


Fallback is only safe when we know no tx was sent.

If we get an unclear error like timeout, connection reset, already known, or nonce too high, we cannot be sure no tx is pending.

In that case, please keep the error non-retryable. Only return SubmitterUnusable when all active mempools failed before sending.

Covered by the saw_nonretryable guard.
race_error now returns SubmitterUnusable only when no lane broadcast and no lane saw an ambiguous error (timeout, connection reset, already known, nonce too high), so any of those from any lane keeps the result non-retryable.

igorroncevic · 2026-06-29T13:54:00Z

+        // The tx is now on the wire. Record it so a sibling mempool's pre-broadcast
+        // account failure can't trigger a retry that double-submits this settlement.
+        broadcasted.store(true, std::sync::atomic::Ordering::SeqCst);


Please check the signer balance before this point.

If the balance is too low, return SubmitterUnusable(InsufficientFunds) so another account can be tried.

I now fetch the signer's balance and compare it to required_balance (the gas limit times the submission max_fee_per_gas, reusing the existing settlement helper). If the account cannot cover the gas, submit() returns SubmitterUnusable(InsufficientFunds) without sending, so the fallback picks another account and benches this one. A failed balance lookup is non-authoritative: I log it and proceed, letting the node decide rather than blocking submission on an RPC hiccup. This adds one eth_getBalance per submission attempt, and the reactive path still catches the case where the balance drops between the check and the send.

igorroncevic · 2026-06-29T13:54:24Z

+            let mode = guard.submission_mode();
+            let executed = self
+                .mempools
+                .execute(&settlement, submission_deadline, &mode)


Please check the deadline right before each submit attempt. acquire() can wait, so the deadline may pass before this line runs.

I'll move the deadline check to the top of the loop so it runs before each execute(), including the first one. A request whose window passed while it waited for a slot fails fast without submitting, and the current post-failure check folds into the same top-of-loop check.

igorroncevic · 2026-06-29T13:54:41Z

+    /// account is withheld from the pool for `cooldown` so concurrent and
+    /// subsequent settlements stop selecting a stuck or underfunded lane until
+    /// it recovers. No-op for the direct solver EOA, which is never benched.
+    fn quarantine(&mut self, cooldown: Duration) {


Please pass the failure reason into quarantine. The logs should show which account was benched, why it was benched, and for how long.

Will do. I'll pass the failure reason into quarantine() and store it on the guard, then include it in the bench log so it shows which account was benched, why (insufficient_funds, nonce, or replacement_underpriced), and for how long.

igorroncevic · 2026-06-29T13:55:00Z

+            tracing::warn!(
+                ?mode,
+                "submission account unusable, retrying from another account"
+            );


Please make this retry log more useful. It should show the failed account, the next account, and the reason for retrying.

I'll expand the retry log to include the failed account, the next account it falls back to, and the reason.

igorroncevic · 2026-06-29T13:55:21Z

 }
+
+#[cfg(test)]
+mod submitter_pool_tests {


These pool tests are useful, but we also need tests for the full retry flow.

Please cover direct-to-delegated fallback, delegated-to-direct fallback, no account available, deadline reached, unsafe errors, and tx already sent.

I'll extract the retry loop into a small helper, submit_with_fallback(pool, guard, cooldown, deadline_reached, submit), that takes the submit step as a closure, so it can be unit-tested without the network. Tests will cover direct to delegated fallback, delegated to direct fallback, no account available, deadline reached (fails fast before submitting), an unsafe or ambiguous error (no retry), and success on the first attempt (no retry). These assert the helper's own contract; notify::executed stays in process_settle_request and is covered there

igorroncevic · 2026-06-29T13:55:43Z

+pub enum AccountFailure {
+    InsufficientFunds,
+    Nonce,
+    Underpriced,


Can we rename this to ReplacementUnderpriced? Plain transaction underpriced should not trigger fallback. Only replacement underpriced should.

I'll rename AccountFailure::Underpriced to ReplacementUnderpriced

igorroncevic · 2026-06-29T13:59:03Z

        notify::executed(
            &self.solver,
            settlement.auction_id,
            settlement.solution(),
            &executed,
        );


Please keep solver notifications outside the retry loop.

Each retry attempt should be logged inside the loop, but notify::executed(...) should only run once, after the loop finishes with the final result.

So the split should be:

inside the loop: log each failed account and retry

after the loop: send one final solver notification

Please add a small test or check for this if it is easy.

notify::executed already runs once, after the loop, with the final result; only the per-attempt retries log inside the loop, so the structure already matches what you describe. When I extract the retry loop into a helper, notify::executed stays in process_settle_request and is called once on the helper's returned result, so there is no path that double-notifies or skips it. Asserting exactly one notification at the unit level would need a mock notify sink driving process_settle_request (a full Competition harness), which is not set up today. The call site is single by construction; happy to add that harness test if you want it.

The any_broadcast flag only covers a confirmed send_transaction Ok. When several mempools race the same settlement, a sibling lane can fail before broadcast with an ambiguous error (timeout, connection reset, `already known`, `nonce too high`) whose tx might still be on the wire, while another lane cleanly reports an account failure. Surfacing SubmitterUnusable there and retrying from another EOA could double-submit the settlement. Track a saw_nonretryable flag in execute(), set for any lane error that is not SubmitterUnusable and not Disabled (Disabled is a clean skip that never touched the network). race_error now surfaces SubmitterUnusable only when no lane broadcast and no such error was seen; otherwise it keeps the real error and strips a SubmitterUnusable that select_ok happened to return last. Towards cowprotocol#4541.

…derpriced Only a replacement-underpriced error (a stuck nonce another account avoids) is account-specific and should trigger fallback; a plain or global underpriced is a node or network fee floor that every account hits the same way. The classifier already requires both "replacement" and "underpriced", so this is a naming change only. The metric and log label becomes replacement_underpriced to match. Per review feedback.

Before broadcasting, compare the signing account's balance to the gas the settlement needs (gas limit times the submission max_fee_per_gas). If it can't cover the cost, return SubmitterUnusable(InsufficientFunds) without sending, so the health-aware fallback picks another account and benches this one instead of wasting a submission. A failed balance lookup is not authoritative, so we log it and let the node decide. Per review feedback. Towards cowprotocol#4541.

…it better Extract the health-aware retry loop into submit_with_fallback, which takes the submission step as a closure so the fallback flow can be unit-tested without a live mempool. Adds tests for direct-to-delegated and delegated-to-direct fallback, no account available, deadline reached, an unsafe non-account error, and an immediate success. Also per review feedback: - check the submission deadline before each attempt, since acquiring a slot can wait past it - bench the failed account with its failure reason, and log which account was benched, why, and for how long - log the failed account, the next account, and the reason on each retry The solver notification still fires exactly once, after the loop, with the final result. Towards cowprotocol#4541.

frdrckj requested a review from a team as a code owner June 22, 2026 06:53

gemini-code-assist Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread crates/driver/src/infra/mempool/mod.rs Outdated

Comment thread crates/driver/src/domain/mempools.rs

github-actions Bot added a commit that referenced this pull request Jun 22, 2026

@frdrckj has signed the CLA in #4547

0888cc4

frdrckj added 2 commits June 24, 2026 04:05

frdrckj force-pushed the fix/eip7702-submitter-fallback branch from e2daf2d to 28dc8f5 Compare June 23, 2026 21:07

frdrckj force-pushed the fix/eip7702-submitter-fallback branch from 28dc8f5 to 9950782 Compare June 23, 2026 21:12

igorroncevic reviewed Jun 25, 2026

View reviewed changes

frdrckj added 2 commits June 25, 2026 19:59

chore(driver): rewrap quarantine test doc comments for rustfmt

b2ee975

igorroncevic reviewed Jun 29, 2026

View reviewed changes

frdrckj added 4 commits June 30, 2026 00:09

		} else if message.contains("underpriced") {
		Some(AccountFailure::Underpriced)

		match account_failure {
		Some(reason) => Error::SubmitterUnusable(reason),

Uh oh!

Conversation

frdrckj commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Scope

How to test

Uh oh!

github-actions Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

frdrckj commented Jun 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

igorroncevic left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

frdrckj commented Jun 22, 2026 •

edited

Loading

github-actions Bot commented Jun 22, 2026 •

edited

Loading