fix(driver): fall back to another EOA when a submission account is unusable#4547
fix(driver): fall back to another EOA when a submission account is unusable#4547frdrckj wants to merge 9 commits into
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces a fallback mechanism to retry transaction submissions using alternative accounts when encountering account-specific failures (such as insufficient funds, nonce issues, or underpriced transactions). Feedback focuses on improving the robustness of the error classification: first, by formatting the full anyhow::Error chain using format!("{err:#}") instead of err.to_string() to prevent underlying node errors from being hidden; and second, by loosening the substring matching logic to support non-Geth clients like Nethermind which format error messages differently.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
I have read the CLA Document and I hereby sign the CLA |
…usable EIP-7702 parallel submission picks an account from a FIFO pool but never falls back when the chosen account can't broadcast (no gas for the tx, stale nonce, or a pending tx that can't be replaced). A valid settlement then fails even when another funded account is available. Classify pre-broadcast node rejections as account-specific (`mempools::Error::SubmitterUnusable`) and, on such a failure, retry the settlement from another account while the submission deadline still holds. Spent accounts are held for the duration of the request so each retry picks a different one, and `try_acquire` keeps the fallback non-blocking to avoid deadlocking against concurrent settlements. Retrying is safe against double-submission: a settlement that actually lands always returns `Ok`, so only failures (where nothing was broadcast) are ever retried. `already known` is deliberately not treated as account-specific, since it means our exact tx is already in the mempool and may still be mined. Towards cowprotocol#4541.
Match keyword pairs (`insufficient`+`funds`, `nonce`+`low`) instead of exact
Geth phrases so non-Geth clients are also recognised (e.g. Nethermind's
`SenderInsufficientFunds` / `NonceTooLow`, which omit spaces), and format the
full error chain (`{:#}`) so a node message wrapped in extra context is still
matched.
e2daf2d to
28dc8f5
Compare
Issue cowprotocol#4541 asks the driver to not only retry a settlement from another submission EOA on an account-specific failure, but to temporarily stop assigning settlements to a stuck lane until it recovers. - Quarantine: a delegated account that fails to broadcast for an account-specific reason (no gas, stuck nonce, underpriced) is now held out of the selection pool for UNHEALTHY_ACCOUNT_COOLDOWN before being returned, so concurrent and subsequent settlements skip it. The direct solver EOA is never benched. - Surface SubmitterUnusable out of the mempool race: select_ok returns the last error, which could mask an account-specific failure reported by another mempool and silently defeat the retry/quarantine. Prefer SubmitterUnusable unless a settlement-specific terminal error (revert/expired) is authoritative. Adds unit tests for the quarantine withholding and the race-error preference.
28dc8f5 to
9950782
Compare
| } else if message.contains("underpriced") { | ||
| Some(AccountFailure::Underpriced) |
There was a problem hiding this comment.
Plain transaction underpriced is not always account-specific. It can mean the fee is below the node or network minimum, so retrying from another EOA will likely use the same fee path and can bench all submitters without fixing the submission.
Can we only classify replacement-underpriced errors here, such as replacement transaction underpriced or client equivalents, where the error shows this account has a stuck nonce?
There was a problem hiding this comment.
Right now it treats anything containing underpriced as account-specific, so a plain transaction underpriced from the node's fee floor gets the same treatment as replacement transaction underpriced. In the global case the retry just recomputes the same fee on the next EOA and fails again, benching each one for the cooldown without fixing anything. Exactly what you described.
Fix: I narrowed it to require both replacement and underpriced, so the global forms fall through to None. replacement transaction underpriced still classifies, and since it's substring matching on the lowercased message, spaceless client variants carrying both tokens are still covered. Only the plain/global strings drop out.
I also added a negative test asserting transaction underpriced and max fee per gas less than block base fee return None, since nothing guarded that before.
| match account_failure { | ||
| Some(reason) => Error::SubmitterUnusable(reason), |
There was a problem hiding this comment.
This can turn a post-broadcast error into a retryable account error. For example, one mempool can reject before broadcast, while another mempool already accepted the tx and later returns Error::Other, such as if the block stream ends after send_transaction.
In that case retrying from another EOA can double-submit the settlement while the first tx is still pending or already mined. Please keep last_error for errors that may happen after broadcast, and only override it with SubmitterUnusable when we know the failure happened before broadcast.
There was a problem hiding this comment.
Oh yes, you're right, this is reachable. I'm sorry.
With multiple mempools, execute() races the lanes under select_ok, which only returns Err once all of them fail and then hands back whichever error finished last. So one lane can reject pre-broadcast with SubmitterUnusable while another already got past send_transaction and only later fails its wait loop with Error::Other (the block stream ending after submit, like you described). Since race_error only protects Revert/SimulationRevert/Expired, that post-broadcast Other gets overridden into SubmitterUnusable, and the retry loop resubmits from another EOA while the first tx is still in flight.
Fix: set a shared flag the moment any lane's send_transaction returns Ok, and only let execute() surface SubmitterUnusable when that flag is false, i.e. when nothing was put on the wire. If any lane broadcast, we keep the real error and never signal a retry. The pre-broadcast case still works as before (flag stays false, we surface SubmitterUnusable and retry).
I'll gate on the broadcast flag rather than on last_error's type. select_ok can return the pre-broadcast SubmitterUnusable as last_error itself if that lane finishes last, even though another lane already broadcast, so checking the error alone isn't enough. Keying off "did anything broadcast" covers that.
On tests: race_error picks up the new broadcast arg, so account_failure_is_surfaced_over_a_masking_race_error needs updating. The existing connection reset assertion stays valid as the pre-broadcast case (flag false, still surfaces SubmitterUnusable). I'll add a post-broadcast case (flag true) asserting the Other is preserved, plus one for the corner where last_error is itself a SubmitterUnusable while another lane broadcast.
There was a problem hiding this comment.
Thanks, the any_broadcast flag fixes the post-broadcast case from this thread.
I think we still need one more guard here: if another lane returns an ambiguous non-account error before confirmed broadcast, like timeout, connection reset, already known, or nonce too high, that should stay non-retryable. Please track an non-retryable failure from any lane and only return SubmitterUnusable when no lane broadcasted and no such error was seen.
There was a problem hiding this comment.
I'll add a second shared flag in execute(), saw_nonretryable, set for any lane result that isn't SubmitterUnusable and isn't Disabled (Disabled is a clean skip that never touched the network). race_error then returns SubmitterUnusable only when nothing broadcast and that flag stays false, so every active mempool failed cleanly before sending. Otherwise it keeps the real error, and strips a SubmitterUnusable that select_ok happened to return last so it can't leak a retry.
Tests: the existing connection reset case in account_failure_is_surfaced_over_a_masking_race_error now flips so it no longer triggers a retry, since that's exactly the ambiguous sibling you describe. I'll add a case for the corner where the last error is itself SubmitterUnusable while the flag is set, and keep the case where every lane failed cleanly and we still retry.
Don't retry/quarantine on a post-broadcast failure. In a multi-mempool race, select_ok can surface a pre-broadcast SubmitterUnusable from one lane while another lane already broadcast and then failed post-broadcast (e.g. the block stream ending), so retrying from a different EOA could double-submit the settlement. Track whether any lane broadcast via a per-execute flag set right after mempool.submit() succeeds, and have race_error keep last_error once a tx is on the wire, downgrading even a SubmitterUnusable it happened to return last. Only classify replacement underpricing as account-specific. A plain 'transaction underpriced' is a node/network fee-floor rejection that every account hits the same way, so retrying just re-fails and benches each submitter without fixing anything. Require both 'replacement' and 'underpriced'; global underpriced now falls through to a non-retryable error. Towards cowprotocol#4541.
| } | ||
|
|
||
| #[test] | ||
| fn account_failure_is_surfaced_over_a_masking_race_error() { |
There was a problem hiding this comment.
Fallback is only safe when we know no tx was sent.
If we get an unclear error like timeout, connection reset, already known, or nonce too high, we cannot be sure no tx is pending.
In that case, please keep the error non-retryable. Only return SubmitterUnusable when all active mempools failed before sending.
There was a problem hiding this comment.
Covered by the saw_nonretryable guard.
race_error now returns SubmitterUnusable only when no lane broadcast and no lane saw an ambiguous error (timeout, connection reset, already known, nonce too high), so any of those from any lane keeps the result non-retryable.
| // The tx is now on the wire. Record it so a sibling mempool's pre-broadcast | ||
| // account failure can't trigger a retry that double-submits this settlement. | ||
| broadcasted.store(true, std::sync::atomic::Ordering::SeqCst); |
There was a problem hiding this comment.
Please check the signer balance before this point.
If the balance is too low, return SubmitterUnusable(InsufficientFunds) so another account can be tried.
There was a problem hiding this comment.
I now fetch the signer's balance and compare it to required_balance (the gas limit times the submission max_fee_per_gas, reusing the existing settlement helper). If the account cannot cover the gas, submit() returns SubmitterUnusable(InsufficientFunds) without sending, so the fallback picks another account and benches this one. A failed balance lookup is non-authoritative: I log it and proceed, letting the node decide rather than blocking submission on an RPC hiccup. This adds one eth_getBalance per submission attempt, and the reactive path still catches the case where the balance drops between the check and the send.
| let mode = guard.submission_mode(); | ||
| let executed = self | ||
| .mempools | ||
| .execute(&settlement, submission_deadline, &mode) |
There was a problem hiding this comment.
Please check the deadline right before each submit attempt. acquire() can wait, so the deadline may pass before this line runs.
There was a problem hiding this comment.
I'll move the deadline check to the top of the loop so it runs before each execute(), including the first one. A request whose window passed while it waited for a slot fails fast without submitting, and the current post-failure check folds into the same top-of-loop check.
| /// account is withheld from the pool for `cooldown` so concurrent and | ||
| /// subsequent settlements stop selecting a stuck or underfunded lane until | ||
| /// it recovers. No-op for the direct solver EOA, which is never benched. | ||
| fn quarantine(&mut self, cooldown: Duration) { |
There was a problem hiding this comment.
Please pass the failure reason into quarantine. The logs should show which account was benched, why it was benched, and for how long.
There was a problem hiding this comment.
Will do. I'll pass the failure reason into quarantine() and store it on the guard, then include it in the bench log so it shows which account was benched, why (insufficient_funds, nonce, or replacement_underpriced), and for how long.
| tracing::warn!( | ||
| ?mode, | ||
| "submission account unusable, retrying from another account" | ||
| ); |
There was a problem hiding this comment.
Please make this retry log more useful. It should show the failed account, the next account, and the reason for retrying.
There was a problem hiding this comment.
I'll expand the retry log to include the failed account, the next account it falls back to, and the reason.
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod submitter_pool_tests { |
There was a problem hiding this comment.
These pool tests are useful, but we also need tests for the full retry flow.
Please cover direct-to-delegated fallback, delegated-to-direct fallback, no account available, deadline reached, unsafe errors, and tx already sent.
There was a problem hiding this comment.
I'll extract the retry loop into a small helper, submit_with_fallback(pool, guard, cooldown, deadline_reached, submit), that takes the submit step as a closure, so it can be unit-tested without the network. Tests will cover direct to delegated fallback, delegated to direct fallback, no account available, deadline reached (fails fast before submitting), an unsafe or ambiguous error (no retry), and success on the first attempt (no retry). These assert the helper's own contract; notify::executed stays in process_settle_request and is covered there
| pub enum AccountFailure { | ||
| InsufficientFunds, | ||
| Nonce, | ||
| Underpriced, |
There was a problem hiding this comment.
Can we rename this to ReplacementUnderpriced? Plain transaction underpriced should not trigger fallback. Only replacement underpriced should.
There was a problem hiding this comment.
I'll rename AccountFailure::Underpriced to ReplacementUnderpriced
| notify::executed( | ||
| &self.solver, | ||
| settlement.auction_id, | ||
| settlement.solution(), | ||
| &executed, | ||
| ); |
There was a problem hiding this comment.
Please keep solver notifications outside the retry loop.
Each retry attempt should be logged inside the loop, but notify::executed(...) should only run once, after the loop finishes with the final result.
So the split should be:
- inside the loop: log each failed account and retry
- after the loop: send one final solver notification
Please add a small test or check for this if it is easy.
There was a problem hiding this comment.
notify::executed already runs once, after the loop, with the final result; only the per-attempt retries log inside the loop, so the structure already matches what you describe. When I extract the retry loop into a helper, notify::executed stays in process_settle_request and is called once on the helper's returned result, so there is no path that double-notifies or skips it. Asserting exactly one notification at the unit level would need a mock notify sink driving process_settle_request (a full Competition harness), which is not set up today. The call site is single by construction; happy to add that harness test if you want it.
The any_broadcast flag only covers a confirmed send_transaction Ok. When several mempools race the same settlement, a sibling lane can fail before broadcast with an ambiguous error (timeout, connection reset, `already known`, `nonce too high`) whose tx might still be on the wire, while another lane cleanly reports an account failure. Surfacing SubmitterUnusable there and retrying from another EOA could double-submit the settlement. Track a saw_nonretryable flag in execute(), set for any lane error that is not SubmitterUnusable and not Disabled (Disabled is a clean skip that never touched the network). race_error now surfaces SubmitterUnusable only when no lane broadcast and no such error was seen; otherwise it keeps the real error and strips a SubmitterUnusable that select_ok happened to return last. Towards cowprotocol#4541.
…derpriced Only a replacement-underpriced error (a stuck nonce another account avoids) is account-specific and should trigger fallback; a plain or global underpriced is a node or network fee floor that every account hits the same way. The classifier already requires both "replacement" and "underpriced", so this is a naming change only. The metric and log label becomes replacement_underpriced to match. Per review feedback.
Before broadcasting, compare the signing account's balance to the gas the settlement needs (gas limit times the submission max_fee_per_gas). If it can't cover the cost, return SubmitterUnusable(InsufficientFunds) without sending, so the health-aware fallback picks another account and benches this one instead of wasting a submission. A failed balance lookup is not authoritative, so we log it and let the node decide. Per review feedback. Towards cowprotocol#4541.
…it better Extract the health-aware retry loop into submit_with_fallback, which takes the submission step as a closure so the fallback flow can be unit-tested without a live mempool. Adds tests for direct-to-delegated and delegated-to-direct fallback, no account available, deadline reached, an unsafe non-account error, and an immediate success. Also per review feedback: - check the submission deadline before each attempt, since acquiring a slot can wait past it - bench the failed account with its failure reason, and log which account was benched, why, and for how long - log the failed account, the next account, and the reason on each retry The solver notification still fires exactly once, after the loop, with the final result. Towards cowprotocol#4541.
Description
EIP-7702 parallel submission selects a submission account from a FIFO pool. When the selected account cannot broadcast the transaction for an account-specific reason (no balance for gas, a stale nonce, or a pending tx that cannot be replaced), the settlement currently fails even though another funded account may be available — and the stuck account keeps getting picked by later settlements. This is the gap described in #4541.
This PR makes EIP-7702 submission health-aware: it retries a failed settlement from another account and temporarily benches the unhealthy account so subsequent settlements stop assigning to a stuck lane until it recovers.
Changes
Reactive fallback + classification
mempools::Error::SubmitterUnusable(AccountFailure)and aclassify_submission_failurehelper that recognises pre-broadcast node rejections (insufficient funds,nonce too low,... underpriced) across Geth- and Nethermind-style wording.already knownandnonce too highare intentionally excluded.Mempool::submitreturns this variant (matching the full{:#}error chain) for broadcast-time rejections instead of the genericOther.Competition::process_settle_requestretries the settlement from another account onSubmitterUnusable. Spent accounts are held until the request finishes so each retry uses a different one, and a non-blockingSubmitterPool::try_acquireavoids deadlocking against concurrent settlements.Temporary quarantine — the "stop assigning to stuck lanes" half of #4541
UNHEALTHY_ACCOUNT_COOLDOWN(30s): it is withheld from the FIFO pool before being returned, so concurrent and subsequent settlements skip it until it recovers. It is retried automatically after the cooldown; a still-broken account is simply re-benched after one attempt. The direct solver EOA is never benched.Reliable propagation in multi-mempool setups
Mempools::executeraces mempools withselect_ok, which returns the last error when all fail — so a generic error (e.g. a timeout) from one mempool could mask aSubmitterUnusablefrom another and silently defeat the retry/quarantine. The race now surfacesSubmitterUnusablewhenever any mempool reported one, unless a settlement-specific terminal failure (revert/expired) is present, which stays authoritative.Metrics and solver notifications account for the new outcome.
Retrying is safe against double-submission: a settlement that actually lands always returns
Ok, so only failures (where nothing was broadcast) are ever retried. That is also whyalready knownis excluded.Scope
This covers the two behaviours #4541 asks for: skip-and-retry on account-specific failures, and temporarily marking unhealthy EOAs unavailable so stuck lanes stop receiving new submissions. Proactive balance/nonce RPC pre-checks before selecting an account are intentionally not added — they put RPC round-trips on the latency-sensitive settlement path and are inherently racy (balance/nonce can change between check and submit), while the node's actual rejection (which this PR classifies and benches on) is the source of truth. Quarantine already satisfies the "account is not already known to be stuck" signal by physically removing benched accounts from selection. Happy to add pre-checks in a follow-up if preferred.
How to test
Unit tests:
classifies_account_specific_submission_failures,classifies_non_geth_client_wording,does_not_classify_non_account_failures_as_account_specific— the classifier in both directions (guards against retrying reverts,already known, rate limits, transport errors,nonce too high).account_failure_is_surfaced_over_a_masking_race_error—SubmitterUnusablewins over a masking generic error but defers to settlement-specific failures.quarantined_account_is_withheld_from_the_pool,direct_slot_is_never_quarantined— benching withholds an unhealthy account while leaving healthy ones available, and never benches the solver EOA.Fixes #4541.