fix(sdk): fix deadlock in TrustedHttpContextProvider and improve block_on for current-thread runtimes#3490
Conversation
…_in_place `get_quorum_public_key()` called `futures::executor::block_on()` for the cache-miss refetch path, which deadlocks when invoked from inside a tokio runtime (e.g. the FFI wrapper's `runtime.block_on()`). Replace with `tokio::task::block_in_place` + `Handle::current().block_on()` when a tokio context is active, falling back to a temporary `tokio::runtime::Runtime` when no context is present. Also remove diagnostic HTTP requests to google.com and the raw quorums endpoint from `dash_sdk_create_trusted` — they were debugging artifacts that should not be in production code. Closes #3432 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ide tokio Reproduce the scenario from issue #3432: a cache-miss refetch of a quorum public key called from within a multi-thread tokio runtime context. A minimal in-process HTTP server serves the /quorums response so no external network access is needed. The test would have deadlocked with the old `futures::executor::block_on` approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m_public_key Replace futures::executor::block_on with a scoped OS thread running its own current-thread tokio runtime. This fixes two failure modes:
1. futures::executor::block_on deadlocks inside a current-thread tokio runtime: the single thread is both executor and I/O driver; blocking it for block_on starves the reactor and the reqwest future never completes.
2. tokio::task::block_in_place panics on current-thread runtimes.
std::thread::scope spawns a new OS thread with no tokio context; the thread creates its own current-thread runtime, runs find_quorum to completion, and the scope ensures the thread is joined before get_quorum_public_key returns. Borrowing self across the scope boundary is safe because the scope exits only after the thread joins. The overhead (thread + runtime creation per cache miss) is acceptable because quorum cache misses are rare. Also updates the regression test: it now uses a current-thread runtime (reproducing the exact deadlock scenario) with a 5s timeout. The old futures::executor::block_on code was verified to trigger the timeout. Closes #3432 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
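The scoped-thread shape this commit describes can be sketched in plain std Rust. The `Provider` type and `expensive_fetch` below are hypothetical stand-ins (in the real fix the scoped thread builds its own current-thread tokio runtime and drives the quorum fetch on it); the sketch only shows why borrowing `self` across the scope boundary is safe.

```rust
use std::thread;

// Hypothetical stand-in for the trusted context provider.
struct Provider {
    cached: Option<u64>,
}

impl Provider {
    // Sync method that does its heavy work on a dedicated scoped thread.
    // Borrowing `self` is safe: `thread::scope` joins the thread before
    // this function can return.
    fn get_key(&self) -> u64 {
        if let Some(k) = self.cached {
            return k; // cache hit: no thread needed
        }
        thread::scope(|s| {
            let handle = s.spawn(|| {
                // In the real fix this thread would build its own
                // current-thread tokio runtime and block_on the fetch.
                self.expensive_fetch()
            });
            handle.join().expect("worker thread panicked")
        })
    }

    fn expensive_fetch(&self) -> u64 {
        42 // placeholder for the HTTP quorum fetch
    }
}

fn main() {
    let p = Provider { cached: None };
    println!("{}", p.get_key()); // prints 42
}
```

Because the scope exits only after every spawned thread has joined, the borrow of `&self` never outlives the call, which is what lets the real code avoid `'static` bounds.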
block_on previously called tokio::task::block_in_place unconditionally, which panics on current-thread runtimes. Now it detects the runtime flavor and branches accordingly:
- No runtime: creates a temporary current-thread runtime
- CurrentThread: spawns a dedicated OS thread with its own runtime
- MultiThread (and future flavors): uses block_in_place as before
Fixes #3432 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
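The three-way branch described in this commit can be sketched as follows. This is an illustrative reconstruction against tokio 1.x public APIs (`Handle::try_current`, `Handle::runtime_flavor`, `task::block_in_place`), not the SDK's actual code, and error handling is collapsed to `expect` for brevity where the real implementation returns a `Result`:

```rust
use std::thread;
use tokio::runtime::{Builder, Handle, RuntimeFlavor};

fn block_on<F>(fut: F) -> F::Output
where
    F: std::future::Future + Send,
    F::Output: Send,
{
    match Handle::try_current() {
        // No ambient runtime: build a temporary current-thread runtime.
        Err(_) => Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build runtime")
            .block_on(fut),
        Ok(handle) => match handle.runtime_flavor() {
            // block_in_place panics on a current-thread runtime, so run
            // the future on a dedicated OS thread with its own runtime.
            RuntimeFlavor::CurrentThread => thread::scope(|s| {
                s.spawn(|| {
                    Builder::new_current_thread()
                        .enable_all()
                        .build()
                        .expect("failed to build runtime")
                        .block_on(fut)
                })
                .join()
                .expect("worker thread panicked")
            }),
            // Multi-thread (and future flavors): safe to block in place.
            _ => tokio::task::block_in_place(|| handle.block_on(fut)),
        },
    }
}
```

Since `RuntimeFlavor` is `#[non_exhaustive]`, the wildcard arm is required anyway and naturally covers `MultiThread` plus any future multi-threaded variants.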
…ldcard arm RuntimeFlavor is #[non_exhaustive], so _ already covers MultiThread and any future multi-threaded variants. Deduplicate the identical arms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…:block_on Copies the runtime-aware block_on utility into the trusted context provider so it is available for future use without a shared-crate dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace Err(_) with Err(e) and include the error in the trace message so the reason no runtime was found is not silently discarded. Also reverts the block_on copy added to trusted-context-provider. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough
Replaces blocking futures executor usage inside Tokio runtimes with thread-isolated current-thread Tokio runtimes, removes diagnostic reqwest calls from the FFI prefetch task, swaps …
Changes
Sequence Diagram(s)
sequenceDiagram
participant App
participant FFI_SDK as FFI\ SDK
participant PrefetchTask as Prefetch\ Task
participant Provider as TrustedContextProvider
participant Thread as Quorum\ Thread
participant TokioRT as Tokio\ Current-Thread\ Runtime
participant QuorumHTTP as Quorum\ HTTP\ Service
App->>FFI_SDK: create trusted SDK (returns handle)
FFI_SDK->>PrefetchTask: spawn async prefetch (detached)
PrefetchTask->>Provider: update_quorum_caches().await
Note over App,Provider: Later, App requests proof verification
App->>Provider: get_quorum_public_key(hash)
alt cache hit
Provider-->>App: return key
else cache miss
Provider->>Thread: spawn OS thread to run async fetch
Thread->>TokioRT: build current-thread runtime
TokioRT->>QuorumHTTP: perform HTTP requests (reqwest)
QuorumHTTP-->>TokioRT: respond with quorum data
TokioRT-->>Thread: return result
Thread-->>Provider: join and deliver key/result
Provider-->>App: return key or error
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✅ DashSDKFFI.xcframework built for this PR.
SwiftPM (host the zip at a stable URL, then use):
.binaryTarget(
    name: "DashSDKFFI",
    url: "https://your.cdn.example/DashSDKFFI.xcframework.zip",
    checksum: "cbe56c4ea36c4a584ba597bba07790f1a661bf641d21fea6f2cf0e14514024fe"
)
Xcode manual integration:
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/rs-sdk-trusted-context-provider/Cargo.toml`:
- Line 25: The Tokio dev-dependency in Cargo.toml only enables
["macros","rt-multi-thread"] but the crate's tests use tokio networking and
async IO APIs (e.g., tokio::net::TcpListener and tokio::io::{AsyncBufReadExt,
AsyncWriteExt, BufReader}), so update the tokio entry in the [dev-dependencies]
section to include the "net" and "io-util" features (i.e., add "net" and
"io-util" alongside "macros" and "rt-multi-thread") so those APIs are explicitly
enabled for tests.
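The suggested `[dev-dependencies]` entry would look roughly like this (the version spec is illustrative, not taken from the PR):

```toml
[dev-dependencies]
tokio = { version = "1", features = ["macros", "rt-multi-thread", "net", "io-util"] }
```

Keeping `net` and `io-util` in dev-dependencies only (rather than the main `[dependencies]` entry) means they are compiled for tests without widening the feature set of downstream builds.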
In `@packages/rs-sdk-trusted-context-provider/src/provider.rs`:
- Around line 915-968: The test spawns two detached threads (the server thread
using server_rt and the client thread that builds rt and calls
provider.get_quorum_public_key) and never joins them or signals the server loop
to stop, leaving a live Tokio runtime; modify the test to keep the JoinHandles,
add a shutdown path for the server loop (e.g., a oneshot or atomic flag the
server loop checks after serving the request or break after a single accept),
signal shutdown once the client has finished (use rx to wait for the result),
then join both thread handles to ensure the server_rt and rt runtimes are
dropped and their threads exit cleanly; reference the spawned closures around
listener.accept(), the server_rt.block_on(...) thread, and the thread that
creates rt and calls provider.get_quorum_public_key so you can locate where to
add the channel/flag, signal, and handle.join().
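The shutdown-and-join pattern this comment asks for can be sketched with plain std threads (the real test uses a tokio runtime and a oneshot channel; `run_mock_server_roundtrip` and the stop-flag shape here are illustrative): the server loop checks a flag after each accept, and the test keeps the `JoinHandle` so no detached thread or live runtime survives the test.

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn run_mock_server_roundtrip() -> bool {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let stop = Arc::new(AtomicBool::new(false));

    // Keep the JoinHandle instead of detaching the server thread.
    let server = {
        let stop = stop.clone();
        thread::spawn(move || {
            for conn in listener.incoming() {
                drop(conn); // "serve" the request, then re-check the flag
                if stop.load(Ordering::SeqCst) {
                    break;
                }
            }
        })
    };

    // Client side: one "request", then signal shutdown and poke accept
    // so the blocked server loop wakes up and observes the flag.
    drop(TcpStream::connect(addr).unwrap());
    stop.store(true, Ordering::SeqCst);
    let _ = TcpStream::connect(addr); // may fail if the server already stopped
    server.join().is_ok() // thread (and any runtime it owned) is gone here
}

fn main() {
    assert!(run_mock_server_roundtrip());
    println!("server thread joined");
}
```

With a tokio runtime inside the server thread, the same structure applies: the runtime is dropped when the thread's closure returns, and the final `join` guarantees that happens before the test exits.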
In `@packages/rs-sdk/src/sync.rs`:
- Around line 89-104: In RuntimeFlavor::CurrentThread branch the worker thread
currently only sends the successful fut output via tx and drops errors/panics,
so callers get only AsyncError::RecvError; change the spawn to return a
JoinHandle and send a Result<F::Output, AsyncError> over tx (wrap build runtime
errors and any panic/worker failures into AsyncError), ensure you call join on
the handle and propagate the joined thread error if it panicked or the runtime
build failed, and have rx.recv() return that Result so the caller receives the
real failure instead of a generic recv error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 24e588b1-2b8a-4e32-9920-9e8d1542f4d7
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
- packages/rs-sdk-ffi/Cargo.toml
- packages/rs-sdk-ffi/src/sdk.rs
- packages/rs-sdk-trusted-context-provider/Cargo.toml
- packages/rs-sdk-trusted-context-provider/src/provider.rs
- packages/rs-sdk/src/sync.rs
💤 Files with no reviewable changes (1)
- packages/rs-sdk-ffi/Cargo.toml
…est threads - sync.rs: Send Result<T, AsyncError> over channel in CurrentThread branch so runtime build failures and panics propagate as real errors instead of opaque RecvError. - provider.rs: Add oneshot shutdown signal to mock server, capture and join both thread handles so no detached threads or live runtimes survive the test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
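The error-propagation fix in the first bullet can be sketched with plain std threads and channels (`run_on_worker` and this `AsyncError` are hypothetical stand-ins for the SDK's types): the worker sends a whole `Result` over the channel, and the caller joins the handle even when `recv` fails, so a panic or build failure surfaces as a real error rather than an opaque recv error.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical error type standing in for the SDK's AsyncError.
#[derive(Debug, PartialEq)]
enum AsyncError {
    Generic(String),
}

// Run `work` on a worker thread, propagating its failure (or a panic)
// to the caller instead of collapsing everything into a recv error.
fn run_on_worker<T, E>(
    work: impl FnOnce() -> Result<T, E> + Send + 'static,
) -> Result<T, AsyncError>
where
    T: Send + 'static,
    E: std::fmt::Debug + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // Send the whole Result so errors are not silently dropped.
        let _ = tx.send(work().map_err(|e| AsyncError::Generic(format!("{e:?}"))));
    });
    let recv_result = rx.recv(); // may fail if the worker panicked before sending
    // Always join, even on recv error, so a panic becomes an explicit error.
    if handle.join().is_err() {
        return Err(AsyncError::Generic("worker thread panicked".to_string()));
    }
    recv_result.map_err(|_| AsyncError::Generic("worker exited without sending".to_string()))?
}

fn main() {
    assert_eq!(run_on_worker(|| Ok::<_, String>(5)), Ok(5));
    assert!(run_on_worker(|| Err::<i32, _>("boom")).is_err());
}
```

The same shape transfers to the tokio version: the worker's runtime-build error is mapped into the sent `Result`, and the join after `recv` is what keeps a worker panic from being misreported.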
…lity Production code only uses tokio::runtime::Builder::new_current_thread() which needs the "rt" feature alone. The "rt-multi-thread", "net", and "io-util" features are only used in tests, so move them to [dev-dependencies] to avoid breaking the wasm-sdk build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Fixes a deadlock/panic when synchronous SDK APIs bridge into async work (reqwest/tokio) from within an existing tokio runtime, particularly on current-thread runtimes used by FFI consumers.
Changes:
- Updated `dash_sdk::sync::block_on` to handle “no runtime”, “current-thread runtime”, and “multi-thread runtime” cases safely.
- Reworked `TrustedHttpContextProvider::get_quorum_public_key` to perform async quorum fetches on a dedicated OS thread with its own tokio runtime to avoid deadlocks.
- Added regression tests for both the trusted context provider deadlock and `block_on` behavior; removed diagnostic HTTP checks and the related `reqwest` dependency from the FFI crate.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/rs-sdk/src/sync.rs | Runtime-flavor-aware block_on implementation + regression test for current-thread runtimes. |
| packages/rs-sdk-trusted-context-provider/src/provider.rs | Avoid deadlock by running quorum fetch in a scoped OS thread with an isolated runtime + regression test. |
| packages/rs-sdk-trusted-context-provider/Cargo.toml | Dependency updates: drop futures, add tokio (and dev features needed by tests). |
| packages/rs-sdk-ffi/src/sdk.rs | Removed diagnostic HTTP probes during trusted SDK creation. |
| packages/rs-sdk-ffi/Cargo.toml | Removed reqwest dependency (previously only used for diagnostics). |
| Cargo.lock | Lockfile updated to reflect dependency changes. |
Context:

    /// This test proves the fix works: the same async-sync-async nesting that used to
    /// panic now completes successfully on a current-thread runtime.
    #[test]
    fn test_block_on_fails_on_current_thread_runtime() {

The test name says it "fails" on a current-thread runtime, but the body and assertions expect success after the fix. Rename the test to reflect the intended behavior (e.g., succeeds/does_not_panic) so failures are easier to interpret.

Suggested change:

    -fn test_block_on_fails_on_current_thread_runtime() {
    +fn test_block_on_succeeds_on_current_thread_runtime() {
Context:

    let result = rt.block_on(outer(rx));
    ...
    rt.block_on(worker_join).ok();

This ignores the JoinHandle result from the worker task. If the task panics or is cancelled, the test will still pass/fail later in a misleading way. Prefer awaiting it with unwrap()/asserting is_ok() so task failures are surfaced deterministically.

Suggested change:

    -rt.block_on(worker_join).ok();
    +rt.block_on(worker_join).unwrap();
Context:

    let recv_result = rx.recv()?;
    join_handle
        .join()
        .map_err(|_| AsyncError::Generic("block_on worker thread panicked".to_string()))?;
    recv_result

In the CurrentThread branch, if rx.recv() returns early (e.g., worker panics before sending), the function returns via ? and the JoinHandle is dropped without joining. That can hide panics and makes failures harder to diagnose. Consider always joining the thread (even on recv error) and mapping a panic to an explicit AsyncError::Generic("... panicked") (or similar) so callers get a clearer error.

Suggested change:

    -let recv_result = rx.recv()?;
    -join_handle
    -    .join()
    -    .map_err(|_| AsyncError::Generic("block_on worker thread panicked".to_string()))?;
    -recv_result
    +let recv_result = rx.recv();
    +let join_result = join_handle.join();
    +if join_result.is_err() {
    +    return Err(AsyncError::Generic(
    +        "block_on worker thread panicked".to_string(),
    +    ));
    +}
    +recv_result?
Superseded by #3497
Issue being fixed or feature implemented
Fixes #3432 — deadlock in
`TrustedHttpContextProvider::get_quorum_public_key()` when called from within a tokio runtime.

What was done?
Root cause
`get_quorum_public_key` is a sync `ContextProvider` method called during proof verification, which happens inside an async context. The original implementation called `futures::executor::block_on` to drive the async quorum fetch. This deadlocks because:
- `futures::executor::block_on` parks the calling thread waiting for the future
- the fetch performs `reqwest` HTTP, which needs tokio's I/O reactor

A subsequent fix attempted
`tokio::task::block_in_place`, which panics on current-thread runtimes (no other worker threads to steal the task).

Fix
- `TrustedHttpContextProvider` (rs-sdk-trusted-context-provider): replaced the broken async bridge with `std::thread::scope`. A scoped OS thread is spawned with its own independent `current_thread` tokio runtime. The future runs on that thread's reactor, completely isolated from the outer runtime. The scope guarantees the thread joins before returning, so non-`'static` borrows of `self` are safe.
- `dash_sdk::sync::block_on` (rs-sdk): upgraded to handle all runtime flavors:
  - no runtime: creates a temporary `current_thread` runtime and drives the future directly
  - `CurrentThread` runtime: spawns a dedicated OS thread with its own runtime (`block_in_place` panics here)
  - multi-thread (`_`): existing `block_in_place` + spawn (unchanged, still optimal)

`RuntimeFlavor` is `#[non_exhaustive]` so the wildcard arm is required; the `CurrentThread` arm explicitly handles the problematic case.

Regression tests
- `test_get_quorum_public_key_no_deadlock_inside_tokio_runtime`: spins up a mock HTTP server and calls `get_quorum_public_key` from inside a `current_thread` runtime. Uses a 5-second channel timeout to detect hangs. Verified: old code times out after exactly 5 s; fix completes in ~30 ms.
- `test_block_on_fails_on_current_thread_runtime` (renamed to reflect fixed behaviour): runs the same async→sync→async nesting on a `current_thread` runtime and asserts `Ok("Success")`.

How Has This Been Tested?
- `cargo test -p rs-sdk-trusted-context-provider` — 9 tests pass including the deadlock regression test
- `cargo test -p dash-sdk --lib sync` — 45 tests pass including the current-thread regression test

Breaking Changes
None.
Checklist:
🤖 Co-authored by Claudius the Magnificent AI Agent
Summary by CodeRabbit
Bug Fixes
Chores
Tests