test(resolver): add tests for cache thread-safety and defensive copies #1027
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Introduces thread-safe candidate caching in the resolver.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_resolver.py`:
- Around lines 1355-1363: The test currently uses a single _SlowProvider instance for all threads, so it does not exercise resolver.BaseProvider.resolver_cache across instances. Modify the thread setup so each thread gets its own provider instance (e.g., create providers = [_SlowProvider() for _ in range(4)] and pass providers[i] into resolve_in_thread, or construct a new _SlowProvider() in the Thread args) so the class-scoped resolver_cache and any cross-instance locking/racing are actually tested. Update references to the single provider variable accordingly.
- Around lines 1311-1313: The test improperly seeds the cache by appending to the list returned by _get_cached_candidates(identifier), which will break if that method returns a defensive copy. Instead, directly populate the provider's internal cache storage (e.g., set provider._cached_candidates[identifier] = [_make_candidate("test-pkg", "1.0.0")], or use the provider's explicit cache-write helper if one exists) so the cache state is actually mutated for the test. Update the lines that call _get_cached_candidates to write into provider._cached_candidates (or the appropriate internal cache structure) rather than appending to the returned list.
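The seeding fix described above can be sketched as follows. This is a minimal toy illustration, not fromager's actual code: the `Provider` class and its `_cached_candidates` attribute are assumptions based on the names used in the comment.

```python
class Provider:
    """Toy provider; _cached_candidates is the assumed internal storage name."""

    def __init__(self):
        self._cached_candidates = {}

    def _get_cached_candidates(self, identifier):
        # Defensive copy: mutating the returned list must not touch the cache.
        cached = self._cached_candidates.get(identifier)
        return list(cached) if cached is not None else None


provider = Provider()

# Fragile seeding (what the comment flags): appending to the returned copy
# would leave the real cache untouched:
#   provider._get_cached_candidates("test-pkg").append(candidate)

# Robust seeding: write into the internal storage directly.
provider._cached_candidates["test-pkg"] = ["test-pkg-1.0.0"]
assert provider._get_cached_candidates("test-pkg") == ["test-pkg-1.0.0"]
```

Because the getter copies on read, only a direct write into the internal dict actually changes the cache state the test intends to exercise.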
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01abf8fd-b526-4448-83f9-d08a75259191
📒 Files selected for processing (1)
tests/test_resolver.py
Force-pushed from 3f56e40 to 03b4d17
🧹 Nitpick comments (2)
tests/test_resolver.py (1)
1327-1372: Thread-safety test properly exercises the class-level cache. Using separate provider instances (line 1354) per thread correctly tests the shared class-level resolver_cache. The barrier ensures all threads hit the cache simultaneously.

One robustness note: `t.join(timeout=10)` doesn't raise if threads are still alive. Consider checking `t.is_alive()` after the joins to fail fast on unexpected hangs:

```diff
  for t in threads:
      t.join(timeout=10)
+ assert not any(t.is_alive() for t in threads), "Threads did not complete in time"
```

This prevents silent test passes when threads hang.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_resolver.py` around lines 1327-1372: the test test_find_cached_candidates_thread_safe should fail fast if any thread hangs. After joining each thread (the `for t in threads: t.join(timeout=10)` loop), check each thread's liveness using `t.is_alive()` and raise/assert if any thread remains alive, so the test fails instead of silently passing. Update the test to perform this liveness check after the join loop (or immediately after each join), referencing the threads list and Thread objects created in resolve_in_thread to detect and report hangs.

src/fromager/resolver.py (1)

612-621: Empty result caching edge case. If `find_candidates()` returns an empty list, `if cached_candidates:` evaluates to `False` on subsequent calls, causing repeated invocations. Consider using a sentinel or `None` to distinguish "cache miss" from "cached empty":

```diff
- cached_candidates = self._get_cached_candidates(identifier)
- if cached_candidates:
+ cache_key = (type(self), self.cache_key)
+ provider_cache = self.resolver_cache.get(identifier, {})
+ if cache_key in provider_cache:
+     cached_candidates = list(provider_cache[cache_key])
      logger.debug(...)
      return cached_candidates
```

This is pre-existing behavior and may be acceptable if empty results are rare or cheap to recompute.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fromager/resolver.py` around lines 612 - 621, The current cached_candidates truthiness check treats an empty list as a cache miss and causes repeated recomputation; change the caching logic so _get_cached_candidates returns None for a miss and stores/returns an actual empty list when a lookup succeeded but produced no results, then replace the conditional in the resolver (the block using _get_identifier_lock and cached_candidates) to test "cached_candidates is not None" (or compare against a sentinel) so an explicit cached empty list is honored and avoids repeated find_candidates() calls.
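The miss-versus-empty distinction at the heart of this comment can be shown with a small self-contained sketch, assuming a simplified provider (the class and method bodies below are illustrative, not fromager's real implementation):

```python
class Provider:
    """Toy provider distinguishing a cache miss (None) from a cached empty list."""

    def __init__(self):
        self._cache = {}
        self.calls = 0

    def _get_cached_candidates(self, identifier):
        # .get() keeps "never cached" (None) distinct from "cached []".
        # A setdefault(identifier, []) here would collapse the two states.
        cached = self._cache.get(identifier)
        return list(cached) if cached is not None else None

    def find_candidates(self, identifier):
        self.calls += 1  # stands in for an expensive index lookup
        return []        # a legitimate empty result

    def find_matches(self, identifier):
        cached = self._get_cached_candidates(identifier)
        if cached is not None:  # NOT `if cached:` -- [] must count as a hit
            return cached
        candidates = self.find_candidates(identifier)
        self._cache[identifier] = candidates
        return list(candidates)


p = Provider()
p.find_matches("pkg")
p.find_matches("pkg")
assert p.calls == 1  # the empty result was cached; no redundant lookup
```

With a truthiness check (`if cached:`) instead of `is not None`, the second call would re-run `find_candidates()` every time a package legitimately has no candidates.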
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/fromager/resolver.py`:
- Around line 612-621: The current cached_candidates truthiness check treats an
empty list as a cache miss and causes repeated recomputation; change the caching
logic so _get_cached_candidates returns None for a miss and stores/returns an
actual empty list when a lookup succeeded but produced no results, then replace
the conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.
In `@tests/test_resolver.py`:
- Around line 1327-1372: The test test_find_cached_candidates_thread_safe should
fail fast if any thread hangs: after joining each thread (the for t in threads:
t.join(timeout=10) loop) check each thread's liveness (using t.is_alive()) and
raise/assert if any thread remains alive so the test fails instead of silently
passing; update the test to perform this liveness check after the join loop (or
immediately after each join) referencing the threads list and Thread objects
created in resolve_in_thread to detect and report hangs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 11662622-87de-4775-8478-b23f82a0c01e
📒 Files selected for processing (2)
src/fromager/resolver.py
tests/test_resolver.py
Force-pushed from 03b4d17 to 67541ae
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/resolver.py`:
- Around lines 583-595: _get_cached_candidates currently uses setdefault to create and return an empty list on a cache miss, which collapses "no cached result" and "cached empty result". Change the logic so misses and cached empty lists are distinguishable: in _get_cached_candidates (and similarly at the lines noted around 626-633), do not create a new list on a miss; look up provider_cache with provider_cache.get((cls, self.cache_key)) or test key presence, and return None for a miss. Update _find_cached_candidates (or the caller that checks the cache) to treat a cached empty list as a cache hit by checking for key presence (is not None, or key in provider_cache / resolver_cache) instead of truthiness, ensuring find_candidates() is not re-run for previously-cached empty results.
- Around lines 617-621: The debug log calls that currently use logger.debug("%s: ...", identifier, ...) (around the unfiltered-candidates message and the cache hit/miss messages in the resolver) must be executed inside the per-request logging context helper. Wrap each of those logger.debug calls in a with req_ctxvar_context(): block so they carry the standard per-requirement context (use the existing identifier variable as before, and apply this change to the block around the unfiltered-candidates message and the subsequent cache hit/miss debug lines referenced in the same function).
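The per-requirement logging context mentioned above can be approximated with `contextvars` and a logging filter. This is a hypothetical stand-in for fromager's `req_ctxvar_context` helper, sketched only to show the pattern; the real helper's name is from the review, but its signature and internals are assumptions here.

```python
import contextlib
import contextvars
import logging

# Context variable holding the requirement currently being processed.
current_req = contextvars.ContextVar("current_req", default="-")

class ReqFilter(logging.Filter):
    def filter(self, record):
        # Stamp every record with the active requirement identifier.
        record.req = current_req.get()
        return True

@contextlib.contextmanager
def req_ctxvar_context(identifier):
    """Hypothetical sketch: set the per-requirement context for nested logs."""
    token = current_req.set(identifier)
    try:
        yield
    finally:
        current_req.reset(token)

logger = logging.getLogger("resolver-demo")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(req)s: %(message)s"))
logger.addHandler(handler)
logger.addFilter(ReqFilter())
logger.setLevel(logging.DEBUG)

with req_ctxvar_context("test-pkg"):
    logger.debug("cache hit")  # rendered with the "test-pkg: " prefix
```

Wrapping the cache hit/miss debug calls in such a context means log lines from concurrent threads can be attributed to the requirement that produced them.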
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1e32f653-084a-455d-9f76-697fe6b0c32d
📒 Files selected for processing (2)
src/fromager/resolver.py
tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_resolver.py
Simpler solution: the caller is not allowed to corrupt the state of the cache. The method _find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates(). Is that actually a problem in real life? AFAIK only the bootstrap phase resolves packages, and the bootstrap phase is single-threaded.
The bootstrap resolution phase is single-threaded, but resolution also happens during build_parallel from multiple threads: when checking whether a wheel is already built, multiple threads concurrently call into _find_cached_candidates via PyPIProvider.
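The fix this PR settles on, a per-identifier lock so only the first thread fetches while the rest wait and reuse the result, can be sketched as below. This is a minimal illustration under stated assumptions: the class and method names mirror those in the discussion, but the bodies are simplified stand-ins, not fromager's real code.

```python
import threading

class Provider:
    """Sketch of per-identifier locking over a class-level shared cache."""
    resolver_cache: dict = {}
    _locks: dict = {}
    _locks_guard = threading.Lock()
    calls = 0  # counts expensive lookups, to show they are not repeated

    @classmethod
    def _get_identifier_lock(cls, identifier):
        # One lock per package; creation itself is guarded by a single lock.
        with cls._locks_guard:
            return cls._locks.setdefault(identifier, threading.Lock())

    def find_candidates(self, identifier):
        type(self).calls += 1  # expensive network lookup in real life
        return [f"{identifier}-1.0.0"]

    def find_matches(self, identifier):
        with self._get_identifier_lock(identifier):
            cached = self.resolver_cache.get(identifier)
            if cached is not None:
                return list(cached)  # defensive copy for callers
            candidates = self.find_candidates(identifier)
            self.resolver_cache[identifier] = candidates
            return list(candidates)


barrier = threading.Barrier(4)

def run():
    barrier.wait()  # all threads race for the same identifier at once
    Provider().find_matches("pkg")

threads = [threading.Thread(target=run) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert Provider.calls == 1  # first thread fetched; the others reused the cache
```

Because each identifier has its own lock, different packages still resolve concurrently; only threads racing on the same package serialize.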
Agreed, the current callers in find_matches() only iterate the returned list; they don't mutate it. The defensive copy is a safety net, not a fix.
Force-pushed from 8218e5b to f93a1b3
fix(resolver): make resolver cache thread-safe with per-identifier locking

Add per-identifier locks to BaseProvider.resolver_cache so parallel builds do not corrupt candidate lists. Return defensive copies from _get_cached_candidates and distinguish None (cache miss) from an empty list (valid cached result) to prevent redundant find_candidates calls.

Closes: python-wheel-build#1024
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
Force-pushed from f93a1b3 to 13a6c1c
I argue that there is nothing to fix here. In the worst case, two threads look up the same package and write to the cache at the same time; the last write wins. This cannot corrupt the cache data structures, even when the GIL is disabled.
fix(resolver): make resolver cache thread-safe with per-identifier locking
During parallel builds, multiple threads resolve packages concurrently via build_parallel -> _is_wheel_built -> resolver. The shared class-level resolver_cache had no synchronization, so concurrent threads could call find_candidates() redundantly for the same package and overwrite each other's results.
Fix by adding a per-package lock so that only the first thread fetches candidates while the others wait and reuse the cached result. Different packages still resolve concurrently. Also return copies from the cache so callers cannot accidentally mutate shared state, and treat an empty list as a valid cached result, distinct from None (a cache miss).
Closes: #1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>