test(resolver): add tests for cache thread-safety and defensive copies #1027
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Introduces thread-safe candidate caching in the resolver.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_resolver.py`:
- Around lines 1355-1363: The test currently uses a single _SlowProvider instance for all threads, so it does not exercise resolver.BaseProvider.resolver_cache across instances. Modify the thread setup so each thread gets its own provider instance (e.g., create providers = [_SlowProvider() for _ in range(4)] and pass providers[i] into resolve_in_thread, or construct a new _SlowProvider() in the Thread args) so the class-scoped resolver_cache and any cross-instance locking/racing are actually tested. Update references to the single provider variable accordingly.
- Around lines 1311-1313: The test improperly seeds the cache by appending to the list returned by _get_cached_candidates(identifier), which will break if that method returns a defensive copy. Instead, directly populate the provider's internal cache storage (e.g., set provider._cached_candidates[identifier] = [_make_candidate("test-pkg", "1.0.0")], or use the provider's explicit cache-write helper if one exists) so the cache state is actually mutated for the test. Update the lines that call _get_cached_candidates to write into provider._cached_candidates (or the appropriate internal cache structure) rather than appending to the returned list.
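The seeding fix described above can be sketched as follows. This is a minimal toy illustration, not fromager's actual code: the `Provider` class and its `_cached_candidates` attribute are assumptions based on the names used in the comment.

```python
class Provider:
    """Toy provider; _cached_candidates is the assumed internal storage name."""

    def __init__(self):
        self._cached_candidates = {}

    def _get_cached_candidates(self, identifier):
        # Defensive copy: mutating the returned list must not touch the cache.
        cached = self._cached_candidates.get(identifier)
        return list(cached) if cached is not None else None


provider = Provider()

# Fragile seeding (what the comment flags): appending to the returned copy
# would leave the real cache untouched:
#   provider._get_cached_candidates("test-pkg").append(candidate)

# Robust seeding: write into the internal storage directly.
provider._cached_candidates["test-pkg"] = ["test-pkg-1.0.0"]
assert provider._get_cached_candidates("test-pkg") == ["test-pkg-1.0.0"]
```

Because the getter copies on read, only a direct write into the internal dict actually changes the cache state the test intends to exercise.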
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01abf8fd-b526-4448-83f9-d08a75259191
📒 Files selected for processing (1)
tests/test_resolver.py
Force-pushed from 3f56e40 to 03b4d17
🧹 Nitpick comments (2)
tests/test_resolver.py (1)
1327-1372: Thread-safety test properly exercises the class-level cache. Using separate provider instances (line 1354) per thread correctly tests the shared class-level resolver_cache. The barrier ensures all threads hit the cache simultaneously.

One robustness note: `t.join(timeout=10)` doesn't raise if threads are still alive. Consider checking `t.is_alive()` after the joins to fail fast on unexpected hangs:

```diff
  for t in threads:
      t.join(timeout=10)
+ assert not any(t.is_alive() for t in threads), "Threads did not complete in time"
```

This prevents silent test passes when threads hang.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_resolver.py` around lines 1327-1372: the test test_find_cached_candidates_thread_safe should fail fast if any thread hangs. After joining each thread (the `for t in threads: t.join(timeout=10)` loop), check each thread's liveness using `t.is_alive()` and raise/assert if any thread remains alive, so the test fails instead of silently passing. Update the test to perform this liveness check after the join loop (or immediately after each join), referencing the threads list and Thread objects created in resolve_in_thread to detect and report hangs.

src/fromager/resolver.py (1)

612-621: Empty result caching edge case. If `find_candidates()` returns an empty list, `if cached_candidates:` evaluates to `False` on subsequent calls, causing repeated invocations. Consider using a sentinel or `None` to distinguish "cache miss" from "cached empty":

```diff
- cached_candidates = self._get_cached_candidates(identifier)
- if cached_candidates:
+ cache_key = (type(self), self.cache_key)
+ provider_cache = self.resolver_cache.get(identifier, {})
+ if cache_key in provider_cache:
+     cached_candidates = list(provider_cache[cache_key])
      logger.debug(...)
      return cached_candidates
```

This is pre-existing behavior and may be acceptable if empty results are rare or cheap to recompute.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fromager/resolver.py` around lines 612 - 621, The current cached_candidates truthiness check treats an empty list as a cache miss and causes repeated recomputation; change the caching logic so _get_cached_candidates returns None for a miss and stores/returns an actual empty list when a lookup succeeded but produced no results, then replace the conditional in the resolver (the block using _get_identifier_lock and cached_candidates) to test "cached_candidates is not None" (or compare against a sentinel) so an explicit cached empty list is honored and avoids repeated find_candidates() calls.
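The miss-versus-empty distinction at the heart of this comment can be shown with a small self-contained sketch, assuming a simplified provider (the class and method bodies below are illustrative, not fromager's real implementation):

```python
class Provider:
    """Toy provider distinguishing a cache miss (None) from a cached empty list."""

    def __init__(self):
        self._cache = {}
        self.calls = 0

    def _get_cached_candidates(self, identifier):
        # .get() keeps "never cached" (None) distinct from "cached []".
        # A setdefault(identifier, []) here would collapse the two states.
        cached = self._cache.get(identifier)
        return list(cached) if cached is not None else None

    def find_candidates(self, identifier):
        self.calls += 1  # stands in for an expensive index lookup
        return []        # a legitimate empty result

    def find_matches(self, identifier):
        cached = self._get_cached_candidates(identifier)
        if cached is not None:  # NOT `if cached:` -- [] must count as a hit
            return cached
        candidates = self.find_candidates(identifier)
        self._cache[identifier] = candidates
        return list(candidates)


p = Provider()
p.find_matches("pkg")
p.find_matches("pkg")
assert p.calls == 1  # the empty result was cached; no redundant lookup
```

With a truthiness check (`if cached:`) instead of `is not None`, the second call would re-run `find_candidates()` every time a package legitimately has no candidates.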
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/fromager/resolver.py`:
- Around line 612-621: The current cached_candidates truthiness check treats an
empty list as a cache miss and causes repeated recomputation; change the caching
logic so _get_cached_candidates returns None for a miss and stores/returns an
actual empty list when a lookup succeeded but produced no results, then replace
the conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.
In `@tests/test_resolver.py`:
- Around line 1327-1372: The test test_find_cached_candidates_thread_safe should
fail fast if any thread hangs: after joining each thread (the for t in threads:
t.join(timeout=10) loop) check each thread's liveness (using t.is_alive()) and
raise/assert if any thread remains alive so the test fails instead of silently
passing; update the test to perform this liveness check after the join loop (or
immediately after each join) referencing the threads list and Thread objects
created in resolve_in_thread to detect and report hangs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 11662622-87de-4775-8478-b23f82a0c01e
📒 Files selected for processing (2)
src/fromager/resolver.py
tests/test_resolver.py
Force-pushed from 03b4d17 to 67541ae
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/resolver.py`:
- Around lines 583-595: _get_cached_candidates currently uses setdefault to create and return an empty list on a cache miss, which collapses "no cached result" and "cached empty result". Change the logic so misses and cached empty lists are distinguishable: in _get_cached_candidates (and similarly at the lines noted around 626-633), do not create a new list on a miss; look up provider_cache with provider_cache.get((cls, self.cache_key)) or test key presence, and return None for a miss. Update _find_cached_candidates (or the caller that checks the cache) to treat a cached empty list as a cache hit by checking for key presence (is not None, or key in provider_cache / resolver_cache) instead of truthiness, ensuring find_candidates() is not re-run for previously-cached empty results.
- Around lines 617-621: The debug log calls that currently use logger.debug("%s: ...", identifier, ...) (around the unfiltered-candidates message and the cache hit/miss messages in the resolver) must be executed inside the per-request logging context helper. Wrap each of those logger.debug calls in a with req_ctxvar_context(): block so they carry the standard per-requirement context (use the existing identifier variable as before, and apply this change to the block around the unfiltered-candidates message and the subsequent cache hit/miss debug lines referenced in the same function).
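The per-requirement logging context mentioned above can be approximated with `contextvars` and a logging filter. This is a hypothetical stand-in for fromager's `req_ctxvar_context` helper, sketched only to show the pattern; the real helper's name is from the review, but its signature and internals are assumptions here.

```python
import contextlib
import contextvars
import logging

# Context variable holding the requirement currently being processed.
current_req = contextvars.ContextVar("current_req", default="-")

class ReqFilter(logging.Filter):
    def filter(self, record):
        # Stamp every record with the active requirement identifier.
        record.req = current_req.get()
        return True

@contextlib.contextmanager
def req_ctxvar_context(identifier):
    """Hypothetical sketch: set the per-requirement context for nested logs."""
    token = current_req.set(identifier)
    try:
        yield
    finally:
        current_req.reset(token)

logger = logging.getLogger("resolver-demo")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(req)s: %(message)s"))
logger.addHandler(handler)
logger.addFilter(ReqFilter())
logger.setLevel(logging.DEBUG)

with req_ctxvar_context("test-pkg"):
    logger.debug("cache hit")  # rendered with the "test-pkg: " prefix
```

Wrapping the cache hit/miss debug calls in such a context means log lines from concurrent threads can be attributed to the requirement that produced them.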
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1e32f653-084a-455d-9f76-697fe6b0c32d
📒 Files selected for processing (2)
src/fromager/resolver.py
tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_resolver.py
Simpler solution: the caller is not allowed to corrupt the state of the cache. The method _find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates(). Is that actually a problem in real life? AFAIK only the bootstrap phase resolves packages, and the bootstrap phase is single-threaded.
The bootstrap resolution phase is single-threaded, but resolution also happens during build_parallel from multiple threads: when checking whether a wheel is already built, multiple threads concurrently call into _find_cached_candidates via PyPIProvider.
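The fix this PR settles on, a per-identifier lock so only the first thread fetches while the rest wait and reuse the result, can be sketched as below. This is a minimal illustration under stated assumptions: the class and method names mirror those in the discussion, but the bodies are simplified stand-ins, not fromager's real code.

```python
import threading

class Provider:
    """Sketch of per-identifier locking over a class-level shared cache."""
    resolver_cache: dict = {}
    _locks: dict = {}
    _locks_guard = threading.Lock()
    calls = 0  # counts expensive lookups, to show they are not repeated

    @classmethod
    def _get_identifier_lock(cls, identifier):
        # One lock per package; creation itself is guarded by a single lock.
        with cls._locks_guard:
            return cls._locks.setdefault(identifier, threading.Lock())

    def find_candidates(self, identifier):
        type(self).calls += 1  # expensive network lookup in real life
        return [f"{identifier}-1.0.0"]

    def find_matches(self, identifier):
        with self._get_identifier_lock(identifier):
            cached = self.resolver_cache.get(identifier)
            if cached is not None:
                return list(cached)  # defensive copy for callers
            candidates = self.find_candidates(identifier)
            self.resolver_cache[identifier] = candidates
            return list(candidates)


barrier = threading.Barrier(4)

def run():
    barrier.wait()  # all threads race for the same identifier at once
    Provider().find_matches("pkg")

threads = [threading.Thread(target=run) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert Provider.calls == 1  # first thread fetched; the others reused the cache
```

Because each identifier has its own lock, different packages still resolve concurrently; only threads racing on the same package serialize.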
Agreed, the current callers in find_matches() only iterate the returned list; they don't mutate it. The defensive copy is a safety net, not a fix.
Force-pushed from 8218e5b to f93a1b3
fix(resolver): make resolver cache thread-safe with per-identifier locking

Add per-identifier locks to BaseProvider.resolver_cache so parallel builds do not corrupt candidate lists. Return defensive copies from _get_cached_candidates and distinguish None (cache miss) from an empty list (valid cached result) to prevent redundant find_candidates calls.

Closes: python-wheel-build#1024
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
Force-pushed from f93a1b3 to 13a6c1c
I argue that there is nothing to fix here. In the worst case, two threads look up the same package and write to the cache at the same time; the last write wins. This cannot corrupt the cache data structures, even when the GIL is disabled.
fix(resolver): make resolver cache thread-safe with per-identifier locking
During parallel builds, multiple threads resolve packages concurrently via build_parallel -> _is_wheel_built -> resolver. The shared class-level resolver_cache had no synchronization, so concurrent threads could call find_candidates() redundantly for the same package and overwrite each other's results.
Fix by adding a per-package lock so that only the first thread fetches candidates while the others wait and reuse the cached result. Different packages still resolve concurrently. Also return copies from the cache so callers cannot accidentally mutate shared state, and treat an empty list as a valid cached result, distinct from None (a cache miss).
Closes: #1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>