Joebecher/ccmrg-2042 cap bundle analysis processor at 10 attempts and fix #712
Merged
drazisil-codecov merged 7 commits into main on Feb 18, 2026
Conversation
…ry logic

- Cap total attempts at 10 (not 10 retries + 1) for BundleAnalysisProcessorTask and LockManager so we stop after 10 tries.
- Add a Redis-backed attempt counter in LockManager for lock contention, so broker re-deliveries with unchanged headers do not retry indefinitely.
- BaseCodecovTask._has_exceeded_max_attempts now takes max_attempts and compares it to attempts (retries + 1, or the header value).
- On a generic exception in the bundle processor, return and set the upload to error instead of re-raising, to avoid unbounded retries.
- Update tests: mock the request for _has_exceeded_max_attempts, set mock_redis.incr.return_value where LockManager compares attempts, and adjust the cleanup test to expect a return instead of a raised ValueError.

Refs CCMRG-2042
Co-authored-by: Cursor <cursoragent@cursor.com>
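For reference, a minimal sketch of the attempt-cap check this commit describes, assuming the Celery request exposes `retries` and an optional attempts header; everything beyond the method name `_has_exceeded_max_attempts` is illustrative, not the PR's actual code:

```python
# Hypothetical sketch of the semantics only, not the repo's implementation.
# In a real Celery task, self.request carries `retries`; the attempts header
# name below is an assumption.
class BaseCodecovTaskSketch:
    def _has_exceeded_max_attempts(self, max_attempts: int) -> bool:
        header_attempts = getattr(self.request, "attempts", None)
        # Attempts = retries so far + the current try, unless the broker
        # re-delivered the message with an explicit attempts header.
        attempts = (
            header_attempts
            if header_attempts is not None
            else self.request.retries + 1
        )
        return attempts >= max_attempts
```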
- LockManager: extract _clear_lock_attempt_counter to remove a nested try in locked()
- Upload finisher: log max_attempts as UPLOAD_PROCESSOR_MAX_RETRIES (not +1)
- lock_manager: comment the TTL intent instead of restating 24h
- Tests: remove the hard-coded (10) from comments; use max_attempts wording

Co-authored-by: Cursor <cursoragent@cursor.com>
- BaseCodecovTask: docs and safe_retry use max_retries; drop the max_attempts property
- LockRetry: max_attempts -> max_retries (same semantics: max total attempts)
- LockManager / bundle_analysis_processor / upload_finisher: logs and Sentry use max_retries
- Tests: LockRetry(max_retries=...); comments say max_retries
- celery_config: one-line convention (max_retries = max total attempts)
- Fix duplicate dict keys in lock_manager and upload_finisher

Refs CCMRG-2042
Co-authored-by: Cursor <cursoragent@cursor.com>
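Inferred from how these commits use it (LockRetry(max_retries=...), max_retries_exceeded, countdown), the exception's shape is roughly the following; the real class lives in apps/worker/services/lock_manager.py and may carry more fields:

```python
from dataclasses import dataclass

# Illustrative shape only, inferred from the commit messages in this PR.
@dataclass
class LockRetry(Exception):
    countdown: int                      # seconds the caller should wait before retrying
    max_retries: int | None = None      # maximum total attempts (the PR-wide convention)
    max_retries_exceeded: bool = False  # True when the Redis attempt counter hit the cap
```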
…ssor-at-10-attempts-and-fix

Resolve the lock_manager conflict: keep the attempt counter and max_retries logic; use self.base_retry_countdown (from main) for the countdown calculation.

Co-authored-by: Cursor <cursoragent@cursor.com>
…test

LockManager uses redis incr for the attempt count; the mock must return an int so attempts >= max_retries does not raise a TypeError.

Refs CCMRG-2042
Co-authored-by: Cursor <cursoragent@cursor.com>
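In practice the fixture fix amounts to giving the Redis mock an integer return value (fixture names here are illustrative):

```python
from unittest.mock import MagicMock

mock_redis = MagicMock()
# LockManager compares the result of redis.incr() against max_retries; a bare
# MagicMock does not support ordering comparisons against an int, so
# `attempts >= max_retries` would raise TypeError without this line.
mock_redis.incr.return_value = 1  # any int below the cap exercises the retry path
```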
When LockManager's Redis attempt counter hits max_retries before the task's own attempt count does (e.g. on broker re-deliveries), it raises LockRetry(max_retries_exceeded=True, countdown=0). ManualTriggerTask only checked self._has_exceeded_max_attempts(), so it fell through to self.retry(countdown=0) and caused rapid zero-delay retries. Align it with preprocess_upload and the other callers: check retry.max_retries_exceeded or self._has_exceeded_max_attempts() and return a failure dict when either is true. Add a test for the Redis-counter path.

Co-authored-by: Cursor <cursoragent@cursor.com>
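The aligned caller pattern, sketched under assumptions (the failure dict's keys and the surrounding task code are illustrative, not the repo's exact code):

```python
# Inside a task's run_impl; `lock_manager`, `lock_type`, and `max_retries`
# are assumed to be in scope.
try:
    with lock_manager.locked(lock_type, retry_num=self.request.retries):
        return self.process()  # hypothetical helper for the real work
except LockRetry as retry:
    if retry.max_retries_exceeded or self._has_exceeded_max_attempts(max_retries):
        # Either the Redis attempt counter or the task's own attempt count
        # hit the cap: return a failure result instead of retrying with
        # countdown=0, which previously caused rapid zero-delay retries.
        return {"successful": False, "reason": "too_many_retries"}  # illustrative keys
    self.retry(countdown=retry.countdown)
```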
- Return processing_results in the max-retries-exceeded path for consistent defensive isinstance behavior (bundle_analysis_processor)
- Consolidate redis exception imports; add a blank line between lock_name and attempt_key in LockManager.locked() (lock_manager)

Refs CCMRG-2042
Co-authored-by: Cursor <cursoragent@cursor.com>
apps/worker/services/lock_manager.py
Outdated
Comment on lines 179 to 181
| "max_attempts": max_attempts, | ||
| "max_retries": max_retries, | ||
| "repoid": self.repoid, |
Bug: The Redis INCR and EXPIRE commands for the lock attempt counter are not atomic. A failure between these calls can create a counter with no TTL, permanently blocking the lock.
Severity: HIGH
Suggested Fix
To ensure atomicity, use a Redis transaction (MULTI/EXEC) or a Lua script to perform the INCR and EXPIRE operations together. This guarantees that either both commands succeed or neither does, preventing the creation of a counter key without a TTL.
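A minimal sketch of the transactional variant with redis-py, under the key/TTL described elsewhere in this PR (lock_attempts:* with a roughly one-day TTL); a Lua script registered via register_script gives the same guarantee:

```python
import redis

ATTEMPT_TTL_SECONDS = 86_400  # ~1 day, mirroring the PR's lock_attempts TTL

def increment_attempt_counter(conn: redis.Redis, attempt_key: str) -> int:
    # transaction=True wraps the queued commands in MULTI/EXEC, so INCR and
    # EXPIRE either both execute on the server or neither does; a mid-flight
    # failure can no longer leave a counter key behind with no TTL.
    pipe = conn.pipeline(transaction=True)
    pipe.incr(attempt_key)
    pipe.expire(attempt_key, ATTEMPT_TTL_SECONDS)
    attempts, _ = pipe.execute()  # [new counter value, expire result]
    return attempts
```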
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: apps/worker/services/lock_manager.py#L179-L181

Potential issue: When lock acquisition fails, the code increments a Redis counter (`INCR`) and then sets a TTL on it (`EXPIRE`). These are two separate, non-atomic operations. If the `INCR` call succeeds but the `EXPIRE` call fails (e.g., due to a Redis timeout or connection error), the counter key will be created without a TTL. If that counter reaches the maximum retry limit, it will never be cleared, since clearing only happens on successful lock acquisition. This permanently blocks any future attempts to acquire the same lock, effectively poisoning it.
jason-ford-codecov approved these changes on Feb 18, 2026
bump ci
Note
Medium Risk
Changes core retry/lock-acquisition control flow and error handling for multiple Celery tasks, which can alter when tasks stop retrying and how failures are acknowledged. While well-covered by updated tests, misconfiguration or edge cases could prematurely stop processing or change chain outcomes.
Overview

Aligns retry semantics across workers so max_retries consistently means maximum total attempts (stop when attempts >= max_retries), updating BaseCodecovTask._has_exceeded_max_attempts / safe_retry and the related config comments.

Enhances LockManager to track lock-acquisition attempts in Redis (lock_attempts:* with a 1-day TTL), raise LockRetry(max_retries_exceeded=True) when the counter hits the cap, and clear the counter after a successful lock; tasks (bundle_analysis_processor, manual_trigger, preprocess_upload, upload_finisher, and the notification/finisher tasks) now honor this cap to avoid infinite retry loops on lock contention or message re-delivery.

Refactors the bundle_analysis_processor error paths to log consistently and to set the upload state to error (best-effort commit) while returning previous_result instead of re-raising in several failure cases, preserving Celery chain behavior; updates unit/integration tests accordingly (notably mocking redis.incr and asserting the new attempt-based boundaries/log fields).

Written by Cursor Bugbot for commit 1fe32c5. This will update automatically on new commits.
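Putting the overview's LockManager pieces together, the flow might look roughly like this condensed, hypothetical sketch; lock naming, timeouts, and the backoff formula are assumptions, and the INCR/EXPIRE pair here is the non-atomic pattern flagged in the review comment above:

```python
from contextlib import contextmanager

from redis.exceptions import LockError

ATTEMPT_TTL_SECONDS = 86_400  # TTL so a stale counter cannot block a lock forever

@contextmanager
def locked(self, lock_name: str, retry_num: int = 0, max_retries: int | None = None):
    attempt_key = f"lock_attempts:{lock_name}"
    try:
        with self.redis_connection.lock(lock_name, timeout=300, blocking_timeout=5):
            # Lock acquired: reset the contention counter for this lock.
            self._clear_lock_attempt_counter(attempt_key)
            yield
    except LockError:
        attempts = self.redis_connection.incr(attempt_key)
        self.redis_connection.expire(attempt_key, ATTEMPT_TTL_SECONDS)
        if max_retries is not None and attempts >= max_retries:
            # Counter hit the cap (e.g. broker re-deliveries with unchanged
            # headers): tell the caller to stop retrying rather than loop.
            raise LockRetry(countdown=0, max_retries_exceeded=True)
        # Otherwise back off; the PR derives the delay from self.base_retry_countdown.
        raise LockRetry(countdown=self.base_retry_countdown * (retry_num + 1))
```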