@agustinhenze

When multiple repository operations execute concurrently on shared pool directories, race conditions could cause .deb files to be deleted despite appearing in repository metadata, resulting in apt 404 errors.

Three distinct but related race conditions were identified and fixed:

  1. Package addition vs publish race: When packages are added to a local repository that is already published, the publish operation could read stale package references before the add transaction commits. Fixed by locking all published repositories that reference the local repo during package addition.

  2. Pool file deletion race: When multiple published repositories share the same pool directory (same storage+prefix) and publish concurrently, cleanup operations could delete each other's newly created files. The cleanup in thread B would:

    • Query database for referenced files (not seeing thread A's uncommitted files)
    • Scan pool directory (seeing thread A's files)
    • Delete thread A's files as "orphaned"

    Fixed by implementing pool-sibling locking: acquire locks on ALL published
    repositories sharing the same storage and prefix before publish/cleanup.

  3. Concurrent cleanup on same prefix: Multiple distributions publishing to the same prefix concurrently could have cleanup operations delete shared files. Fixed by:

    • Adding prefix-level locking to serialize cleanup operations
    • Removing ref subtraction that incorrectly marked shared files as orphaned
    • Forcing database reload before cleanup to see recent commits
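
The pool-sibling lock set described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `PublishedRepo` struct, the `lockKey` format and `SiblingLockSet` are hypothetical names, and the only assumption taken from the description is that two published repositories share a pool when their storage and prefix match.

```go
package main

import (
	"fmt"
	"sort"
)

// PublishedRepo is a hypothetical, simplified stand-in for a published
// repository record; the real type has more fields.
type PublishedRepo struct {
	Storage      string
	Prefix       string
	Distribution string
}

// lockKey builds an illustrative resource key for one published repository.
// The key format is an assumption made for this sketch.
func lockKey(r PublishedRepo) string {
	return "published:" + r.Storage + ":" + r.Prefix + "/" + r.Distribution
}

// SiblingLockSet returns the lock keys of target and of every repository in
// all that shares target's storage and prefix (and therefore its pool).
// Keys are deduplicated and sorted so callers get a stable order.
func SiblingLockSet(target PublishedRepo, all []PublishedRepo) []string {
	seen := map[string]bool{lockKey(target): true}
	keys := []string{lockKey(target)}
	for _, r := range all {
		if r.Storage != target.Storage || r.Prefix != target.Prefix {
			continue // different pool directory, no lock needed
		}
		if k := lockKey(r); !seen[k] {
			seen[k] = true
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	all := []PublishedRepo{
		{Storage: "s3:main", Prefix: "debian", Distribution: "bookworm"},
		{Storage: "s3:main", Prefix: "debian", Distribution: "trixie"},
		{Storage: "s3:main", Prefix: "ubuntu", Distribution: "noble"},
	}
	// Publishing bookworm also locks trixie, which shares the same pool.
	for _, k := range SiblingLockSet(all[0], all) {
		fmt.Println(k)
	}
}
```

In this sketch, publishing `bookworm` claims the keys for both `bookworm` and `trixie`, while `noble` under a different prefix is left alone.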

The existing task system serializes operations based on resource locks, preventing these race conditions when proper lock sets are acquired.
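
As a rough illustration of serializing by resource locks, the sketch below lets a task start only when none of the resources it names are held, and claims them all at once. `ResourceLocks`, `Acquire`, `Release` and `MaxOverlap` are hypothetical names for this example and do not describe the actual task system's API.

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceLocks is a hypothetical table of resource keys currently in use.
type ResourceLocks struct {
	mu   sync.Mutex
	cond *sync.Cond
	held map[string]bool
}

func NewResourceLocks() *ResourceLocks {
	l := &ResourceLocks{held: map[string]bool{}}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// Acquire blocks until every requested resource is free, then claims them
// all in one step. All-or-nothing claiming avoids partial lock sets.
func (l *ResourceLocks) Acquire(resources []string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.anyHeld(resources) {
		l.cond.Wait()
	}
	for _, r := range resources {
		l.held[r] = true
	}
}

// anyHeld reports whether any resource is claimed; callers must hold l.mu.
func (l *ResourceLocks) anyHeld(resources []string) bool {
	for _, r := range resources {
		if l.held[r] {
			return true
		}
	}
	return false
}

// Release frees the resources and wakes every waiting task.
func (l *ResourceLocks) Release(resources []string) {
	l.mu.Lock()
	for _, r := range resources {
		delete(l.held, r)
	}
	l.mu.Unlock()
	l.cond.Broadcast()
}

// MaxOverlap runs n tasks that all claim the same resources and returns the
// highest number of tasks that held them at the same time.
func MaxOverlap(l *ResourceLocks, n int, resources []string) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	active, peak := 0, 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Acquire(resources)
			defer l.Release(resources)
			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()
			// A publish or cleanup step would run here.
			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	locks := NewResourceLocks()
	fmt.Println(MaxOverlap(locks, 8, []string{"pool:s3:main:debian"}))
}
```

Because every task names the same pool resource, at most one of them is inside the guarded section at any time, which is the property the fixes rely on once the lock sets are complete.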

Test coverage includes concurrent publish scenarios that reliably reproduced all three bugs before the fixes.
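
The failure mode these tests target can also be shown deterministically. The sketch below is an illustration only, with a hypothetical `Orphans` helper and made-up file paths: it compares a pool listing against a set of referenced files, and shows that a stale reference snapshot flags another publish's new file as orphaned.

```go
package main

import (
	"fmt"
	"sort"
)

// Orphans returns the pool files that do not appear in the referenced set,
// sorted for stable output. It models the cleanup decision only; it does
// not touch the filesystem or a database.
func Orphans(poolFiles []string, referenced map[string]bool) []string {
	var out []string
	for _, f := range poolFiles {
		if !referenced[f] {
			out = append(out, f)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Thread A has already written its .deb to the shared pool.
	pool := []string{
		"pool/main/a/a_1.0_amd64.deb", // written by thread A, not yet committed
		"pool/main/b/b_2.0_amd64.deb", // referenced by thread B's repository
	}

	// Thread B's view of the database before A's transaction commits.
	stale := map[string]bool{"pool/main/b/b_2.0_amd64.deb": true}

	// The view B would have after waiting for A and reloading.
	fresh := map[string]bool{
		"pool/main/a/a_1.0_amd64.deb": true,
		"pool/main/b/b_2.0_amd64.deb": true,
	}

	fmt.Println(Orphans(pool, stale)) // A's file is wrongly treated as orphaned
	fmt.Println(Orphans(pool, fresh)) // nothing to delete
}
```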

Checklist

  • unit-test added
  • functional test added/updated (if change is functional)
  • author name in AUTHORS

@agustinhenze agustinhenze force-pushed the fix-concurrent-publish-race-conditions branch 3 times, most recently from 76a7b8f to 8a47888 Compare December 4, 2025 18:15
@agustinhenze

It looks like my new test is timing out (it takes longer than 2 minutes, depending on the load of the machine). @iofq or @neolynx, would you mind taking a look? I can try to fix the timeout later, but an early review would be really appreciated.

I found these bugs during stress testing, which I did after we hit intermittent failures in production.

@agustinhenze agustinhenze force-pushed the fix-concurrent-publish-race-conditions branch from 8a47888 to a9591c7 Compare December 4, 2025 18:41
@agustinhenze agustinhenze force-pushed the fix-concurrent-publish-race-conditions branch from a9591c7 to 47b362c Compare December 4, 2025 19:18
@neolynx neolynx self-assigned this Dec 4, 2025
@mkdir -p /tmp/aptly-etcd-data; system/t13_etcd/start-etcd.sh > /tmp/aptly-etcd-data/etcd.log 2>&1 &
@echo "\e[33m\e[1mRunning go test ...\e[0m"
-faketime "$(TEST_FAKETIME)" go test -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
+faketime "$(TEST_FAKETIME)" go test -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
Tests should be run with -race to detect race conditions
