DAOS-18552 pool: Fix a PS start-stop race (#17564) #17697

Draft: liw wants to merge 1 commit into release/2.6 from liw/pool-svc-stop-wa-2.6

@liw liw commented Mar 13, 2026

The following race happened during a pool create operation, triggered by abnormally slow VMs:

ds_rsvc_start
  start
    pool_svc_alloc_cb
      ds_pool_lookup: OK
....VM slowness causes start timeout, which triggers stop....
                          ds_pool_stop
                            pool->sp_stopping = 1
                            ds_pool_svc_stop: none
  insert
                            wait for ds_pool references: hang

This patch is a quick fix that prevents ds_rsvc_start from inserting a PS into the hash table if the ds_pool is stopping, so that ds_pool_stop won't hang. Manual testing shows that such a pool create operation now retries and succeeds transparently.
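For readers unfamiliar with the code paths, the race and the guard can be modeled in a few lines of C. This is only a minimal sketch, not the actual patch: every `sketch_*` name, the `SKETCH_ERR_BUSY` code, and the assumption that a refused insert drops the pool reference taken during start are hypothetical simplifications. The real ds_pool, ds_rsvc, and hash table code is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical retryable error code; not a real DAOS error value. */
#define SKETCH_ERR_BUSY (-1)

/* Simplified stand-in for ds_pool: a reference count and a stopping flag. */
struct sketch_pool {
	int  refs;
	bool stopping;
};

/* Simplified stand-in for a pool service (PS) that pins its pool. */
struct sketch_svc {
	struct sketch_pool *pool;
	bool                inserted;
};

static void sketch_pool_get(struct sketch_pool *pool)
{
	pool->refs++;
}

static void sketch_pool_put(struct sketch_pool *pool)
{
	assert(pool->refs > 0);
	pool->refs--;
}

/* "start": allocate the PS and look up the pool, taking a reference. */
static void sketch_svc_alloc(struct sketch_svc *svc, struct sketch_pool *pool)
{
	sketch_pool_get(pool);
	svc->pool = pool;
	svc->inserted = false;
}

/* First half of "stop": set the flag and stop a PS only if it is inserted. */
static void sketch_pool_begin_stop(struct sketch_pool *pool, struct sketch_svc *svc)
{
	pool->stopping = true;
	if (svc->inserted) {
		sketch_pool_put(svc->pool);
		svc->pool = NULL;
		svc->inserted = false;
	}
}

/* Second half of "stop": the wait can finish only once all references are gone. */
static bool sketch_pool_stop_can_finish(const struct sketch_pool *pool)
{
	return pool->refs == 0;
}

/* "insert" with the guard: refuse if the pool is stopping and drop the reference. */
static int sketch_svc_insert(struct sketch_svc *svc)
{
	if (svc->pool->stopping) {
		sketch_pool_put(svc->pool);
		svc->pool = NULL;
		return SKETCH_ERR_BUSY; /* assumed: the caller retries later */
	}
	svc->inserted = true;
	return 0;
}
```

Replaying the trace against this model: `sketch_svc_alloc` takes the reference, `sketch_pool_begin_stop` finds no inserted PS to stop, and the guarded `sketch_svc_insert` then fails and releases the reference, so the stop side is no longer left waiting. Without the `stopping` check, the insert would succeed after the stop path had already looked for a PS, and the reference would never be released.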

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Li Wei <liwei@hpe.com>
@github-actions

Ticket title is 'rebuild/container_rf.py:RbldContRfTest.test_rebuild_with_container_rf - pool create failed: DER_BUSY(-1012): Device or resource busy'
Status is 'In Review'
Labels: 'ci_master_weekly,weekly_test'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18552

github-actions bot added the "priority" label (Ticket has high priority, automatically managed) on Mar 13, 2026