[Bug]: Runpod, Kubernetes: it's possible to delete a volume in use #3789

@un-def

Description

Steps to reproduce

# volume.dstack.yml
type: volume
name: volume-runpod
backend: runpod
region: eu-nl-1
size: 10GB

# run.dstack.yml
type: dev-environment
name: dev-environment
ide: vscode
volumes:
  - volume-runpod:/volume

1. dstack apply -f volume.dstack.yml
2. dstack apply -f run.dstack.yml
3. dstack volume delete volume-runpod

Actual behaviour

If run status = submitted:

Error (Volume error)
Volume ['volume-runpod'] is marked for deletion and cannot be attached

If run status = provisioning:

Server processing gets stuck in a loop (see the logs below), and the run stays in the provisioning state

Expected behaviour

No response

dstack version

770eaf8

Server logs

If run status = submitted:

INFO     dstack._internal.server.services.volumes:334 Deleting volumes: ['volume-runpod']
INFO     dstack._internal.server.services.events:205 Emitting event: Volume marked for deletion. Event targets:
         volume(88fa96)volume-runpod. Actor: user(efa6c3)dmitry-local-admin
DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item fcdb64de-c679-43ec-8057-a084f06510d8
DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_submitted:316 job(fcdb64)dev-environment-0-0: provisioning has
         started
WARNING  dstack._internal.server.background.pipeline_tasks.jobs_submitted:809 job(fcdb64)dev-environment-0-0: failed to prepare run
         volumes: ServerClientError("Volume ['volume-runpod'] is marked for deletion and cannot be attached")
INFO     dstack._internal.server.services.events:205 Emitting event: Job status changed SUBMITTED -> TERMINATING. Termination
         reason: VOLUME_ERROR (Volume ['volume-runpod'] is marked for deletion and cannot be attached). Event targets:
         job(fcdb64)dev-environment-0-0. Actor: system
DEBUG    dstack._internal.server.background.pipeline_tasks.base:364 Processed jobs item fcdb64de-c679-43ec-8057-a084f06510d8 in
         0.029
DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing runs item 18611f36-b953-428b-9465-8fba70d1cebd
INFO     dstack._internal.server.services.events:205 Emitting event: Run status changed SUBMITTED -> TERMINATING. Termination
         reason: JOB_FAILED. Event targets: run(18611f)dev-environment. Actor: system

If run status = provisioning:

INFO     dstack._internal.server.services.volumes:334 Deleting volumes: ['volume-runpod']
INFO     dstack._internal.server.services.events:205 Emitting event: Volume marked for deletion. Event targets:
         volume(0f2037)volume-runpod. Actor: user(efa6c3)dmitry-local-admin
DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing runs item 08b2d619-6350-46fc-9e5e-58f109940397
DEBUG    dstack._internal.server.background.pipeline_tasks.base:364 Processed runs item 08b2d619-6350-46fc-9e5e-58f109940397 in
DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing volumes item 0f2037a5-5e28-4fd4-9a4a-b49019b588b6
ERROR    dstack._internal.server.background.pipeline_tasks.volumes:408 Got exception when deleting volume volume-runpod. Please
         terminate it manually to avoid unexpected charges.
         Traceback (most recent call last):
           File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/volumes.py", line 402, in
         _process_to_be_deleted_volume
             await run_async(
             ...<2 lines>...
             )
           File "/home/def/dev/dstack/src/dstack/_internal/utils/common.py", line 50, in run_async
             return await asyncio.get_running_loop().run_in_executor(None, func_with_args)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/def/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/concurrent/futures/thread.py",
         line 59, in run
             result = self.fn(*self.args, **self.kwargs)
           File "/home/def/dev/dstack/src/dstack/_internal/core/backends/runpod/compute.py", line 435, in delete_volume
             self.api_client.delete_network_volume(volume_id=volume.volume_id)
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/def/dev/dstack/src/dstack/_internal/core/backends/runpod/api_client.py", line 281, in delete_network_volume
             self._make_request(
             ~~~~~~~~~~~~~~~~~~^
                 {
                 ^
             ...<9 lines>...
                 }
                 ^
             )
             ^
           File "/home/def/dev/dstack/src/dstack/_internal/core/backends/runpod/api_client.py", line 366, in _make_request
             raise RunpodApiClientError(errors=response_json["errors"])
         dstack._internal.core.backends.runpod.api_client.RunpodApiClientError: [{'message': 'You must remove this network volume
         from all pods before deleting it.', 'path': ['deleteNetworkVolume'], 'extensions': {'code': 'RUNPOD'}}]
INFO     dstack._internal.server.services.events:205 Emitting event: Volume deleted. Event targets: volume(0f2037)volume-runpod.
         Actor: system
DEBUG    dstack._internal.server.background.pipeline_tasks.base:364 Processed volumes item 0f2037a5-5e28-4fd4-9a4a-b49019b588b6 in
         0.397
DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 1c050618-df47-4088-853e-bcfd62c78a42
ERROR    dstack._internal.server.background.pipeline_tasks.base:361 Unexpected exception when processing item
         Traceback (most recent call last):
           File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/base.py", line 359, in start
             await self.process(item)
           File "/home/def/dev/dstack/src/dstack/_internal/server/utils/sentry_utils.py", line 28, in wrapper
             return await f(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py", line 301, in process
             result = await _process_running_job(context=context)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py", line 424, in
         _process_running_job
             startup_context = await _prepare_startup_context(context=context, result=result)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py", line 477, in
         _prepare_startup_context
             volumes = await get_job_attached_volumes(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
             ...<5 lines>...
             )
             ^
           File "/home/def/dev/dstack/src/dstack/_internal/server/services/jobs/__init__.py", line 475, in get_job_attached_volumes
             job_configured_volumes = await get_job_configured_volumes(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
             ...<4 lines>...
             )
             ^
           File "/home/def/dev/dstack/src/dstack/_internal/server/services/jobs/__init__.py", line 387, in
         get_job_configured_volumes
             volume_models = await get_job_configured_volume_models(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
             ...<5 lines>...
             )
             ^
           File "/home/def/dev/dstack/src/dstack/_internal/server/services/jobs/__init__.py", line 432, in
         get_job_configured_volume_models
             raise ResourceNotExistsError(f"Volume {mount_point.name} not found")
         dstack._internal.core.errors.ResourceNotExistsError: Volume ['volume-runpod'] not found

The ResourceNotExistsError is then raised repeatedly in a loop.
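A possible direction for a fix, sketched below: validate that no volume in the batch is still attached to a job before marking any of them for deletion. This is a simplified, hypothetical illustration; the names (`Volume`, `delete_volumes`, `ServerClientError`) echo the logs above but are stand-ins, not the actual dstack server internals.

```python
# Hypothetical pre-deletion guard: refuse to mark a volume for deletion
# while it is still attached to a job, instead of failing later in the
# background deletion task.
from dataclasses import dataclass, field
from typing import List


class ServerClientError(Exception):
    pass


@dataclass
class Volume:
    name: str
    # IDs of jobs the volume is currently attached to (illustrative).
    attachments: List[str] = field(default_factory=list)
    deleted: bool = False


def delete_volumes(volumes: List[Volume]) -> None:
    # Validate the whole batch up front so no volume is marked for
    # deletion if any of them is still in use.
    in_use = [v.name for v in volumes if v.attachments]
    if in_use:
        raise ServerClientError(
            f"Volumes {in_use} are attached to running jobs and cannot be deleted"
        )
    for v in volumes:
        v.deleted = True
```

With a guard like this, step 3 of the reproduction would fail immediately with a clear client error rather than leaving the run stuck in provisioning.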

Additional information

No response

Labels

bug (Something isn't working), volumes
