**Kubernetes**

- [x] Support network and instance volumes
- [ ] Support private container registries (`registry_auth`)
- [ ] Support multiple clusters per project

**Inference**

- [ ] Support PD-disaggregation with NVIDIA Dynamo/vLLM
- [ ] Support multi-replica gateways (high availability)
- [ ] Gateways on SSH fleets
  - requires research

**Technical debt**

- [ ] Implement instance health (incl. GPU health) via Events
  - create a new event if the health status or message of an instance changes
- [ ] Migrate to Pydantic V2
- [ ] Multi-tenancy / SSH proxy docs & UI/CLI integration
  - 1) better document the current isolation; 2) better document how to use the SSH proxy; 3) polish the SSH proxy UI/CLI integration

**Documentation**

- [ ] A dedicated API guide with examples
  - cover all the CLI functionality (in addition to the reference documentation)
- [ ] Skills guide (improve skills, plus add a dedicated guide on how to use dstack via agents)
- [ ] Distributed training examples, incl. RL
  - refresh existing examples and/or add better examples, incl. TRL; look at [Miles](https://github.com/radixark/miles)

**Benchmarks**

- [ ] PD-disaggregation

**Other / Minor**

- [ ] Orphaned resources
  - allow dstack server to detect orphaned instances (and other related resources)
- [ ] Sandboxes
  - consider supporting a run configuration type
  - requires research
- [ ] [Monarch](https://github.com/meta-pytorch/monarch) integration
- [ ] CLI performance
- [ ] Benchmark the overhead the gateway adds