49 changes: 28 additions & 21 deletions mkdocs/docs/concepts/services.md
@@ -357,7 +357,6 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:
```yaml
type: service
name: prefill-decode
image: lmsysorg/sglang:v0.5.10.post1

env:
- HF_TOKEN
@@ -366,61 +365,69 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:
replicas:
- count: 1
# For now replica group with router must have count: 1
python: "3.12"
commands:
- pip install smg
- |
smg launch \
--enable-igw \
--pd-disaggregation \
--model-path $MODEL_ID \
--host 0.0.0.0 \
--port 8000 \
--pd-disaggregation \
--prefill-policy cache_aware
resources:
cpu: 4
router:
type: sglang
resources:
cpu: 4

- count: 1..4
- count: 1..2
scaling:
metric: rps
target: 3
target: 300
image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
commands:
- |
python -m sglang.launch_server \
python3 -m sglang.launch_server \
--model-path $MODEL_ID \
--host 0.0.0.0 \
--port 8000 \
--grpc-mode \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: H200

- count: 1..8
- count: 1..4
scaling:
metric: rps
target: 2
target: 300
image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
commands:
- |
python -m sglang.launch_server \
python3 -m sglang.launch_server \
--model-path $MODEL_ID \
--host 0.0.0.0 \
--port 8000 \
--grpc-mode \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--port 8000
--disaggregation-transfer-backend nixl
resources:
gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
- type: http
url: /health
interval: 15s
```

</div>

> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
> With the `smg` router, workers communicate via gRPC as well as HTTP.
>
> On the router side, `--enable-igw` and `--model-path` are required for gRPC worker registration via an HTTP endpoint. This is how `dstack` registers workers with the SMG router.
>
> With SGLang gRPC workers, pass `--grpc-mode` to the worker launch command. To use [Mooncake Transfer](https://github.com/kvcache-ai/Mooncake), set `--disaggregation-transfer-backend mooncake`. For PD disaggregation with SGLang HTTP workers, see [SGLang PD Disaggregation](../examples/inference/sglang.md#pd-disaggregation).
>
> The SMG router supports only gRPC communication mode with vLLM workers. For PD disaggregation with vLLM, see [here](../examples/inference/vllm.md#pd-disaggregation).
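
As a rough illustration of the Mooncake option mentioned in the note, the prefill worker's `commands` could look like the fragment below. It reuses the flags from the example and swaps only the transfer backend. It is a sketch, not a tested configuration: whether the `ghcr.io/lightseekorg/smg` image ships with Mooncake support is an assumption to verify.

```yaml
# Sketch: prefill worker commands with Mooncake instead of NIXL.
# Assumption: the worker image includes Mooncake; this is not confirmed here.
commands:
  - |
    python3 -m sglang.launch_server \
      --model-path $MODEL_ID \
      --host 0.0.0.0 \
      --port 8000 \
      --grpc-mode \
      --disaggregation-mode prefill \
      --disaggregation-transfer-backend mooncake \
      --disaggregation-bootstrap-port 8998
```

The decode worker would presumably take the same backend value with `--disaggregation-mode decode`.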

=== "Dynamo"

2 changes: 0 additions & 2 deletions mkdocs/docs/examples/inference/sglang.md
@@ -211,8 +211,6 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/

</div>

> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.

=== "AMD"

The example below deploys `Qwen/Qwen2.5-72B-Instruct` on a multi-node cluster with AMD MI300X GPUs:
78 changes: 78 additions & 0 deletions mkdocs/docs/examples/inference/vllm.md
@@ -124,6 +124,84 @@ curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \

> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen36.<gateway domain>/`.

## Configuration options

### PD disaggregation

To run vLLM with [PD disaggregation](https://docs.vllm.ai/en/latest/serving/disagg_prefill.html), use replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers (`kv_producer`), and one for decode workers (`kv_consumer`).

<div editor-title="pd.dstack.yml">

```yaml
type: service
name: prefill-decode

env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
- count: 1
python: "3.12"
commands:
- pip install smg
- |
smg launch \
--pd-disaggregation \
--model-path $MODEL_ID \
--enable-igw \
--host 0.0.0.0 \
--port 8000 \
--prefill-policy cache_aware
router:
type: sglang
resources:
cpu: 4

- count: 1..4
scaling:
metric: rps
target: 3
image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
commands:
- |
python3 -m vllm.entrypoints.grpc_server \
--model "$MODEL_ID" \
--host 0.0.0.0 \
--port 8000 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
resources:
gpu: H200

- count: 1..8
scaling:
metric: rps
target: 2
image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
commands:
- |
python3 -m vllm.entrypoints.grpc_server \
--model "$MODEL_ID" \
--host 0.0.0.0 \
--port 8000 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
resources:
gpu: H200

port: 8000
```

</div>

> To use the [Mooncake Transfer](https://github.com/kvcache-ai/Mooncake) backend, set `"kv_connector": "MooncakeConnector"` in `--kv-transfer-config`.
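
For illustration, a prefill worker's `commands` with the Mooncake connector might look like the fragment below. Only the `kv_connector` value differs from the example above. This is a minimal sketch and assumes the worker image includes the Mooncake connector, which is not confirmed here.

```yaml
# Sketch: prefill (kv_producer) worker using MooncakeConnector.
# Assumption: the image provides Mooncake; verify before relying on it.
commands:
  - |
    python3 -m vllm.entrypoints.grpc_server \
      --model "$MODEL_ID" \
      --host 0.0.0.0 \
      --port 8000 \
      --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

The decode worker would use the same connector with `"kv_role":"kv_consumer"`.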

Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.

!!! info "Cluster"
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.

While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
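
A fleet that satisfies the `placement` requirement could be sketched roughly as follows. The fleet name and node count are hypothetical, and the sketch does not show how to provision the CPU router instance alongside the GPU instances.

```yaml
type: fleet
# Hypothetical name, for illustration only
name: pd-cluster
# Assumption: at least one router, one prefill, and one decode instance
nodes: 3
# Required so that replicas share an interconnect
placement: cluster
```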

## What's next?

1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)