Skip to content

Commit 4ccffe5

Browse files
fake0fann00909098knlnguyen1802herotai214gemini-code-assist[bot]
authored
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233)
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com> Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> Signed-off-by: herotai214 <herotai214@gmail.com> Signed-off-by: Khuong Le <khuong.le.manh@huawei.com> Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com> Co-authored-by: n00909098 <nguyen.kha.long@huawei.com> Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com> Co-authored-by: herotai214 <herotai214@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Khuong Le <khuong.le.manh@huawei.com> Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
1 parent cbb799e commit 4ccffe5

File tree

31 files changed

+5025
-41
lines changed

31 files changed

+5025
-41
lines changed
83.9 KB
Loading

docs/features/disagg_encoder.md

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
# Disaggregated Encoder
2+
3+
A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:
4+
5+
1. **Independent, fine-grained scaling**
6+
2. **Lower time-to-first-token (TTFT)**
7+
3. **Cross-process reuse and caching of encoder outputs**
8+
9+
Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
10+
11+
---
12+
13+
## 1 Motivation
14+
15+
### 1. Independent, fine-grained scaling
16+
17+
* Vision encoders are lightweight, while language models are orders of magnitude larger.
18+
* The language model can be parallelised without affecting the encoder fleet.
19+
* Encoder nodes can be added or removed independently.
20+
21+
### 2. Lower time-to-first-token (TTFT)
22+
23+
* Language-only requests bypass the vision encoder entirely.
24+
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
25+
26+
### 3. Cross-process reuse and caching
27+
28+
* In-process encoders confine reuse to a single worker.
29+
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
30+
31+
---
32+
33+
## 2 Usage Example
34+
35+
The current reference pathway is **SharedStorageConnector**.
36+
Below ready-to-run scripts shows the workflow:
37+
38+
1 Encoder instance + 1 PD instance:
39+
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`
40+
41+
1 Encoder instance + 1 Prefill instance + 1 Decode instance:
42+
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh`
43+
44+
---
45+
46+
## 3 Test Script
47+
48+
Please refer to the directories `tests/v1/ec_connector`
49+
50+
## 4 Development
51+
52+
Disaggregated encoding is implemented by running two parts:
53+
54+
* **Encoder instance** – a vLLM instance to performs vision encoding.
55+
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
56+
* PD can be in either a single normal instance with `disagg_encoder_example.sh` (E->PD) or in disaggregated instances with `disagg_epd_example.sh` (E->P->D)
57+
58+
A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
59+
All related code is under `vllm/distributed/ec_transfer`.
60+
61+
### Key abstractions
62+
63+
* **ECConnector** – interface for retrieving EC caches produced by the encoder.
64+
* *Scheduler role* – checks cache existence and schedules loads.
65+
* *Worker role* – loads the embeddings into memory.
66+
67+
Here is a figure illustrating disaggregate encoder flow:
68+
69+
![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)
70+
71+
For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
72+
73+
`docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)
74+
75+
We create the example setup with the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and referred to the `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the kv transfer between P and D;
Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
# Disaggregated Encoder
2+
3+
These example scripts that demonstrate the disaggregated encoder (EPD) features of vLLM.
4+
5+
For a detailed explanation of the EPD features, please refer to the [Disaggregated Encoder Feature Documentation](../../../docs/features/disagg_encoder.md).
6+
7+
## Files
8+
9+
- `disagg_epd_proxy.py` - Proxy script that demonstrates the XeYpZd setup (X encode instances, Y prefill instances, Z decode instances). Currently stable for the 1e1p1d configuration.
10+
11+
- `disagg_1e1p1d_example.sh` - Sets up the 1e1p1d configuration, runs the VisionArena benchmark, and processes a single request with a local image.
12+
13+
- `disagg_1e1pd_example.sh` - Sets up the 1e1pd configuration, runs the VisionArena benchmark, and processes a single request with a local image.
14+
15+
### Custom Configuration
16+
17+
```bash
18+
# Use specific GPUs
19+
GPU_E=0 GPU_PD=1 GPU_P=1 GPU_D=2 bash disagg_1e1p1d_example.sh
20+
21+
# Use specific ports
22+
ENDPOINT_PORT=10001 bash disagg_1e1p1d_example.sh
23+
24+
# Use specific model
25+
MODEL="Qwen/Qwen2.5-VL-3B-Instruct" bash disagg_1e1p1d_example.sh
26+
27+
# Use specific storage path
28+
EC_SHARED_STORAGE_PATH="/tmp/my_ec_cache" bash disagg_1e1p1d_example.sh
29+
```
30+
31+
## Encoder Instances
32+
33+
Encoder engines should be launched with the following flags:
34+
35+
- `--enforce-eager` **(required)** – The current EPD implementation is only compatible with encoder instances running in this mode.
36+
37+
- `--no-enable-prefix-caching` **(required)** – Encoder instances do not consume KV cache; prefix caching is disabled to avoid conflicts with other features.
38+
39+
- `--max-num-batched-tokens=<large value>` **(default: 2048)** – This flag controls the token scheduling budget per decoding step and is irrelevant to encoder-only instances. **Set it to a very high value (effectively unlimited) to bypass scheduler limitations.** The actual token budget is managed by the encoder cache manager.
40+
41+
## Local media inputs
42+
43+
To support local image inputs (from your ```MEDIA_PATH``` directory), add the following flag to the encoder instance:
44+
45+
```bash
46+
--allowed-local-media-path $MEDIA_PATH
47+
```
48+
49+
The vllm instances and `disagg_encoder_proxy` supports local URIs with ```{"url": "file://'"$MEDIA_PATH_FILENAME"'}``` as multimodal inputs. Each URI is passed unchanged from the `disagg_encoder_proxy` to the encoder instance so that the encoder can load the media locally.
50+
51+
## EC connector and KV transfer
52+
53+
The `ECSharedStorageConnector` is used to store the encoder cache on local disk and facilitate transfer. To enable the encoder disaggregation feature, add the following configuration:
54+
55+
```bash
56+
# Add to encoder instance:
57+
--ec-transfer-config '{
58+
"ec_connector": "ECSharedStorageConnector",
59+
"ec_role": "ec_producer",
60+
"ec_connector_extra_config": {
61+
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
62+
}
63+
}'
64+
65+
# Add to prefill/prefill+decode instance:
66+
--ec-transfer-config '{
67+
"ec_connector": "ECSharedStorageConnector",
68+
"ec_role": "ec_consumer",
69+
"ec_connector_extra_config": {
70+
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
71+
}
72+
}'
73+
```
74+
75+
`$EC_SHARED_STORAGE_PATH` is the path where the EC connector temporarily stores the cache.
76+
77+
If you enable prefill instance (`--prefill-servers-urls` not disabled), you will need --kv-transfer-config to facilitate the PD disaggregation. Currently, we use the `NixlConnector` for this purpose. Refer to `tests/v1/kv_connector/nixl_integration` for more example codes on PD disaggregation with Nixl.
78+
79+
```bash
80+
# Add to prefill instance:
81+
--kv-transfer-config '{
82+
"kv_connector": "NixlConnector",
83+
"kv_role": "kv_producer"
84+
}'
85+
86+
# Add to decode instance:
87+
--kv-transfer-config '{
88+
"kv_connector": "NixlConnector",
89+
"kv_role": "kv_consumer"
90+
}'
91+
```
92+
93+
## Proxy Instance Flags (`disagg_epd_proxy.py`)
94+
95+
| Flag | Description |
96+
|------|-------------|
97+
| `--encode-servers-urls` | Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion. |
98+
| `--prefill-servers-urls` | Comma-separated list of prefill endpoints. Set to `disable`, `none`, or `""` to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode). |
99+
| `--decode-servers-urls` | Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list. |
100+
| `--host`, `--port` | Bind address for the proxy itself (defaults: `0.0.0.0:8000`). |
101+
102+
Example usage:
103+
For E + PD setup:
104+
105+
```bash
106+
$ python disagg_encoder_proxy.py \
107+
--encode-servers-urls "http://e1:8001,http://e2:8002" \
108+
--prefill-servers-urls "disable" \
109+
--decode-servers-urls "http://pd1:8003,http://pd2:8004"
110+
```
111+
112+
For E + P + D setup:
113+
114+
```bash
115+
$ python disagg_encoder_proxy.py \
116+
--encode-servers-urls "http://e1:8001,http://e2:8001" \
117+
--prefill-servers-urls "http://p1:8003,http://p2:8004" \
118+
--decode-servers-urls "http://d1:8005,http://d2:8006"
119+
```

0 commit comments

Comments
 (0)