
Commit 018c5ef

hero0307 and t00939662 authored

[Docs] correct the error in docs (#340)

* [Fix] correct the error in docs

Co-authored-by: t00939662 <tianxuehan@huawei.com>
1 parent feb498b commit 018c5ef

File tree

5 files changed (+41, -24 lines changed)


docs/source/getting-started/installation_npu.md

Lines changed: 10 additions & 0 deletions
@@ -62,6 +62,16 @@ pip install -v -e . --no-build-isolation
 cd ..
 ```
 
+The vLLM and vLLM Ascend code is placed in /vllm-workspace; refer to [vLLM-Ascend Installation](https://vllm-ascend.readthedocs.io/en/latest/installation.html) for more information. After installation, apply the patches below so that uc_connector can be used:
+```bash
+cd /vllm-workspace/vllm
+git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
+cd /vllm-workspace/vllm-ascend
+git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
+```
+Refer to [vllm-issue](https://github.com/vllm-project/vllm/issues/21702) and [vllm-ascend-issue](https://github.com/vllm-project/vllm-ascend/issues/2057) for details of the patches' changes.
+
 
 ## Setup from docker
 Download the pre-built docker image provided or build unified-cache-management docker image by commands below:
 ```bash

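As a side note beyond the diff above: a minimal sketch of a dry-run before applying the patches, assuming the repositories and patch files sit at the paths shown in the added lines; `git apply --check` only validates the patch and leaves the working tree untouched.

```bash
# Sketch only: verify each patch applies cleanly before actually applying it.
cd /vllm-workspace/vllm
git apply --check /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch \
  && git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch

cd /vllm-workspace/vllm-ascend
git apply --check /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch \
  && git apply /vllm-workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-ascend-adapt.patch
```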
docs/source/getting-started/quick_start.md

Lines changed: 5 additions & 3 deletions
@@ -21,14 +21,16 @@ Before you start with UCM, please make sure that you have installed UCM correctl
 
 ## Features Overview
 
-UCM supports two key features: **Prefix Cache** and **GSA Sparsity**.
+UCM supports two key features: **Prefix Cache** and **Sparse Attention**.
 
 Each feature supports both **Offline Inference** and **Online API** modes.
 
 For quick start, just follow the [usage](./quick_start.md) guide below to launch your own inference experience;
 
-For further research, click on the links blow to see more details of each feature:
+For further research on Prefix Cache, see the link below for more details:
 - [Prefix Cache](../user-guide/prefix-cache/index.md)
+
+Various Sparse Attention features are now available; try GSA Sparsity via the link below:
 - [GSA Sparsity](../user-guide/sparse-attention/gsa.md)
 
 ## Usage
@@ -47,7 +49,7 @@ python offline_inference.py
 
 </details>
 
-<details>
+<details open>
 <summary><b>OpenAI-Compatible Online API</b></summary>
 
 For online inference , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

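For context on the online mode that the diff above now opens by default: once the OpenAI-compatible server is up, it can be exercised with a plain completions request. A minimal sketch, assuming the server listens on localhost:8000 and serves the model path used elsewhere in these docs.

```bash
# Sketch only: host, port, and model path are placeholders; adjust to your deployment.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/models/Qwen2.5-7B-Instruct",
        "prompt": "Explain prefix caching in one sentence.",
        "max_tokens": 64,
        "temperature": 0
      }'
```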
docs/source/user-guide/pd-disaggregation/1p1d.md

Lines changed: 9 additions & 8 deletions
@@ -5,16 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated
 
 ## Prerequisites
 - UCM: Installed with reference to the Installation documentation.
-- Hardware: At least 2 GPUs
+- Hardware: At least 2 GPUs or 2 NPUs
 
 ## Start disaggregated service
-For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
+For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. Use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service on the Ascend platform.
 
 ### Run prefill server
 Prefiller Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -41,8 +42,9 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run decode server
 Decoder Launch Command:
 ```bash
-export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export PYTHONHASHSEED=123456
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -68,7 +70,7 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other.
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
 ```
 
@@ -88,8 +90,7 @@ curl http://localhost:7802/v1/completions \
 ### Benchmark Test
 Use the benchmark scripts provided by vLLM.
 ```bash
-cd /vllm-workspace/vllm/benchmarks
-python3 benchmark_serving.py \
+vllm bench serve \
 --backend vllm \
 --dataset-name random \
 --random-input-len 4096 \

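For reference, an illustrative end-to-end form of the `vllm bench serve` command introduced above, pointed at the 1p1d proxy; only the flags visible in the diff are confirmed, and the remaining flags and values are assumptions.

```bash
# Sketch only: flags beyond --backend, --dataset-name, and --random-input-len are assumed, not taken from the commit.
vllm bench serve \
  --backend vllm \
  --model /home/models/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 256 \
  --num-prompts 64 \
  --host localhost \
  --port 7802
```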
docs/source/user-guide/pd-disaggregation/npgd.md

Lines changed: 4 additions & 4 deletions
@@ -50,7 +50,8 @@ vllm serve /home/models/Qwen2.5-7B-Instruct \
 Decoder Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -77,7 +78,7 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other.
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
 ```
 
@@ -97,8 +98,7 @@ curl http://localhost:7802/v1/completions \
 ### Benchmark Test
 Use the benchmark scripts provided by vLLM.
 ```bash
-cd /vllm-workspace/vllm/benchmarks
-python3 benchmark_serving.py \
+vllm bench serve \
 --backend vllm \
 --dataset-name random \
 --random-input-len 4096 \

docs/source/user-guide/pd-disaggregation/xpyd.md

Lines changed: 13 additions & 9 deletions
@@ -5,15 +5,17 @@ This example demonstrates how to run unified-cache-management with disaggregated
 
 ## Prerequisites
 - UCM: Installed with reference to the Installation documentation.
-- Hardware: At least 4 GPUs (At least 2 GPUs for prefiller + 2 for decoder in 2d2p setup)
+- Hardware: At least 4 GPUs or 4 NPUs (at least 2 for the prefillers + 2 for the decoders in a 2d2p setup)
 
 ## Start disaggregated service
-For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
+For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. Use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service on the Ascend platform.
+
 ### Run prefill servers
 Prefiller1 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=0
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -40,7 +42,8 @@ CUDA_VISIBLE_DEVICES=0 vllm serve /home/models/Qwen2.5-7B-Instruct \
 Prefiller2 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=1
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -68,7 +71,8 @@ CUDA_VISIBLE_DEVICES=1 vllm serve /home/models/Qwen2.5-7B-Instruct \
 Decoder1 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=2
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -94,7 +98,8 @@ CUDA_VISIBLE_DEVICES=2 vllm serve /home/models/Qwen2.5-7B-Instruct \
 Decoder2 Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
+export CUDA_VISIBLE_DEVICES=3
+vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
 --gpu_memory_utilization 0.87 \
@@ -121,7 +126,7 @@ CUDA_VISIBLE_DEVICES=3 vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other. the number of prefill/decode hosts should be equal to the number of prefill/decode ports.
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --pd-disaggregation --host localhost --port 7805 --prefiller-hosts <prefill-node-ip-1> <prefill-node-ip-2> --prefiller-port 7800 7801 --decoder-hosts <decoder-node-ip-1> <decoder-node-ip-2> --decoder-ports 7802 7803
 ```
 
@@ -141,8 +146,7 @@ curl http://localhost:7805/v1/completions \
 ### Benchmark Test
 Use the benchmark scripts provided by vLLM.
 ```bash
-cd /vllm-workspace/vllm/benchmarks
-python3 benchmark_serving.py \
+vllm bench serve \
 --backend vllm \
 --dataset-name random \
 --random-input-len 4096 \

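Finally, tying back to the note about Ascend in the 1p1d and xpyd pages: the launch commands stay the same on NPUs except for the visible-device variable. A minimal sketch of the Prefiller1 command on the Ascend platform, with the UCM connector flags from the full documentation omitted.

```bash
# Sketch only: connector-specific flags from the full launch command are omitted here.
export PYTHONHASHSEED=123456
export ASCEND_RT_VISIBLE_DEVICES=0   # replaces CUDA_VISIBLE_DEVICES on Ascend
vllm serve /home/models/Qwen2.5-7B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 1 \
  --gpu_memory_utilization 0.87
```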