
Commit 3fecdae

Author: t00939662 (committed)

[Feat] correct the docs

1 parent 79af398 · commit 3fecdae

File tree

4 files changed (+10 -13 lines changed)


docs/source/getting-started/quick_start.md

Lines changed: 5 additions & 3 deletions
````diff
@@ -21,14 +21,16 @@ Before you start with UCM, please make sure that you have installed UCM correctl
 
 ## Features Overview
 
-UCM supports two key features: **Prefix Cache** and **Sparse-attention**.
+UCM supports two key features: **Prefix Cache** and **Sparse attention**.
 
 Each feature supports both **Offline Inference** and **Online API** modes.
 
 For quick start, just follow the [usage](./quick_start.md) guide below to launch your own inference experience;
 
-For further research, click on the links blow to see more details of each feature:
+For further research on Prefix Cache, more details are available via the link below:
 - [Prefix Cache](../user-guide/prefix-cache/index.md)
+
+Various Sparse Attention features are now available, try GSA Sparsity via the link below:
 - [GSA Sparsity](../user-guide/sparse-attention/gsa.md)
 
 ## Usage
@@ -47,7 +49,7 @@ python offline_inference.py
 
 </details>
 
-<details>
+<details open>
 <summary><b>OpenAI-Compatible Online API</b></summary>
 
 For online inference , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
````
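Once the server is up, the OpenAI-compatible endpoint can be exercised directly. A minimal sketch, assuming vLLM's default port 8000 and the model path used in the launch commands elsewhere in these docs:

```bash
# Query the OpenAI-compatible completions endpoint
# (assumes vLLM's default port 8000; model path and prompt are illustrative).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/models/Qwen2.5-7B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 32
  }'
```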

docs/source/user-guide/pd-disaggregation/1p1d.md

Lines changed: 2 additions & 8 deletions
````diff
@@ -8,16 +8,13 @@ This example demonstrates how to run unified-cache-management with disaggregated
 - Hardware: At least 2 GPUs or 2 NPUs
 
 ## Start disaggregated service
-For illustration purposes, let us assume that the model used is Qwen2.5-7B-Instruct.
+For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. Use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service on the Ascend platform.
 
 ### Run prefill server
 Prefiller Launch Command:
 ```bash
 export PYTHONHASHSEED=123456
-# For GPU devices, use the following command:
 export CUDA_VISIBLE_DEVICES=0
-# For NPU devices, use the following command:
-export ASCEND_RT_VISIVLE_DEVEICES=0
 vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
@@ -45,11 +42,8 @@ vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run decode server
 Decoder Launch Command:
 ```bash
-export PYTHONHASHSEED=123456
-# For GPU devices, use the following command:
+export PYTHONHASHSEED=123456
 export CUDA_VISIBLE_DEVICES=0
-# For NPU devices, use the following command:
-export ASCEND_RT_VISIVLE_DEVEICES=0
 vllm serve /home/models/Qwen2.5-7B-Instruct \
 --max-model-len 20000 \
 --tensor-parallel-size 1 \
````
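Per the note added above, Ascend deployments swap only the device-visibility variable. A minimal sketch of the prefill launch on NPU, assuming the remaining flags match the GPU command:

```bash
# Ascend/NPU variant of the prefill launch; a sketch, with the remaining
# flags identical to the GPU command above (truncated here as in the diff).
export PYTHONHASHSEED=123456
export ASCEND_RT_VISIBLE_DEVICES=0   # replaces CUDA_VISIBLE_DEVICES on Ascend
vllm serve /home/models/Qwen2.5-7B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 1
```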

docs/source/user-guide/pd-disaggregation/npgd.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -78,7 +78,7 @@ vllm serve /home/models/Qwen2.5-7B-Instruct \
 ### Run proxy server
 Make sure prefill nodes and decode nodes can connect to each other.
 ```bash
-cd vllm-workspace/unified-cache-management/ucm/pd
+cd /vllm-workspace/unified-cache-management/ucm/pd
 python3 toy_proxy_server.py --host localhost --port 7802 --prefiller-host <prefill-node-ip> --prefiller-port 7800 --decoder-host <decode-node-ip> --decoder-port 7801
 ```
 
````
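With both servers registered, the proxy behaves like a single endpoint. A minimal sketch of a request through it, assuming toy_proxy_server.py forwards standard OpenAI-style completion requests on port 7802:

```bash
# Completion request routed through the proxy on port 7802
# (assumes the proxy forwards OpenAI-style requests; prompt is illustrative).
curl http://localhost:7802/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/models/Qwen2.5-7B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 16
  }'
```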

docs/source/user-guide/pd-disaggregation/xpyd.md

Lines changed: 2 additions & 1 deletion
````diff
@@ -8,7 +8,8 @@ This example demonstrates how to run unified-cache-management with disaggregated
 - Hardware: At least 4 GPUs (At least 2 GPUs for prefiller + 2 for decoder in 2d2p setup or 2 NPUs for prefiller + 2 for decoder in 2d2p setup)
 
 ## Start disaggregated service
-For illustration purposes, let us take a GPU as an example and assume the model used is Qwen2.5-7B-Instruct.
+For illustration purposes, let us take GPU as an example and assume the model used is Qwen2.5-7B-Instruct. Use ASCEND_RT_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES to specify visible devices when starting the service on the Ascend platform.
+
 ### Run prefill servers
 Prefiller1 Launch Command:
 ```bash
````
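In a 2d2p layout, each of the four servers pins its own device. A minimal sketch of the first prefiller's launch, assuming device index 0 and port 7800 (the prefiller port used in the proxy example); the other three servers would follow the same pattern with their own indices and ports:

```bash
# Prefiller1 in a 2d2p setup (sketch; device index and port are assumptions,
# with prefiller2 and the decoders on devices 1, 2, 3 in their own shells).
export PYTHONHASHSEED=123456
export CUDA_VISIBLE_DEVICES=0
vllm serve /home/models/Qwen2.5-7B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 1 \
  --port 7800
```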
