Commit a229b90

Adding recipe for running Qwen models in G4 (#41)

* Adding recipe for running Qwen models in G4
* Updating README.md with G4 details
* Using docker images for vLLM
* Correct the readme path
* Fix link formatting in README.md for Docker installation
1 parent 5735ea1 commit a229b90

2 files changed: +186 -0 lines changed

README.md

Lines changed: 6 additions & 0 deletions

@@ -78,6 +78,12 @@ Models | GPU Machine Type
| **DeepSeek R1 671B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | vLLM | Inference | GKE | [Link](./inference/a4/single-host-serving/vllm/README.md)
| **DeepSeek R1 671B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | SGLang | Inference | GKE | [Link](./inference/a4/single-host-serving/sglang/README.md)

### Inference benchmarks G4

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
| **Qwen3 8B** | [G4 (NVIDIA RTX PRO 6000 Blackwell)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series) | vLLM | Inference | GCE | [Link](./inference/g4/qwen-8b/single-host-serving/vllm/README.md)

### Checkpointing benchmarks

Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe

Lines changed: 180 additions & 0 deletions

@@ -0,0 +1,180 @@

# Single host inference benchmark of Qwen3-8B with vLLM on G4

This recipe shows how to serve and benchmark the Qwen3-8B model using [vLLM](https://github.com/vllm-project/vllm) on a single GCP VM with G4 GPUs. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types).

## Before you begin

### 1. Create a GCP VM with G4 GPUs

First, we will create a Google Cloud Platform (GCP) Virtual Machine (VM) that has the necessary GPU resources.

Make sure you have the following prerequisites:

* [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) is initialized.
* You have a project with a GPU quota. See [Request a quota increase](https://cloud.google.com/docs/quota/view-request#requesting_higher_quota).
* [Enable required APIs](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com); see the example commands after this list.
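
If you prefer to do the project setup from the command line, here is a minimal sketch; the project ID value is a placeholder you must replace with your own.

```bash
# Authenticate and point the SDK at your project (placeholder project ID).
gcloud auth login
gcloud config set project your-project-id

# Enable the Compute Engine API used by the commands in this recipe.
gcloud services enable compute.googleapis.com
```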

The following commands set up environment variables and create a GCE instance. `MACHINE_TYPE` is set to `g4-standard-48` for a single-GPU VM; more information on the different machine types can be found in the [GCP documentation](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types). The boot disk is set to 200GB to accommodate the models and dependencies.

```bash
export VM_NAME="${USER}-g4-test"
export PROJECT_ID="your-project-id"
export ZONE="your-zone"
# g4-standard-48 is for a single-GPU VM. For a multi-GPU VM (e.g., 8 GPUs), you can use g4-standard-384.
export MACHINE_TYPE="g4-standard-48"
export IMAGE_PROJECT="ubuntu-os-accelerator-images"
export IMAGE_FAMILY="ubuntu-accelerator-2404-amd64-with-nvidia-570"

gcloud compute instances create ${VM_NAME} \
  --machine-type=${MACHINE_TYPE} \
  --project=${PROJECT_ID} \
  --zone=${ZONE} \
  --image-project=${IMAGE_PROJECT} \
  --image-family=${IMAGE_FAMILY} \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=200GB
```

### 2. Connect to the VM

Use `gcloud compute ssh` to connect to the newly created instance.

```bash
gcloud compute ssh ${VM_NAME?} --project=${PROJECT_ID?} --zone=${ZONE?}
```

```bash
# Run nvidia-smi to verify the driver installation and see the available GPUs.
nvidia-smi
```

## Serve a model

### 1. Install Docker

Before you can serve the model, you need to have Docker installed on your VM. You can follow the official documentation to install Docker on Ubuntu:
[Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/)

After installing Docker, make sure the Docker daemon is running.
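
As a quick, scriptable alternative to the step-by-step apt instructions in the linked guide, the sketch below uses Docker's convenience script; treat it as one possible path and prefer the official instructions for production setups.

```bash
# Install Docker Engine via Docker's convenience script.
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Make sure the daemon is enabled and running, then verify the installation.
sudo systemctl enable --now docker
sudo docker run --rm hello-world
```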

### 2. Install NVIDIA Container Toolkit

To enable Docker containers to access the GPU, you need to install the NVIDIA Container Toolkit. This toolkit allows the container to interact with the NVIDIA driver on the host machine, making the GPU resources available within the container.

You can follow the official NVIDIA documentation to install the container toolkit:
[NVIDIA Container Toolkit Install Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
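
The sketch below condenses the linked guide: add NVIDIA's apt repository, install the toolkit, register it as a Docker runtime, and run a GPU smoke test with the same vLLM image used later in this recipe. The repository setup commands may change over time, so check the guide for the current versions.

```bash
# Add NVIDIA's apt repository and install the toolkit (see the linked guide for up-to-date repo setup).
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Smoke test: the host GPUs should be visible from inside a container.
sudo docker run --rm --runtime nvidia --gpus all --entrypoint nvidia-smi vllm/vllm-openai:latest
```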

### 3. Install vLLM

We will use the official vLLM Docker image. This image comes with vLLM and all its dependencies pre-installed.
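
The serving command below passes a Hugging Face token from the `HF_TOKEN` environment variable into the container. If the model you serve requires authentication, export your token first; the value shown here is a placeholder.

```bash
# Placeholder: replace with your own token from your Hugging Face account settings.
export HF_TOKEN="<your-hugging-face-token>"
```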

To run the vLLM server, you can use the following command:

```bash
sudo docker run \
  --runtime nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model nvidia/Qwen3-8B-FP4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95
```

Here's a breakdown of the arguments:

- `--runtime nvidia --gpus all`: This makes the NVIDIA GPUs available inside the container.
- `-v ~/.cache/huggingface:/root/.cache/huggingface`: This mounts the Hugging Face cache directory from the host to the container. This is useful for caching downloaded models.
- `--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"`: This sets the Hugging Face Hub token as an environment variable in the container. This is required for downloading models that require authentication.
- `-p 8000:8000`: This maps port 8000 on the host to port 8000 in the container.
- `--ipc=host`: This allows the container to share the host's IPC namespace, which can improve performance.
- `vllm/vllm-openai:latest`: This is the name of the official vLLM Docker image.
- `--model nvidia/Qwen3-8B-FP4`: The model to be served from Hugging Face.
- `--kv-cache-dtype fp8`: Sets the data type for the key-value cache to FP8 to save GPU memory.
- `--gpu-memory-utilization 0.95`: The fraction of GPU memory to be used by vLLM.

For more information on the available engine arguments, you can refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/configuration/engine_args/), which also covers the different parallelism strategies that can be used in a multi-GPU setup.
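
For example, on a multi-GPU VM such as `g4-standard-384` (mentioned in the VM creation step), one common option is tensor parallelism, which shards the model across the GPUs. The sketch below only adds vLLM's `--tensor-parallel-size` engine argument to the serving command above; treat it as an untested variant of this recipe.

```bash
sudo docker run \
  --runtime nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model nvidia/Qwen3-8B-FP4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 8
```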

After running the command, the model will be served. To run the benchmark, you will need to either run the server in the background by appending `&` to the command, or open a new terminal to run the benchmark command.

## Run Benchmarks for Qwen3-8B-FP4

### 1. Server Output

When the server is up and running, you should see output similar to the following.

```
(APIServer pid=XXXXXX) INFO XX-XX XX:XX:XX [launcher.py:XX] Route: /metrics, Methods: GET
(APIServer pid=XXXXXX) INFO: Started server process [XXXXXX]
(APIServer pid=XXXXXX) INFO: Waiting for application startup.
(APIServer pid=XXXXXX) INFO: Application startup complete.
```
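
Before benchmarking, you can optionally confirm the server is responding. A quick sanity check, assuming the port mapping above and vLLM's OpenAI-compatible API:

```bash
# List the models the server exposes; the response should include nvidia/Qwen3-8B-FP4.
curl http://localhost:8000/v1/models

# Send a small completion request.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/Qwen3-8B-FP4", "prompt": "San Francisco is a", "max_tokens": 16}'
```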

### 2. Run the benchmarks

To run the benchmark, you can use the following command:

```bash
sudo docker run \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  --entrypoint vllm \
  vllm/vllm-openai:latest bench serve \
  --model nvidia/Qwen3-8B-FP4 \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 2048 \
  --request-rate inf \
  --num-prompts 100 \
  --ignore-eos
```

Here's a breakdown of the arguments:

- `--model nvidia/Qwen3-8B-FP4`: The model to benchmark.
- `--dataset-name random`: The dataset to use for the benchmark. `random` will generate random prompts.
- `--random-input-len 128`: The length of the random input prompts, in tokens.
- `--random-output-len 2048`: The length of the generated output, in tokens.
- `--request-rate inf`: The number of requests per second to send. `inf` sends requests as fast as possible.
- `--num-prompts 100`: The total number of prompts to send.
- `--ignore-eos`: A flag to ignore the end-of-sequence (EOS) token so that a fixed number of output tokens is generated per request.
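
The `inf` request rate measures peak throughput. To approximate a steadier load, you can cap the request rate and send more prompts; the example below only changes the values of flags described above, and the specific numbers are illustrative.

```bash
sudo docker run \
  --runtime nvidia \
  --gpus all \
  --network="host" \
  --entrypoint vllm \
  vllm/vllm-openai:latest bench serve \
  --model nvidia/Qwen3-8B-FP4 \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 2048 \
  --request-rate 4 \
  --num-prompts 500 \
  --ignore-eos
```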

### 3. Example output

The output shows various performance metrics of the model, such as throughput and latency.

```bash
============ Serving Benchmark Result ============
Successful requests:                     XX
Request rate configured (RPS):           XX
Benchmark duration (s):                  XX
Total input tokens:                      XX
Total generated tokens:                  XX
Request throughput (req/s):              XX
Output token throughput (tok/s):         XX
Total Token throughput (tok/s):          XX
---------------Time to First Token----------------
Mean TTFT (ms):                          XX
Median TTFT (ms):                        XX
P99 TTFT (ms):                           XX
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          XX
Median TPOT (ms):                        XX
P99 TPOT (ms):                           XX
---------------Inter-token Latency----------------
Mean ITL (ms):                           XX
Median ITL (ms):                         XX
P99 ITL (ms):                            XX
==================================================
```

## Clean up

### 1. Delete the VM

This command will delete the GCE instance and all its disks.

```bash
gcloud compute instances delete ${VM_NAME?} --zone=${ZONE?} --project=${PROJECT_ID?} --quiet --delete-disks=all
```
