Commit b3db973

Merge pull request #113 from VectorInstitute/feature/batch-mode
Added functionality to launch models in a batch; currently limited to single-node models, as multi-node model launches present some communication challenges.
2 parents 0c83309 + 6bd09da commit b3db973

22 files changed: +2,762 −888 lines

README.md

Lines changed: 3 additions & 2 deletions
@@ -103,10 +103,11 @@ export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 
 #### Other commands
 
-* `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
+* `batch-launch`: Launch multiple model inference servers at once; currently ONLY single-node models are supported.
+* `status`: Check the model status by providing its Slurm job ID.
 * `metrics`: Streams performance metrics to the console.
 * `shutdown`: Shutdown a model by providing its Slurm job ID.
-* `list`: List all available model names, or view the default/cached configuration of a specific model, `--json-mode` supported.
+* `list`: List all available model names, or view the default/cached configuration of a specific model.
 * `cleanup`: Remove old log directories. You can filter by `--model-family`, `--model-name`, `--job-id`, and/or `--before-job-id`. Use `--dry-run` to preview what would be deleted.
 
 For more details on the usage of these commands, refer to the [User Guide](https://vectorinstitute.github.io/vector-inference/user_guide/)
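To make the commands above concrete, here is a hedged sketch of typical invocations; the job ID and model names are placeholders taken from elsewhere in this diff, and exact flags may differ between versions:

```bash
# Sketch only: 17480109 and the model names below are illustrative placeholders.
vec-inf status 17480109                          # check server state by Slurm job ID
vec-inf metrics 17480109                         # stream performance metrics to the console
vec-inf shutdown 17480109                        # shut down the server's Slurm job
vec-inf list                                     # list all available model names
vec-inf list Meta-Llama-3.1-8B-Instruct          # view one model's default/cached config
vec-inf cleanup --model-family llama --dry-run   # preview old log directories to delete
```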

docs/user_guide.md

Lines changed: 51 additions & 2 deletions
@@ -4,7 +4,7 @@
 
 ### `launch` command
 
-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.
+The `launch` command allows users to launch an OpenAI-compatible model inference server as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.
 
 We will use the Llama 3.1 model as an example; to launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
 
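The launch command itself falls outside this hunk's context; presumably it is the plain single-model form, sketched here with the model name from the sentence above (options omitted, exact syntax may differ):

```bash
# Hedged sketch; the guide may pass additional options here.
vec-inf launch Meta-Llama-3.1-8B-Instruct
```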
@@ -97,6 +97,53 @@ export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 * For GPU partitions with non-Ampere architectures, e.g. `rtx6000`, `t4v2`, BF16 isn't supported. For models that have BF16 as the default type, when using a non-Ampere GPU, use FP16 instead, i.e. `--dtype: float16`.
 * Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.
 
+### `batch-launch` command
+
+The `batch-launch` command allows users to launch multiple inference servers at once. Here is an example of launching 2 models:
+
+```bash
+vec-inf batch-launch DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-PRM-7B
+```
+
+You should see an output like the following:
+
+```
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Job Config ┃ Value ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ Slurm Job ID │ 17480109 │
+│ Slurm Job Name │ BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5-Math-PRM-7B │
+│ Model Name │ DeepSeek-R1-Distill-Qwen-7B │
+│ Partition │ a40 │
+│ QoS │ m2 │
+│ Time Limit │ 08:00:00 │
+│ Num Nodes │ 1 │
+│ GPUs/Node │ 1 │
+│ CPUs/Task │ 16 │
+│ Memory/Node │ 64G │
+│ Log Directory │ /h/marshallw/.vec-inf-logs/BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5… │
+│ Model Name │ Qwen2.5-Math-PRM-7B │
+│ Partition │ a40 │
+│ QoS │ m2 │
+│ Time Limit │ 08:00:00 │
+│ Num Nodes │ 1 │
+│ GPUs/Node │ 1 │
+│ CPUs/Task │ 16 │
+│ Memory/Node │ 64G │
+│ Log Directory │ /h/marshallw/.vec-inf-logs/BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5… │
+└────────────────┴─────────────────────────────────────────────────────────────────────────┘
+```
+
+The inference servers will begin launching only after all requested resources have been allocated, preventing resource waste. Unlike the `launch` command, `batch-launch` does not accept additional launch parameters from the command line. Users must either:
+
+- Specify a batch launch configuration file using the `--batch-config` option, or
+- Ensure model launch configurations are available at the default location (cached config or user-defined `VEC_INF_CONFIG`)
+
+Since batch launches use heterogeneous jobs, users can request different partitions and resource amounts for each model. After launch, you can monitor individual servers using the standard commands (`status`, `metrics`, etc.) by providing the specific Slurm job ID for each server (e.g. 17480109+0, 17480109+1).
+
+**NOTE**
+* Currently only models that can fit on a single node (regardless of the node type) are supported; multi-node launches will be available in a future update.
+
 ### `status` command
 
 You can check the inference server status by providing the Slurm job ID to the `status` command:
@@ -138,7 +185,9 @@ There are 5 possible states:
 * **FAILED**: Inference server in an unhealthy state. Job failed reason will be shown.
 * **SHUTDOWN**: Inference server is shutdown/cancelled.
 
-Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
+**Note**
+* The base URL is only available when the model is in `READY` state.
+* For servers launched with `batch-launch`, the job ID should follow the format "MAIN_JOB_ID+OFFSET" (e.g. 17480109+0, 17480109+1).
 
 ### `metrics` command
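Tying the two batch-related notes above together, a hedged sketch of launching with an explicit batch config and then checking each server by its offset job ID; the YAML path is a hypothetical placeholder and option behaviour may differ:

```bash
# Hypothetical config path; `--batch-config` is the option described in the batch-launch section.
vec-inf batch-launch DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-PRM-7B \
    --batch-config /h/<username>/my-batch-config.yaml

# Each server in the heterogeneous job is addressed as "MAIN_JOB_ID+OFFSET".
vec-inf status 17480109+0   # first model (DeepSeek-R1-Distill-Qwen-7B)
vec-inf status 17480109+1   # second model (Qwen2.5-Math-PRM-7B)
```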

examples/slurm_dependency/run_downstream.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 
 if len(sys.argv) < 2:
     raise ValueError("Expected server job ID as the first argument.")
-job_id = int(sys.argv[1])
+job_id = sys.argv[1]
 
 vi_client = VecInfClient()
 print(f"Waiting for SLURM job {job_id} to be ready...")
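The `int()` cast is dropped presumably because batch-launched servers use job IDs such as `17480109+0`, which are not plain integers. A hedged sketch of invoking the example script (the job IDs are placeholders):

```bash
# Placeholders: substitute the job ID printed by `launch` or `batch-launch`.
python examples/slurm_dependency/run_downstream.py 17480109      # single-model launch
python examples/slurm_dependency/run_downstream.py 17480109+0    # batch-launched server
```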

tests/test_imports.py

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ def test_imports(self):
         import vec_inf.client._exceptions
         import vec_inf.client._helper
         import vec_inf.client._slurm_script_generator
+        import vec_inf.client._slurm_templates
         import vec_inf.client._utils
         import vec_inf.client.api
         import vec_inf.client.config
