### `launch` command

The `launch` command allows users to launch an OpenAI-compatible model inference server as a Slurm job. If the job launches successfully, a URL endpoint is exposed for the user to send inference requests to.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
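
```bash
# Minimal invocation; the model name must match an entry in the model configuration
vec-inf launch Meta-Llama-3.1-8B-Instruct
```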

To use your own model configuration file, point the `VEC_INF_CONFIG` environment variable at it:

```bash
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

**NOTE**
* For GPU partitions with non-Ampere architectures (e.g. `rtx6000`, `t4v2`), BF16 isn't supported. For models that have BF16 as the default dtype, use FP16 instead when running on a non-Ampere GPU, i.e. `--dtype: float16`.
* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple nodes of GPUs.

### `batch-launch` command

The `batch-launch` command allows users to launch multiple inference servers at once. Here is an example of launching two models:

```bash
vec-inf batch-launch DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-PRM-7B
```

You should see an output like the following:

```
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Job Config     ┃ Value                                                                   ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Slurm Job ID   │ 17480109                                                                │
│ Slurm Job Name │ BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5-Math-PRM-7B                   │
│ Model Name     │ DeepSeek-R1-Distill-Qwen-7B                                             │
│ Partition      │ a40                                                                     │
│ QoS            │ m2                                                                      │
│ Time Limit     │ 08:00:00                                                                │
│ Num Nodes      │ 1                                                                       │
│ GPUs/Node      │ 1                                                                       │
│ CPUs/Task      │ 16                                                                      │
│ Memory/Node    │ 64G                                                                     │
│ Log Directory  │ /h/marshallw/.vec-inf-logs/BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5…   │
│ Model Name     │ Qwen2.5-Math-PRM-7B                                                     │
│ Partition      │ a40                                                                     │
│ QoS            │ m2                                                                      │
│ Time Limit     │ 08:00:00                                                                │
│ Num Nodes      │ 1                                                                       │
│ GPUs/Node      │ 1                                                                       │
│ CPUs/Task      │ 16                                                                      │
│ Memory/Node    │ 64G                                                                     │
│ Log Directory  │ /h/marshallw/.vec-inf-logs/BATCH-DeepSeek-R1-Distill-Qwen-7B-Qwen2.5…   │
└────────────────┴─────────────────────────────────────────────────────────────────────────┘
```

The inference servers will begin launching only after all requested resources have been allocated, preventing resource waste. Unlike the `launch` command, `batch-launch` does not accept additional launch parameters from the command line. Users must either:

- Specify a batch launch configuration file using the `--batch-config` option (see the example below), or
- Ensure model launch configurations are available at the default location (cached config or user-defined `VEC_INF_CONFIG`)
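
A minimal sketch of the first option (the configuration file path here is a placeholder):

```bash
vec-inf batch-launch DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-PRM-7B --batch-config /h/<username>/my-batch-config.yaml
```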

Since batch launches use heterogeneous Slurm jobs, users can request different partitions and resource amounts for each model. After launch, you can monitor individual servers using the standard commands (`status`, `metrics`, etc.) by providing the specific Slurm job ID for each server (e.g. 17480109+0, 17480109+1).
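
For example, to check the status of the first server in the batch above:

```bash
vec-inf status 17480109+0
```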

**NOTE**
* Currently, only models that can fit on a single node (regardless of the node type) are supported; multi-node launches will be available in a future update.

### `status` command

You can check the inference server status by providing the Slurm job ID to the `status` command:
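
```bash
# Replace 17480109 with your own Slurm job ID
vec-inf status 17480109
```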

There are 5 possible states:

* **FAILED**: Inference server in an unhealthy state. The reason for the job failure will be shown.
* **SHUTDOWN**: Inference server is shut down/cancelled.

**Note**
* The base URL is only available when the model is in `READY` state.
* For servers launched with `batch-launch`, the job ID should follow the format "MAIN_JOB_ID+OFFSET" (e.g. 17480109+0, 17480109+1).
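
Once the server is in `READY` state, you can send OpenAI-compatible requests to its base URL. Below is a minimal `curl` sketch; the host, port, and model name are placeholders, so substitute the base URL reported by `status` and the model you launched:

```bash
# Hypothetical base URL; use the one reported by `vec-inf status`
BASE_URL=http://<node-hostname>:8080/v1

curl "${BASE_URL}/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```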

### `metrics` command