# AIPerf Benchmarking for NeMo Guardrails

## Introduction

[AIPerf](https://github.com/ai-dynamo/aiperf) is NVIDIA's latest benchmarking tool for LLMs. It supports any OpenAI-compatible inference service, generates synthetic data loads, runs benchmarks, and produces the metrics needed for performance comparison and analysis.

The [`run_aiperf.py`](run_aiperf.py) script enhances AIPerf's capabilities by providing:

- **Batch Execution**: Run multiple benchmarks in sequence with a single command
- **Parameter Sweeps**: Automatically generate and run benchmarks across different parameter combinations (e.g., concurrency levels or token counts)
- **Organized Results**: Automatically organize benchmark results in timestamped directories with clear naming conventions
- **YAML Configuration**: Define simple, declarative configuration files for reproducible benchmark runs
- **Run Metadata**: Save complete metadata about each run (configuration, command, timestamp) for later analysis and reproduction
- **Service Health Checks**: Validate that the target service is available before starting benchmarks

Instead of manually running AIPerf multiple times with different parameters, you can define a sweep in a YAML file and let the script handle the rest.

## Getting Started

### Prerequisites

These steps have been tested with Python 3.11.11.
To use the provided configurations, you need accounts at [build.nvidia.com](https://build.nvidia.com/) and [Hugging Face](https://huggingface.co/):

* The provided configurations use models hosted at https://build.nvidia.com/, so you'll need to create a Personal API Key to access them.
* The provided AIPerf configurations require the [Meta Llama 3.3 70B Instruct tokenizer](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to calculate token counts.

1. **Create and activate a virtual environment for AIPerf**

   ```bash
   $ mkdir ~/env
   $ python -m venv ~/env/aiperf
   $ source ~/env/aiperf/bin/activate
   ```

2. **Install dependencies in the virtual environment**

   ```bash
   $ pip install aiperf huggingface_hub typer
   ```

3. **Log in to Hugging Face**

   ```bash
   $ huggingface-cli login
   ```

4. **Set your NVIDIA API key**

   The provided configs use models hosted on [build.nvidia.com](https://build.nvidia.com/).
   To access these, [create an account](https://build.nvidia.com/) and generate a Personal API Key.
   Then set the `NVIDIA_API_KEY` environment variable:

   ```bash
   $ export NVIDIA_API_KEY="your-api-key-here"
   ```
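
Optionally, you can sanity-check that the gated Llama 3.3 tokenizer downloads correctly before running a benchmark. This is a minimal sketch, assuming the `transformers` package is available in your environment (run `pip install transformers` if it is not):

```python
# Optional sanity check: confirm the gated tokenizer is accessible after
# `huggingface-cli login`. Requires the `transformers` package (an assumption,
# not part of the install steps above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
print(tokenizer("hello world")["input_ids"])  # prints token IDs if access works
```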

## Running Benchmarks

Each benchmark is configured using the `AIPerfConfig` Pydantic model in [aiperf_models.py](aiperf_models.py).
Configs are stored in YAML files and converted to `AIPerfConfig` objects.
Two example configs are included, both of which use NVIDIA-hosted models and can be extended for your own use cases:

- [`single_concurrency.yaml`](aiperf_configs/single_concurrency.yaml): Example single-run benchmark with a single concurrency value.
- [`sweep_concurrency.yaml`](aiperf_configs/sweep_concurrency.yaml): Example multiple-run benchmark that sweeps concurrency values and runs a new benchmark for each.
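
As a rough illustration, loading a config amounts to parsing the YAML and validating it with Pydantic. This is a minimal sketch, assuming `AIPerfConfig` takes the YAML's top-level keys as fields; the actual loading logic lives in [`run_aiperf.py`](run_aiperf.py):

```python
# Minimal sketch of YAML -> AIPerfConfig validation. Field access such as
# `config.base_config.model` is an assumption based on this README, not a
# verified API; see run_aiperf.py for the real implementation.
import yaml

from aiperf_models import AIPerfConfig

with open("aiperf_configs/single_concurrency.yaml") as f:
    raw = yaml.safe_load(f)

config = AIPerfConfig(**raw)  # Pydantic validates types and required fields here
print(config.batch_name, config.base_config.model)
```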

To run a benchmark, use the following command:

```bash
$ python -m benchmark.aiperf --config-file <path-to-config.yaml>
```

### Running a Single Benchmark

To run a single benchmark with fixed parameters, use the `single_concurrency.yaml` configuration:

```bash
$ python -m benchmark.aiperf --config-file aiperf_configs/single_concurrency.yaml
```

**Example output:**

```text
2025-12-01 10:35:17 INFO: Running AIPerf with configuration: aiperf_configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Results root directory: aiperf_results/single_concurrency/20251201_103517
2025-12-01 10:35:17 INFO: Sweeping parameters: None
2025-12-01 10:35:17 INFO: Running AIPerf with configuration: aiperf_configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Output directory: aiperf_results/single_concurrency/20251201_103517
2025-12-01 10:35:17 INFO: Single Run
2025-12-01 10:36:54 INFO: Run completed successfully
2025-12-01 10:36:54 INFO: SUMMARY
2025-12-01 10:36:54 INFO: Total runs : 1
2025-12-01 10:36:54 INFO: Completed : 1
2025-12-01 10:36:54 INFO: Failed : 0
```

### Running a Concurrency Sweep

To run multiple benchmarks with different concurrency levels, use the `sweep_concurrency.yaml` configuration:

```bash
$ python -m benchmark.aiperf --config-file aiperf_configs/sweep_concurrency.yaml
```

**Example output:**

```text
2025-11-14 14:02:54 INFO: Running AIPerf with configuration: nemoguardrails/benchmark/aiperf/aiperf_configs/sweep_concurrency.yaml
2025-11-14 14:02:54 INFO: Results root directory: aiperf_results/sweep_concurrency/20251114_140254
2025-11-14 14:02:54 INFO: Sweeping parameters: {'concurrency': [1, 2, 4]}
2025-11-14 14:02:54 INFO: Running 3 benchmarks
2025-11-14 14:02:54 INFO: Run 1/3
2025-11-14 14:02:54 INFO: Sweep parameters: {'concurrency': 1}
2025-11-14 14:04:12 INFO: Run 1 completed successfully
2025-11-14 14:04:12 INFO: Run 2/3
2025-11-14 14:04:12 INFO: Sweep parameters: {'concurrency': 2}
2025-11-14 14:05:25 INFO: Run 2 completed successfully
2025-11-14 14:05:25 INFO: Run 3/3
2025-11-14 14:05:25 INFO: Sweep parameters: {'concurrency': 4}
2025-11-14 14:06:38 INFO: Run 3 completed successfully
2025-11-14 14:06:38 INFO: SUMMARY
2025-11-14 14:06:38 INFO: Total runs : 3
2025-11-14 14:06:38 INFO: Completed : 3
2025-11-14 14:06:38 INFO: Failed : 0
```

## Additional Options

### Dry-Run Mode

The `--dry-run` option lets you preview all benchmark commands without executing them. This is useful for:

- Validating your configuration file
- Checking which parameter combinations will be generated
- Estimating total execution time before committing to a long-running sweep
- Debugging configuration issues

```bash
$ python -m benchmark.aiperf --config-file aiperf_configs/sweep_concurrency.yaml --dry-run
```

In dry-run mode, the script will:

- Load and validate your configuration
- Check service connectivity
- Generate all sweep combinations
- Display what would be executed
- Exit without running any benchmarks

### Verbose Mode

The `--verbose` option prints more detailed debugging information so you can follow each step of the benchmarking process.

```bash
$ python -m benchmark.aiperf --config-file <config.yaml> --verbose
```

Verbose mode provides:

- Complete command-line arguments passed to AIPerf
- Detailed parameter merging logic (base config + sweep params)
- Output directory creation details
- Real-time AIPerf output (normally captured to files)
- Full stack traces for errors

**Tip:** Use verbose mode when debugging configuration issues or when you want to see live progress of the benchmark execution.

## Configuration Files

Configuration files are YAML files located in [aiperf_configs](aiperf_configs). Each configuration is validated using Pydantic models to catch errors early.

### Top-Level Configuration Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `batch_name` | string | Yes | Name for this batch of benchmarks. Used in output directory naming (e.g., `aiperf_results/<batch_name>/<timestamp>/`) |
| `output_base_dir` | string | Yes | Base directory where all benchmark results will be stored |
| `base_config` | object | Yes | Base configuration parameters applied to all benchmark runs (see below) |
| `sweeps` | object | No | Optional parameter sweeps for running multiple benchmarks with different values |
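
To make the naming concrete, here is how the results root is assembled. This is a sketch inferred from the directory names in the example output above, not the script's actual code:

```python
# Sketch of results-root naming (aiperf_results/<batch_name>/<timestamp>/),
# inferred from the example output above; see run_aiperf.py for the real logic.
from datetime import datetime
from pathlib import Path

output_base_dir = Path("aiperf_results")  # from `output_base_dir`
batch_name = "my_benchmark"               # from `batch_name`
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # e.g. 20251114_140254

results_root = output_base_dir / batch_name / timestamp
print(results_root)  # aiperf_results/my_benchmark/20251114_140254
```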

### Base Configuration Parameters

The `base_config` section contains parameters that are passed to AIPerf. Any of these can be overridden by sweep parameters.

#### Model and Service Configuration

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | Model identifier (e.g., `meta/llama-3.3-70b-instruct`) |
| `tokenizer` | string | No | Tokenizer name from Hugging Face or local path. If not provided, AIPerf will attempt to use the model name |
| `url` | string | Yes | Base URL of the inference service (e.g., `https://integrate.api.nvidia.com`) |
| `endpoint` | string | No | API endpoint path (default: `/v1/chat/completions`) |
| `endpoint_type` | string | No | Type of endpoint: `chat` or `completions` (default: `chat`) |
| `api_key_env_var` | string | No | Name of the environment variable containing the API key (e.g., `NVIDIA_API_KEY`) |
| `streaming` | boolean | No | Whether to use streaming mode (default: `false`) |

#### Load Generation Settings

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `warmup_request_count` | integer | Yes | Number of warmup requests to send before starting the benchmark |
| `benchmark_duration` | integer | Yes | Duration of the benchmark in seconds |
| `concurrency` | integer | Yes | Number of concurrent requests to maintain during the benchmark |
| `request_rate` | float | No | Target request rate in requests/second. If not provided, calculated from concurrency |
| `request_rate_mode` | string | No | Distribution mode: `constant` or `poisson` (default: `constant`) |
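
The two `request_rate_mode` values differ in how requests are spaced over time. The sketch below is purely illustrative of the two distributions, not AIPerf's implementation:

```python
# Illustrative contrast of constant vs. poisson request pacing at 10 req/s.
# This is not AIPerf's code; it only demonstrates the two distribution modes.
import random

rate = 10.0  # requests per second

constant_gaps = [1.0 / rate for _ in range(5)]               # fixed 100 ms spacing
poisson_gaps = [random.expovariate(rate) for _ in range(5)]  # bursty, exponential gaps

print(constant_gaps)  # [0.1, 0.1, 0.1, 0.1, 0.1]
print(poisson_gaps)   # e.g. [0.03, 0.21, 0.08, 0.01, 0.15]
```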

#### Synthetic Data Generation

These parameters control the generation of synthetic prompts for benchmarking:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `random_seed` | integer | No | Random seed for reproducible synthetic data generation |
| `prompt_input_tokens_mean` | integer | No | Mean number of input tokens per prompt |
| `prompt_input_tokens_stddev` | integer | No | Standard deviation of the input token count |
| `prompt_output_tokens_mean` | integer | No | Mean number of expected output tokens |
| `prompt_output_tokens_stddev` | integer | No | Standard deviation of the output token count |
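
To give a feel for these knobs, the sketch below samples prompt lengths from a normal distribution. AIPerf's exact sampling scheme is not specified here, so the distribution choice is an assumption for illustration only:

```python
# Illustrative only: how mean/stddev parameters could shape prompt lengths.
# A normal distribution is assumed purely for demonstration; AIPerf's actual
# sampling scheme is not documented in this README.
import random

random.seed(12345)  # mirrors `random_seed` for reproducibility

def sample_length(mean: int, stddev: int) -> int:
    return max(1, round(random.gauss(mean, stddev)))

input_tokens = sample_length(100, 10)  # prompt_input_tokens_mean/stddev
output_tokens = sample_length(50, 5)   # prompt_output_tokens_mean/stddev
print(input_tokens, output_tokens)
```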

### Parameter Sweeps

The `sweeps` section allows you to run multiple benchmarks with different parameter values. The script generates the **Cartesian product** of all sweep values, running a separate benchmark for each combination.

#### Basic Sweep Example

```yaml
sweeps:
  concurrency: [1, 2, 4, 8, 16]
```

This will run 5 benchmarks, one for each concurrency level.

#### Multi-Parameter Sweep Example

```yaml
sweeps:
  concurrency: [1, 4, 16]
  prompt_input_tokens_mean: [100, 500, 1000]
```

This will run **9 benchmarks**, one for each combination of `concurrency` and `prompt_input_tokens_mean`.
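
A minimal sketch of this Cartesian-product expansion (illustrative; the script's own sweep logic lives in [`run_aiperf.py`](run_aiperf.py)):

```python
# Sketch of the Cartesian-product expansion described above (illustrative,
# not the script's actual implementation).
from itertools import product

sweeps = {
    "concurrency": [1, 4, 16],
    "prompt_input_tokens_mean": [100, 500, 1000],
}

keys = list(sweeps)
combos = [dict(zip(keys, values)) for values in product(*sweeps.values())]

print(len(combos))  # 9 -> one benchmark run per combination
print(combos[0])    # {'concurrency': 1, 'prompt_input_tokens_mean': 100}
```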

Each sweep combination creates a subdirectory named with the parameter values:

```text
aiperf_results/
└── my_benchmark/
    └── 20251114_140254/
        ├── concurrency1_prompt_input_tokens_mean100/
        ├── concurrency1_prompt_input_tokens_mean500/
        ├── concurrency4_prompt_input_tokens_mean100/
        └── ...
```
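
The naming convention appears to be each parameter name immediately followed by its value, joined with underscores. A sketch, inferred from the tree above rather than taken from the script:

```python
# Sketch of the subdirectory naming convention, inferred from the tree above
# ("<param><value>" pairs joined by "_"); not the script's actual code.
params = {"concurrency": 1, "prompt_input_tokens_mean": 100}
dirname = "_".join(f"{key}{value}" for key, value in params.items())
print(dirname)  # concurrency1_prompt_input_tokens_mean100
```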

### Complete Configuration Example

```yaml
# Name for this batch of benchmarks
batch_name: my_benchmark

# Base directory where all benchmark results will be stored
output_base_dir: aiperf_results

# Base configuration applied to all benchmark runs
base_config:
  # Model and service configuration
  model: meta/llama-3.3-70b-instruct
  tokenizer: meta-llama/Llama-3.3-70B-Instruct
  url: "https://integrate.api.nvidia.com"
  endpoint: "/v1/chat/completions"
  endpoint_type: chat
  api_key_env_var: NVIDIA_API_KEY
  streaming: true

  # Load generation settings
  warmup_request_count: 20
  benchmark_duration: 60
  concurrency: 1
  request_rate_mode: "constant"

  # Synthetic data generation
  random_seed: 12345
  prompt_input_tokens_mean: 100
  prompt_input_tokens_stddev: 10
  prompt_output_tokens_mean: 50
  prompt_output_tokens_stddev: 5

# Optional: parameter sweeps (Cartesian product)
sweeps:
  concurrency: [1, 2, 4, 8, 16]
  prompt_input_tokens_mean: [100, 500, 1000]
```

### Common Sweep Patterns

#### Concurrency Scaling Test

```yaml
sweeps:
  concurrency: [1, 2, 4, 8, 16, 32, 64]
```

Useful for finding optimal concurrency levels and throughput limits.

#### Token Length Impact Test

```yaml
sweeps:
  prompt_input_tokens_mean: [50, 100, 500, 1000, 2000]
  prompt_output_tokens_mean: [50, 100, 500, 1000]
```

Runs 5 × 4 = 20 benchmarks. Useful for understanding how token counts affect latency and throughput.

#### Request Rate Comparison

```yaml
sweeps:
  request_rate_mode: ["constant", "poisson"]
  concurrency: [4, 8, 16]
```

Runs 2 × 3 = 6 benchmarks. Useful for comparing different load patterns.

## Output Structure

Results are organized in timestamped directories:

```text
aiperf_results/
├── <batch_name>/
│   └── <timestamp>/
│       ├── run_metadata.json    # Single run
│       ├── process_result.json
│       └── <aiperf_outputs>
│       # OR for sweeps:
│       ├── concurrency1/
│       │   ├── run_metadata.json
│       │   ├── process_result.json
│       │   └── <aiperf_outputs>
│       ├── concurrency2/
│       │   └── ...
│       └── concurrency4/
│           └── ...
```
| 343 | + |
| 344 | +### Output Files |
| 345 | + |
| 346 | +Each run directory contains multiple files with benchmark results and metadata. A summary of these is shown below: |
| 347 | + |
| 348 | +#### Benchmark runner files |
| 349 | + |
| 350 | +- **`run_metadata.json`**: Contains complete metadata about the benchmark run for reproducibility. |
| 351 | +- **`process_result.json`**: Contains the subprocess execution results. |
| 352 | + |
| 353 | +#### Files Generated by AIPerf |
| 354 | + |
| 355 | +- **`inputs.json`**: Synthetic prompt data generated for the benchmark. |
| 356 | +- **`profile_export_aiperf.json`**: Main metrics file in JSON format containing aggregated statistics. |
| 357 | +- **`profile_export_aiperf.csv`**: Same metrics as the JSON file, but in CSV format for easy import into spreadsheet tools or data analysis libraries. |
| 358 | +- **`profile_export.jsonl`**: JSON Lines format file containing per-request metrics. Each line is a complete JSON object for one request with: |
| 359 | +- **`logs/aiperf.log`**: Detailed log file from AIPerf execution containing: |
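
For downstream analysis, the per-request JSONL export is straightforward to load. A minimal sketch; the path is an example, and the available fields depend on your AIPerf version, so inspect a record first:

```python
# Sketch: loading per-request metrics from profile_export.jsonl for analysis.
# The run path is an example; field names depend on your AIPerf version.
import json
from pathlib import Path

path = Path("aiperf_results/my_benchmark/20251114_140254/concurrency1/profile_export.jsonl")
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

print(f"{len(records)} requests")
if records:
    print(sorted(records[0].keys()))  # discover which metric fields are present
```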

## Resources

- [AIPerf GitHub Repository](https://github.com/ai-dynamo/aiperf)
- [NVIDIA API Catalog](https://build.nvidia.com/)