
Commit f716ee0

feat(benchmark): AIPerf run script (#1501)
* WIP: First round of performance regressions are working
* Remove the rampup_seconds calculation, use warmup request count instead
* Add benchmark directory to pyright type-checking
* Add aiperf typer to the top-level nemoguardrails app
* Add a quick GET to the /v1/models endpoint before running any benchmarks
* Add tests for aiperf Pydantic models
* Rename benchmark_seconds to benchmark_duration, add tokenizer optional field
* Add single-concurrency config, rename both
* Change configs to use NVCF hosted Llama 3.3 70B model
* Refactor single and sweep benchmark runs, add API key env var logic to get environment variable
* Add tests for run_aiperf.py
* Revert changes to llm/providers/huggingface/streamers.py
* Address greptile feedback
* Add README for AIPerf scripts
* Fix hard-coded forward-slash in path name
* Fix hard-coded forward-slash in path name
* Fix type: ignore line in huggingface streamers
* Remove content_safety_colang2 Guardrail config from benchmark
* Fix TextStreamer import Pyright waiver
* Revert changes to server to give OpenAI-compliant responses
* Add API key to /v1/models check, adjust description of AIPerf in CLI
* Revert server changes
* Address PR feedback
* Move aiperf code to top-level
* Update tests for new aiperf location
* Rename configs directory
* Create self-contained typer app, update README with new commands to run it
* Rebase onto develop and re-run ruff formatter
* Move aiperf under benchmark dir
* Move aiperf under benchmark dir
1 parent d3fb3d6 commit f716ee0

File tree

10 files changed: +2626 -0 lines changed


benchmark/aiperf/README.md

Lines changed: 365 additions & 0 deletions

# AIPerf Benchmarking for NeMo Guardrails

## Introduction

[AIPerf](https://github.com/ai-dynamo/aiperf) is NVIDIA's latest benchmarking tool for LLMs. It supports any OpenAI-compatible inference service and generates synthetic data loads, benchmarks, and all the metrics needed for performance comparison and analysis.

The [`run_aiperf.py`](run_aiperf.py) script enhances AIPerf's capabilities by providing:

- **Batch Execution**: Run multiple benchmarks in sequence with a single command
- **Parameter Sweeps**: Automatically generate and run benchmarks across different parameter combinations (e.g., sweeping concurrency levels, token counts, etc.)
- **Organized Results**: Automatically organizes benchmark results in timestamped directories with clear naming conventions
- **YAML Configuration**: Simple, declarative configuration files for reproducible benchmark runs
- **Run Metadata**: Saves complete metadata about each run (configuration, command, timestamp) for future analysis and reproduction
- **Service Health Checks**: Validates that the target service is available before starting benchmarks

Instead of manually running AIPerf multiple times with different parameters, you can define a sweep in a YAML file and let the script handle the rest.

## Getting Started

### Prerequisites

These steps have been tested with Python 3.11.11.
To use the provided configurations, you need to create accounts at https://build.nvidia.com/ and [Hugging Face](https://huggingface.co/).

* The provided configurations use models hosted at https://build.nvidia.com/; you'll need to create a Personal API Key to access them.
* The provided AIPerf configurations require the [Meta Llama 3.3 70B Instruct tokenizer](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to calculate token counts.

1. **Create a virtual environment in which to install AIPerf**

   ```bash
   $ mkdir ~/env
   $ python -m venv ~/env/aiperf
   ```

2. **Install dependencies in the virtual environment**

   ```bash
   $ pip install aiperf huggingface_hub typer
   ```

3. **Log in to Hugging Face:**

   ```bash
   $ huggingface-cli login
   ```

4. **Set the NVIDIA API key:**

   The provided configs use models hosted on [build.nvidia.com](https://build.nvidia.com/).
   To access these, [create an account](https://build.nvidia.com/) and create a Personal API Key.
   Then set the `NVIDIA_API_KEY` environment variable as shown below.

   ```bash
   $ export NVIDIA_API_KEY="your-api-key-here"
   ```

## Running Benchmarks

Each benchmark is configured using the `AIPerfConfig` Pydantic model in [aiperf_models.py](aiperf_models.py).
The configs are stored in YAML files and converted to an `AIPerfConfig` object, so invalid settings are caught before any benchmark starts (see the sketch below).
Two example configs are included, both using NVIDIA-hosted models, which can be extended for your use cases:

- [`single_concurrency.yaml`](aiperf_configs/single_concurrency.yaml): Example single-run benchmark with a single concurrency value.
- [`sweep_concurrency.yaml`](aiperf_configs/sweep_concurrency.yaml): Example multiple-run benchmark that sweeps concurrency values and runs a new benchmark for each.
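
As a rough illustration of how the YAML-to-Pydantic conversion works, the sketch below loads a config file into a trimmed-down stand-in for `AIPerfConfig`. It is not the actual contents of [aiperf_models.py](aiperf_models.py); the class names and reduced field set here are purely illustrative.

```python
# Illustrative sketch only: a trimmed-down stand-in for AIPerfConfig.
# The real model in aiperf_models.py defines the full field set.
from pathlib import Path
from typing import Optional

import yaml
from pydantic import BaseModel


class BaseConfigSketch(BaseModel):
    model: str
    url: str
    warmup_request_count: int
    benchmark_duration: int
    concurrency: int


class AIPerfConfigSketch(BaseModel):
    batch_name: str
    output_base_dir: str
    base_config: BaseConfigSketch
    sweeps: Optional[dict[str, list]] = None


def load_config(path: str) -> AIPerfConfigSketch:
    """Read a YAML config and validate it, failing fast on missing or bad fields."""
    data = yaml.safe_load(Path(path).read_text())
    return AIPerfConfigSketch(**data)
```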

To run a benchmark, use the following command:

```bash
$ python -m benchmark.aiperf --config-file <path-to-config.yaml>
```

### Running a Single Benchmark

To run a single benchmark with fixed parameters, use the `single_concurrency.yaml` configuration:

```bash
$ python -m benchmark.aiperf --config-file aiperf/configs/single_concurrency.yaml
```

**Example output:**

```text
2025-12-01 10:35:17 INFO: Running AIPerf with configuration: aiperf/configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Results root directory: aiperf_results/single_concurrency/20251201_103517
2025-12-01 10:35:17 INFO: Sweeping parameters: None
2025-12-01 10:35:17 INFO: Running AIPerf with configuration: aiperf/configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Output directory: aiperf_results/single_concurrency/20251201_103517
2025-12-01 10:35:17 INFO: Single Run
2025-12-01 10:36:54 INFO: Run completed successfully
2025-12-01 10:36:54 INFO: SUMMARY
2025-12-01 10:36:54 INFO: Total runs : 1
2025-12-01 10:36:54 INFO: Completed : 1
2025-12-01 10:36:54 INFO: Failed : 0
```

### Running a Concurrency Sweep

To run multiple benchmarks with different concurrency levels, use the `sweep_concurrency.yaml` configuration as shown below:

```bash
$ python -m benchmark.aiperf --config-file aiperf/configs/sweep_concurrency.yaml
```

**Example output:**

```text
2025-11-14 14:02:54 INFO: Running AIPerf with configuration: nemoguardrails/benchmark/aiperf/aiperf_configs/sweep_concurrency.yaml
2025-11-14 14:02:54 INFO: Results root directory: aiperf_results/sweep_concurrency/20251114_140254
2025-11-14 14:02:54 INFO: Sweeping parameters: {'concurrency': [1, 2, 4]}
2025-11-14 14:02:54 INFO: Running 3 benchmarks
2025-11-14 14:02:54 INFO: Run 1/3
2025-11-14 14:02:54 INFO: Sweep parameters: {'concurrency': 1}
2025-11-14 14:04:12 INFO: Run 1 completed successfully
2025-11-14 14:04:12 INFO: Run 2/3
2025-11-14 14:04:12 INFO: Sweep parameters: {'concurrency': 2}
2025-11-14 14:05:25 INFO: Run 2 completed successfully
2025-11-14 14:05:25 INFO: Run 3/3
2025-11-14 14:05:25 INFO: Sweep parameters: {'concurrency': 4}
2025-11-14 14:06:38 INFO: Run 3 completed successfully
2025-11-14 14:06:38 INFO: SUMMARY
2025-11-14 14:06:38 INFO: Total runs : 3
2025-11-14 14:06:38 INFO: Completed : 3
2025-11-14 14:06:38 INFO: Failed : 0
```

## Additional Options

### AIPerf Run Options

The `--dry-run` option allows you to preview all benchmark commands without executing them. This is useful for:

- Validating your configuration file
- Checking which parameter combinations will be generated
- Estimating total execution time before committing to a long-running sweep
- Debugging configuration issues

```bash
$ python -m benchmark.aiperf --config-file aiperf/configs/sweep_concurrency.yaml --dry-run
```

In dry-run mode, the script will:

- Load and validate your configuration
- Check service connectivity (see the sketch below)
- Generate all sweep combinations
- Display what would be executed
- Exit without running any benchmarks
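
Conceptually, the connectivity check is a single authenticated GET against the service's OpenAI-compatible `/v1/models` endpoint before any load is generated. The sketch below illustrates that idea, assuming the `requests` library is available; the helper name and exact behavior are hypothetical, not the script's actual implementation.

```python
# Hypothetical illustration of a pre-flight service health check:
# a single GET to the OpenAI-compatible /v1/models endpoint.
import os

import requests


def check_service(url: str, api_key_env_var: str | None = None, timeout: float = 10.0) -> bool:
    """Return True if GET {url}/v1/models answers with HTTP 200."""
    headers = {}
    if api_key_env_var:
        api_key = os.environ.get(api_key_env_var, "")
        headers["Authorization"] = f"Bearer {api_key}"
    response = requests.get(f"{url.rstrip('/')}/v1/models", headers=headers, timeout=timeout)
    return response.status_code == 200


if __name__ == "__main__":
    ok = check_service("https://integrate.api.nvidia.com", api_key_env_var="NVIDIA_API_KEY")
    print("service reachable" if ok else "service unreachable")
```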

### Verbose Mode

The `--verbose` option prints more detailed debugging information, letting you follow each step of the benchmarking process.

```bash
$ python -m benchmark.aiperf --config-file <config.yaml> --verbose
```

Verbose mode provides:

- Complete command-line arguments passed to AIPerf
- Detailed parameter merging logic (base config + sweep params)
- Output directory creation details
- Real-time AIPerf output (normally captured to files)
- Full stack traces for errors

**Tip:** Use verbose mode when debugging configuration issues or when you want to see live progress of the benchmark execution.

## Configuration Files

Configuration files are YAML files located in [aiperf_configs](aiperf_configs). The configuration is validated using Pydantic models to catch errors early.

### Top-Level Configuration Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `batch_name` | string | Yes | Name for this batch of benchmarks. Used in output directory naming (e.g., `aiperf_results/batch_name/timestamp/`) |
| `output_base_dir` | string | Yes | Base directory where all benchmark results will be stored |
| `base_config` | object | Yes | Base configuration parameters applied to all benchmark runs (see below) |
| `sweeps` | object | No | Optional parameter sweeps for running multiple benchmarks with different values |

### Base Configuration Parameters

The `base_config` section contains parameters that are passed to AIPerf. Any of these can be overridden by sweep parameters.

#### Model and Service Configuration

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | Model identifier (e.g., `meta/llama-3.3-70b-instruct`) |
| `tokenizer` | string | No | Tokenizer name from Hugging Face or local path. If not provided, AIPerf will attempt to use the model name |
| `url` | string | Yes | Base URL of the inference service (e.g., `https://integrate.api.nvidia.com`) |
| `endpoint` | string | No | API endpoint path (default: `/v1/chat/completions`) |
| `endpoint_type` | string | No | Type of endpoint: `chat` or `completions` (default: `chat`) |
| `api_key_env_var` | string | No | Name of environment variable containing API key (e.g., `NVIDIA_API_KEY`) |
| `streaming` | boolean | No | Whether to use streaming mode (default: `false`) |

#### Load Generation Settings

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `warmup_request_count` | integer | Yes | Number of warmup requests to send before starting the benchmark |
| `benchmark_duration` | integer | Yes | Duration of the benchmark in seconds |
| `concurrency` | integer | Yes | Number of concurrent requests to maintain during the benchmark |
| `request_rate` | float | No | Target request rate in requests/second. If not provided, calculated from concurrency |
| `request_rate_mode` | string | No | Distribution mode: `constant` or `poisson` (default: `constant`) |

#### Synthetic Data Generation

These parameters control the generation of synthetic prompts for benchmarking:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `random_seed` | integer | No | Random seed for reproducible synthetic data generation |
| `prompt_input_tokens_mean` | integer | No | Mean number of input tokens per prompt |
| `prompt_input_tokens_stddev` | integer | No | Standard deviation of input token count |
| `prompt_output_tokens_mean` | integer | No | Mean number of expected output tokens |
| `prompt_output_tokens_stddev` | integer | No | Standard deviation of output token count |

### Parameter Sweeps

The `sweeps` section allows you to run multiple benchmarks with different parameter values. The script generates a **Cartesian product** of all sweep values, running a separate benchmark for each combination (a sketch of this expansion follows the examples below).

#### Basic Sweep Example

```yaml
sweeps:
  concurrency: [1, 2, 4, 8, 16]
```

This will run 5 benchmarks, one for each concurrency level.

#### Multi-Parameter Sweep Example

```yaml
sweeps:
  concurrency: [1, 4, 16]
  prompt_input_tokens_mean: [100, 500, 1000]
```

This will run **9 benchmarks**, one for each combination of `concurrency` and `prompt_input_tokens_mean` values (3 × 3).
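
For reference, the sweep expansion behaves like a nested loop over the sweep lists. The sketch below is illustrative rather than the script's actual code: it merges each combination onto a (truncated) base config and would print nine parameter sets for the sweep above.

```python
# Minimal sketch of the Cartesian-product sweep expansion (illustrative only).
from itertools import product

base_config = {"model": "meta/llama-3.3-70b-instruct", "concurrency": 1}
sweeps = {
    "concurrency": [1, 4, 16],
    "prompt_input_tokens_mean": [100, 500, 1000],
}

# One run per combination: 3 * 3 = 9 runs for the sweep above.
names, value_lists = zip(*sweeps.items())
for values in product(*value_lists):
    run_params = {**base_config, **dict(zip(names, values))}
    print(run_params)
```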

Each sweep combination creates a subdirectory named with the parameter values:

```text
aiperf_results/
└── my_benchmark/
    └── 20251114_140254/
        ├── concurrency1_prompt_input_tokens_mean100/
        ├── concurrency1_prompt_input_tokens_mean500/
        ├── concurrency4_prompt_input_tokens_mean100/
        └── ...
```

### Complete Configuration Example

```yaml
# Name for this batch of benchmarks
batch_name: my_benchmark

# Base directory where all benchmark results will be stored
output_base_dir: aiperf_results

# Base configuration applied to all benchmark runs
base_config:
  # Model and service configuration
  model: meta/llama-3.3-70b-instruct
  tokenizer: meta-llama/Llama-3.3-70B-Instruct
  url: "https://integrate.api.nvidia.com"
  endpoint: "/v1/chat/completions"
  endpoint_type: chat
  api_key_env_var: NVIDIA_API_KEY
  streaming: true

  # Load generation settings
  warmup_request_count: 20
  benchmark_duration: 60
  concurrency: 1
  request_rate_mode: "constant"

  # Synthetic data generation
  random_seed: 12345
  prompt_input_tokens_mean: 100
  prompt_input_tokens_stddev: 10
  prompt_output_tokens_mean: 50
  prompt_output_tokens_stddev: 5

# Optional: parameter sweeps (Cartesian product)
sweeps:
  concurrency: [1, 2, 4, 8, 16]
  prompt_input_tokens_mean: [100, 500, 1000]
```

### Common Sweep Patterns

#### Concurrency Scaling Test

```yaml
sweeps:
  concurrency: [1, 2, 4, 8, 16, 32, 64]
```

Useful for finding optimal concurrency levels and throughput limits.

#### Token Length Impact Test

```yaml
sweeps:
  prompt_input_tokens_mean: [50, 100, 500, 1000, 2000]
  prompt_output_tokens_mean: [50, 100, 500, 1000]
```

Useful for understanding how token counts affect latency and throughput.

#### Request Rate Comparison

```yaml
sweeps:
  request_rate_mode: ["constant", "poisson"]
  concurrency: [4, 8, 16]
```

Useful for comparing different load patterns.
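
As background on the two `request_rate_mode` values: for the same average rate, a constant mode spaces requests evenly, while a poisson mode draws exponentially distributed gaps between requests, producing bursts and lulls. The sketch below is a conceptual illustration of that difference, not AIPerf's scheduling code.

```python
# Conceptual sketch (not AIPerf's implementation): constant vs. poisson request scheduling.
# For the same average rate, "constant" uses fixed gaps, while "poisson" draws
# exponentially distributed gaps between requests.
import random

rate = 4.0          # target requests per second
num_requests = 10

constant_gaps = [1.0 / rate] * num_requests
poisson_gaps = [random.expovariate(rate) for _ in range(num_requests)]

print("constant:", [round(g, 3) for g in constant_gaps])
print("poisson: ", [round(g, 3) for g in poisson_gaps])
```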

## Output Structure

Results are organized in timestamped directories:

```text
aiperf_results/
├── <batch_name>/
│   └── <timestamp>/
│       ├── run_metadata.json    # Single run
│       ├── process_result.json
│       └── <aiperf_outputs>
│       # OR for sweeps:
│       ├── concurrency1/
│       │   ├── run_metadata.json
│       │   ├── process_result.json
│       │   └── <aiperf_outputs>
│       ├── concurrency2/
│       │   └── ...
│       └── concurrency4/
│           └── ...
```

### Output Files

Each run directory contains multiple files with benchmark results and metadata. A summary of these is shown below, followed by a sketch of how you might load them for analysis.

#### Benchmark Runner Files

- **`run_metadata.json`**: Contains complete metadata about the benchmark run for reproducibility.
- **`process_result.json`**: Contains the subprocess execution results.

#### Files Generated by AIPerf

- **`inputs.json`**: Synthetic prompt data generated for the benchmark.
- **`profile_export_aiperf.json`**: Main metrics file in JSON format containing aggregated statistics.
- **`profile_export_aiperf.csv`**: Same metrics as the JSON file, but in CSV format for easy import into spreadsheet tools or data analysis libraries.
- **`profile_export.jsonl`**: JSON Lines format file containing per-request metrics; each line is a complete JSON object for one request.
- **`logs/aiperf.log`**: Detailed log file from AIPerf execution.
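
Once a sweep has finished, you can compare runs by walking the timestamped results directory and loading each run's aggregated metrics file. The snippet below is a rough sketch that relies only on the directory layout and filenames described above and makes no assumptions about the metric fields inside the JSON; the example path is taken from the sweep output shown earlier.

```python
# Rough sketch: collect aggregated metrics from every run under one batch/timestamp directory.
# Assumes only the layout and filenames documented above; metric field names are not assumed.
import json
from pathlib import Path

results_root = Path("aiperf_results/sweep_concurrency/20251114_140254")

for metrics_file in sorted(results_root.rglob("profile_export_aiperf.json")):
    run_name = metrics_file.parent.relative_to(results_root)
    with metrics_file.open() as f:
        metrics = json.load(f)
    # Print the top-level keys so you can see what AIPerf exported for this run.
    print(run_name, sorted(metrics.keys()) if isinstance(metrics, dict) else type(metrics))
```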

## Resources

- [AIPerf GitHub Repository](https://github.com/ai-dynamo/aiperf)
- [AIPerf Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html)
- [NVIDIA API Catalog](https://build.nvidia.com/)

benchmark/aiperf/__init__.py

Lines changed: 14 additions & 0 deletions

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
