
Commit c4d975c

feat(benchmark): Add Procfile to run Guardrails and mock LLMs (#1490)
* Add full path to server uvicorn.run() call
* Remove unused files
* Add Procfile with the commands to run Guardrails, content safety mock, and APP LLM mock
* Add simple script to validate Guardrails and mocks are running correctly
* Restructure configs into mocks and Guardrails configs
* Add tests against validate_mocks.py
* Convert f-string logging statements to %s format
* Add tests for validate_mocks.py
* Remove script-testing unit-tests
* Add pyproject.toml 'benchmark' extra to install honcho and requests
* Add workers CLI argument to pass to uvicorn app invocation
* Use httpx rather than requests to call and validate mocks
* Revert "Add pyproject.toml 'benchmark' extra to install honcho and requests" (reverts commit 20f3726)
* Removed commented line in Procfile
* Add README explaining core-benchmarking
* Small README tweaks
1 parent afb1bf0 commit c4d975c

File tree

9 files changed: +842 −2 lines changed


nemoguardrails/benchmark/Procfile

Lines changed: 8 additions & 0 deletions
# Procfile

# NeMo Guardrails server
gr: poetry run nemoguardrails server --config configs/guardrail_configs --default-config-id content_safety_colang1 --port 9000

# Guardrails NIMs for inference
app_llm: poetry run python mock_llm_server/run_server.py --workers 4 --port 8000 --config-file configs/mock_configs/meta-llama-3.3-70b-instruct.env
cs_llm: poetry run python mock_llm_server/run_server.py --workers 4 --port 8001 --config-file configs/mock_configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env

nemoguardrails/benchmark/README.md

Lines changed: 165 additions & 0 deletions
# Guardrails Benchmarking

NeMo Guardrails includes benchmarking tools to help users capacity-test their Guardrails applications.
Adding guardrails to an LLM-based application improves safety and security, while adding some latency. These benchmarks allow users to quantify the trade-off between security and latency and make data-driven decisions.
We currently have a simple testbench that runs the Guardrails server with mocks standing in for the Guardrail and Application models. It can be used for performance testing on a laptop without any GPUs, and runs in a few minutes.

-----

## Guardrails Core Benchmarking

This benchmark measures the performance of the Guardrails application running on a CPU-only laptop or instance.
It doesn't require GPUs to run local models, or internet access to use models hosted by providers.
All models use the [Mock LLM Server](mock_llm_server), which is a simplified stand-in for an LLM used for inference.
The aim of this benchmark is to detect performance regressions as quickly as running unit tests.

## Quickstart: Running Guardrails with Mock LLMs

To run Guardrails with mocks for both the content-safety and main LLM, follow the steps below.
All commands must be run in the `nemoguardrails/benchmark` directory.
These steps assume you already have a working environment after following the steps in [CONTRIBUTING.md](../../CONTRIBUTING.md).

First, we need to install the `honcho` and `langchain-nvidia-ai-endpoints` packages.
The `honcho` package is used to run Procfile-based applications, and is a Python port of [Foreman](https://github.com/ddollar/foreman).
The `langchain-nvidia-ai-endpoints` package is used to communicate with the Mock LLMs via LangChain.

```shell
# Install dependencies
$ poetry run pip install honcho langchain-nvidia-ai-endpoints
...
Successfully installed filetype-1.2.0 honcho-2.0.0 langchain-nvidia-ai-endpoints-0.3.19
```

Now we can start up the processes that are part of the [Procfile](Procfile).
As the Procfile processes spin up, they log to the console with a prefix: `system` is used by Honcho itself, `app_llm` is the Application (Main) LLM mock, `cs_llm` is the content-safety mock, and `gr` is the Guardrails service. We'll explore the Procfile in more detail below.
Once all three 'Uvicorn running on ...' messages are printed, you can move to the next step. Note that these messages are likely not on consecutive lines.

```
# All commands must be run in the nemoguardrails/benchmark directory
$ cd nemoguardrails/benchmark
$ poetry run honcho start
13:40:33 system | gr.1 started (pid=93634)
13:40:33 system | app_llm.1 started (pid=93635)
13:40:33 system | cs_llm.1 started (pid=93636)
...
13:40:41 app_llm.1 | INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
...
13:40:41 cs_llm.1 | INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
...
13:40:45 gr.1 | INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
```

Once Guardrails and the mock servers are up, we can use the `validate_mocks.py` script to check they're healthy and serving the correct models.

```shell
$ cd nemoguardrails/benchmark
$ poetry run python validate_mocks.py
Starting LLM endpoint health check...

--- Checking Port: 8000 ---
Checking http://localhost:8000/health ...
HTTP Request: GET http://localhost:8000/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8000/v1/models for 'meta/llama-3.3-70b-instruct'...
HTTP Request: GET http://localhost:8000/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'meta/llama-3.3-70b-instruct' in model list.
--- Port 8000: ALL CHECKS PASSED ---

--- Checking Port: 8001 ---
Checking http://localhost:8001/health ...
HTTP Request: GET http://localhost:8001/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8001/v1/models for 'nvidia/llama-3.1-nemoguard-8b-content-safety'...
HTTP Request: GET http://localhost:8001/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'nvidia/llama-3.1-nemoguard-8b-content-safety' in model list.
--- Port 8001: ALL CHECKS PASSED ---

--- Checking Port: 9000 (Rails Config) ---
Checking http://localhost:9000/v1/rails/configs ...
HTTP Request: GET http://localhost:9000/v1/rails/configs "HTTP/1.1 200 OK"
HTTP Status PASSED: Got 200.
Body Check PASSED: Response is an array with at least one entry.
--- Port 9000: ALL CHECKS PASSED ---

--- Final Summary ---
Port 8000 (meta/llama-3.3-70b-instruct): PASSED
Port 8001 (nvidia/llama-3.1-nemoguard-8b-content-safety): PASSED
Port 9000 (Rails Config): PASSED
---------------------
Overall Status: All endpoints are healthy!
```

Once the mocks and Guardrails are running and the script passes, we can issue curl requests against the Guardrails `/v1/chat/completions` endpoint to generate a response and test the system end-to-end.

```shell
curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "what can you do for me?"
      }
    ],
    "stream": false
  }' | jq
{
  "messages": [
    {
      "role": "assistant",
      "content": "I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."
    }
  ]
}
```
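
For scripted end-to-end checks, the same request can be issued from Python. Per the commit message, the repo's validation tooling uses `httpx`; the snippet below is a minimal sketch (not part of the repo) that mirrors the curl request above:

```python
import httpx

# Mirrors the curl request above; assumes the Guardrails server from the
# Procfile is listening on port 9000.
payload = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "what can you do for me?"}],
    "stream": False,
}
response = httpx.post(
    "http://localhost:9000/v1/chat/completions", json=payload, timeout=30.0
)
response.raise_for_status()
# The Guardrails response body is a top-level "messages" array, as shown above.
print(response.json()["messages"][0]["content"])
```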

------

## Deep-Dive: Configuration

In this section, we'll examine the configuration files used in the quickstart above. This gives more context on how the system works and how it can be extended as needed.

### Procfile

The [Procfile](Procfile?raw=true) contains all the processes that make up the application.
The Honcho package reads in this file, starts all the processes, and combines their logs on the console.
The `gr` line runs the Guardrails server on port 9000 and sets the default Guardrails configuration to [content_safety_colang1](configs/guardrail_configs/content_safety_colang1?raw=true).
The `app_llm` line runs the Application (Main) Mock LLM. Guardrails calls this LLM to generate a response to the user's query. This server uses 4 uvicorn workers and runs on port 8000. The configuration file here is a Mock LLM configuration, not a Guardrails configuration.
The `cs_llm` line runs the Content-Safety Mock LLM. This uses 4 uvicorn workers and runs on port 8001. Each line can also be run on its own for debugging, as shown below.
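
For example, to debug the Application LLM mock in isolation, you can run its Procfile command directly (copied verbatim from the Procfile):

```shell
$ poetry run python mock_llm_server/run_server.py --workers 4 --port 8000 \
    --config-file configs/mock_configs/meta-llama-3.3-70b-instruct.env
```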

### Guardrails Configuration

The [Guardrails Configuration](configs/guardrail_configs/content_safety_colang1/config.yml) is used by the Guardrails server.
Under the `models` section, the `main` model is used to generate responses to user queries. The base URL for this model points to the `app_llm` Mock LLM from the Procfile, running on port 8000. The `model` field has to match the Mock LLM model name.
The `content_safety` model is configured for use in an input and output rail. The `type` field matches the `$model` variable used in the input and output flows.
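
The README doesn't reproduce the file inline, but based on the description above, the `models` section has roughly the shape sketched below. This is a sketch only: the `engine` value and the exact flow names are assumptions, and the checked-in `config.yml` is authoritative.

```yaml
models:
  # Main LLM: generates responses to user queries; served by the app_llm mock.
  - type: main
    engine: nim  # assumption; check the real config.yml
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: http://localhost:8000/v1

  # Content-safety model: used by the input and output rails; the cs_llm mock.
  - type: content_safety
    engine: nim  # assumption; check the real config.yml
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    parameters:
      base_url: http://localhost:8001/v1

rails:
  input:
    flows:
      - content safety check input $model=content_safety
  output:
    flows:
      - content safety check output $model=content_safety
```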

### Mock LLM Endpoints

The Mock LLM implements a subset of the OpenAI LLM API.
There are two Mock LLM configurations: one for the Mock [main model](configs/mock_configs/meta-llama-3.3-70b-instruct.env), and another for the Mock [content-safety model](configs/mock_configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env).
The Mock LLM has the following OpenAI-compatible endpoints:

* `/health`: Returns a JSON object with the status set to healthy and a timestamp in seconds since the epoch. For example: `{"status":"healthy","timestamp":1762781239}`
* `/v1/models`: Returns the `MODEL` field from the Mock configuration (see below). For example: `{"object":"list","data":[{"id":"meta/llama-3.3-70b-instruct","object":"model","created":1762781290,"owned_by":"system"}]}`
* `/v1/completions`: Returns an [OpenAI completion object](https://platform.openai.com/docs/api-reference/completions/object) using the Mock configuration (see below).
* `/v1/chat/completions`: Returns an [OpenAI chat completion object](https://platform.openai.com/docs/api-reference/chat/object) using the Mock configuration (see below).
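
With the mocks running, you can exercise these endpoints directly. The expected bodies below are the examples given in the list above:

```shell
# Query the app_llm mock directly (port 8000, per the Procfile)
$ curl -s http://localhost:8000/health
{"status":"healthy","timestamp":1762781239}
$ curl -s http://localhost:8000/v1/models
{"object":"list","data":[{"id":"meta/llama-3.3-70b-instruct","object":"model","created":1762781290,"owned_by":"system"}]}
```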

### Mock LLM Configuration

Mock LLMs are configured using the `.env` file format. These files are passed to the Mock LLM using the `--config-file` argument.
The Mock LLMs return either a `SAFE_TEXT` or an `UNSAFE_TEXT` response to `/v1/completions` or `/v1/chat/completions` inference requests.
The probability of the `UNSAFE_TEXT` being returned is given by `UNSAFE_PROBABILITY`.
The latency of each response is also controllable, and works as follows (see the sketch after this list):

* Latency is first sampled from a normal distribution with mean `LATENCY_MEAN_SECONDS` and standard deviation `LATENCY_STD_SECONDS`.
* If the sampled value is less than `LATENCY_MIN_SECONDS`, it is set to `LATENCY_MIN_SECONDS`.
* If the sampled value is greater than `LATENCY_MAX_SECONDS`, it is set to `LATENCY_MAX_SECONDS`.
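
A minimal sketch of that sampling logic (illustrative only, not the mock server's actual code):

```python
import random

def sample_latency(mean: float, std: float, min_s: float, max_s: float) -> float:
    """Draw latency from a normal distribution, then clamp to [min_s, max_s]."""
    latency = random.gauss(mean, std)
    return min(max(latency, min_s), max_s)

# e.g. sample_latency(mean=0.5, std=0.2, min_s=0.1, max_s=2.0)
```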
The full list of configuration fields is shown below:

* `MODEL`: The model name served by the Mock LLM. This is returned by the `/v1/models` endpoint.
* `UNSAFE_PROBABILITY`: Probability of an unsafe response. This must be in the range [0, 1].
* `UNSAFE_TEXT`: String returned as an unsafe response.
* `SAFE_TEXT`: String returned as a safe response.
* `LATENCY_MIN_SECONDS`: Minimum latency in seconds.
* `LATENCY_MAX_SECONDS`: Maximum latency in seconds.
* `LATENCY_MEAN_SECONDS`: Mean of the normal distribution from which latency is sampled.
* `LATENCY_STD_SECONDS`: Standard deviation of the normal distribution from which latency is sampled.
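
Putting these together, a Mock LLM `.env` file looks like the sketch below. The values are illustrative only; the real files live in [configs/mock_configs](configs/mock_configs).

```
# Hypothetical mock-LLM .env sketch; values are illustrative
MODEL=meta/llama-3.3-70b-instruct
UNSAFE_PROBABILITY=0.1
SAFE_TEXT=I can help with a wide range of topics.
UNSAFE_TEXT=I can't assist with that request.
LATENCY_MIN_SECONDS=0.1
LATENCY_MAX_SECONDS=2.0
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.2
```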

nemoguardrails/benchmark/mock_llm_server/run_server.py

Lines changed: 8 additions & 2 deletions
@@ -71,7 +71,12 @@ def parse_arguments():
     parser.add_argument(
         "--config-file", help=".env file to configure model", required=True
     )
-
+    parser.add_argument(
+        "--workers",
+        type=int,
+        default=1,
+        help="Number of uvicorn worker processes (default: 1)",
+    )
     return parser.parse_args()


@@ -104,12 +109,13 @@ def main():  # pragma: no cover

     try:
         uvicorn.run(
-            "api:app",
+            "nemoguardrails.benchmark.mock_llm_server.api:app",
             host=args.host,
             port=args.port,
             reload=args.reload,
             log_level=args.log_level,
             env_file=config_file,
+            workers=args.workers,
         )
     except KeyboardInterrupt:
         log.info("\nServer stopped by user")
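
Note: switching to the full import string `"nemoguardrails.benchmark.mock_llm_server.api:app"` matters here because uvicorn requires the application to be passed as an import string, rather than an app object, when `workers` (or `reload`) is used, so that each worker process can import the app itself.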
