# Guardrails Benchmarking

NeMo Guardrails includes benchmarking tools to help users capacity-test their Guardrails applications.
Adding guardrails to an LLM-based application improves safety and security, but also adds some latency. These benchmarks let users quantify the tradeoff between security and latency and make data-driven decisions.
We currently provide a simple testbench that runs the Guardrails server with mocks standing in for the guardrail and application models. It can be used for performance testing on a laptop without any GPUs and completes in a few minutes.

-----

## Guardrails Core Benchmarking

This benchmark measures the performance of the Guardrails application running on a CPU-only laptop or instance.
It doesn't require GPUs to run local models, or internet access to use models hosted by providers.
All models use the [Mock LLM Server](mock_llm_server), a simplified stand-in for an LLM used for inference.
The aim of this benchmark is to detect performance regressions as quickly as running unit tests.

## Quickstart: Running Guardrails with Mock LLMs
To run Guardrails with mocks for both the content-safety and main LLMs, follow the steps below.
All commands must be run in the `nemoguardrails/benchmark` directory.
These steps assume you already have a working environment after following [CONTRIBUTING.md](../../CONTRIBUTING.md).

First, install the `honcho` and `langchain-nvidia-ai-endpoints` packages.
The `honcho` package runs Procfile-based applications and is a Python port of [Foreman](https://github.com/ddollar/foreman).
The `langchain-nvidia-ai-endpoints` package is used to communicate with the Mock LLMs via LangChain.
| 24 | + |
| 25 | +```shell |
| 26 | +# Install dependencies |
| 27 | +$ poetry run pip install honcho langchain-nvidia-ai-endpoints |
| 28 | +... |
| 29 | +Successfully installed filetype-1.2.0 honcho-2.0.0 langchain-nvidia-ai-endpoints-0.3.19 |
| 30 | +``` |

Now we can start the processes defined in the [Procfile](Procfile).
As the Procfile processes spin up, they log to the console with a prefix: `system` is used by Honcho itself, `app_llm` is the application (main) LLM mock, `cs_llm` is the content-safety mock, and `gr` is the Guardrails service. We'll explore the Procfile in more detail below.
Once all three 'Uvicorn running on ...' messages are printed, you can move to the next step. Note that these messages are likely not on consecutive lines.

```
# All commands must be run in the nemoguardrails/benchmark directory
$ cd nemoguardrails/benchmark
$ poetry run honcho start
13:40:33 system | gr.1 started (pid=93634)
13:40:33 system | app_llm.1 started (pid=93635)
13:40:33 system | cs_llm.1 started (pid=93636)
...
13:40:41 app_llm.1 | INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
...
13:40:41 cs_llm.1 | INFO: Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
...
13:40:45 gr.1 | INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
```

Once Guardrails and the mock servers are up, we can use the `validate_mocks.py` script to check that they're healthy and serving the correct models.

```shell
$ cd nemoguardrails/benchmark
$ poetry run python validate_mocks.py
Starting LLM endpoint health check...

--- Checking Port: 8000 ---
Checking http://localhost:8000/health ...
HTTP Request: GET http://localhost:8000/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8000/v1/models for 'meta/llama-3.3-70b-instruct'...
HTTP Request: GET http://localhost:8000/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'meta/llama-3.3-70b-instruct' in model list.
--- Port 8000: ALL CHECKS PASSED ---

--- Checking Port: 8001 ---
Checking http://localhost:8001/health ...
HTTP Request: GET http://localhost:8001/health "HTTP/1.1 200 OK"
Health Check PASSED: Status is 'healthy'.
Checking http://localhost:8001/v1/models for 'nvidia/llama-3.1-nemoguard-8b-content-safety'...
HTTP Request: GET http://localhost:8001/v1/models "HTTP/1.1 200 OK"
Model Check PASSED: Found 'nvidia/llama-3.1-nemoguard-8b-content-safety' in model list.
--- Port 8001: ALL CHECKS PASSED ---

--- Checking Port: 9000 (Rails Config) ---
Checking http://localhost:9000/v1/rails/configs ...
HTTP Request: GET http://localhost:9000/v1/rails/configs "HTTP/1.1 200 OK"
HTTP Status PASSED: Got 200.
Body Check PASSED: Response is an array with at least one entry.
--- Port 9000: ALL CHECKS PASSED ---

--- Final Summary ---
Port 8000 (meta/llama-3.3-70b-instruct): PASSED
Port 8001 (nvidia/llama-3.1-nemoguard-8b-content-safety): PASSED
Port 9000 (Rails Config): PASSED
---------------------
Overall Status: All endpoints are healthy!
```

Once the mocks and Guardrails are running and the validation script passes, we can issue a curl request against the Guardrails `/v1/chat/completions` endpoint to generate a response and test the system end-to-end.

```shell
curl -s -X POST http://0.0.0.0:9000/v1/chat/completions \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "what can you do for me?"
      }
    ],
    "stream": false
  }' | jq
{
  "messages": [
    {
      "role": "assistant",
      "content": "I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."
    }
  ]
}
```

------

## Deep-Dive: Configuration

In this section, we'll examine the configuration files used in the quickstart above. This gives more context on how the system works and how it can be extended as needed.

### Procfile

The [Procfile](Procfile?raw=true) lists all the processes that make up the application.
The Honcho package reads this file, starts all the processes, and combines their logs in the console.

* The `gr` line runs the Guardrails server on port 9000 and sets the default Guardrails configuration to [content_safety_colang1](configs/guardrail_configs/content_safety_colang1?raw=true).
* The `app_llm` line runs the application (main) Mock LLM. Guardrails calls this LLM to generate a response to the user's query. This server uses 4 uvicorn workers and runs on port 8000. The configuration file passed to it is a Mock LLM configuration, not a Guardrails configuration.
* The `cs_llm` line runs the content-safety Mock LLM. It also uses 4 uvicorn workers and runs on port 8001.
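
Each Procfile entry follows Honcho's `name: command` format, one process per line, and the names above (`gr`, `app_llm`, `cs_llm`) are what appear as log prefixes. Because Honcho mirrors Foreman's CLI, its standard subcommands can also be used to validate the Procfile or start a subset of the processes; the commands below assume that standard CLI and are illustrative rather than part of the benchmark tooling itself.

```shell
# Validate the Procfile (assumes honcho's standard `check` subcommand)
$ poetry run honcho check

# Start only the two Mock LLMs by their Procfile process names,
# e.g. to measure mock latency without the Guardrails server running
$ poetry run honcho start app_llm cs_llm
```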

### Guardrails Configuration
The [Guardrails configuration](configs/guardrail_configs/content_safety_colang1/config.yml) is used by the Guardrails server.
Under the `models` section, the `main` model is used to generate responses to user queries. The base URL for this model points to the `app_llm` Mock LLM from the Procfile, running on port 8000. The `model` field has to match the Mock LLM's model name.
The `content_safety` model is configured for use in an input and an output rail. Its `type` field matches the `$model` used in the input and output flows.
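
To make that structure concrete, here is a rough sketch of what the `models` and `rails` sections of such a configuration typically look like. The engine name, parameter names, and flow names below follow the usual NeMo Guardrails content-safety layout and are assumptions for illustration; the checked-in [config.yml](configs/guardrail_configs/content_safety_colang1/config.yml) is the authoritative version.

```yaml
models:
  # Main application LLM: served by the app_llm mock on port 8000
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: http://localhost:8000/v1

  # Content-safety model: served by the cs_llm mock on port 8001
  - type: content_safety
    engine: nvidia_ai_endpoints
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    parameters:
      base_url: http://localhost:8001/v1

rails:
  input:
    flows:
      - content safety check input $model=content_safety
  output:
    flows:
      - content safety check output $model=content_safety
```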
| 136 | + |
| 137 | +### Mock LLM Endpoints |
| 138 | +The Mock LLM implements a subset of the OpenAI LLM API. |
| 139 | +There are two Mock LLM configurations, one for the Mock [main model](configs/mock_configs/meta-llama-3.3-70b-instruct.env), and another for the Mock [content-safety](configs/mock_configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env) model. |
| 140 | +The Mock LLM has the following OpenAI-compatible endpoints: |
| 141 | + |
| 142 | +* `/health`: Returns a JSON object with status set to healthy and timestamp in seconds-since-epoch. For example `{"status":"healthy","timestamp":1762781239}` |
| 143 | +* `/v1/models`: Returns the `MODEL` field from the Mock configuration (see below). For example `{"object":"list","data":[{"id":"meta/llama-3.3-70b-instruct","object":"model","created":1762781290,"owned_by":"system"}]}` |
| 144 | +* `/v1/completions`: Returns an [OpenAI completion object](https://platform.openai.com/docs/api-reference/completions/object) using the Mock configuration (see below). |
| 145 | +* `/v1/chat/completions`: Returns an [OpenAI chat completion object](https://platform.openai.com/docs/api-reference/chat/object) using the Mock configuration (see below). |
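
Because the mocks listen on known ports and speak the OpenAI API, you can also query them directly with curl, which is handy for checking mock behaviour and latency in isolation from Guardrails. The requests below are examples based on the endpoints listed above; the response shapes are the OpenAI objects they return.

```shell
# List the model served by the main Mock LLM (port 8000)
$ curl -s http://localhost:8000/v1/models | jq

# Check the health of the content-safety Mock LLM (port 8001)
$ curl -s http://localhost:8001/health | jq

# Send a chat completion directly to the main Mock LLM, bypassing Guardrails
$ curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "hello"}]
  }' | jq
```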

### Mock LLM Configuration
Mock LLMs are configured using the `.env` file format. These files are passed to the Mock LLM using the `--config-file` argument.
The Mock LLMs return either a `SAFE_TEXT` or an `UNSAFE_TEXT` response to `/v1/completions` and `/v1/chat/completions` inference requests.
The probability of `UNSAFE_TEXT` being returned is given by `UNSAFE_PROBABILITY`.
The latency of each response is also controllable, and works as follows:

* Latency is first sampled from a normal distribution with mean `LATENCY_MEAN_SECONDS` and standard deviation `LATENCY_STD_SECONDS`.
* If the sampled value is less than `LATENCY_MIN_SECONDS`, it is set to `LATENCY_MIN_SECONDS`.
* If the sampled value is greater than `LATENCY_MAX_SECONDS`, it is set to `LATENCY_MAX_SECONDS`.

The full list of configuration fields is shown below; an example configuration follows the list.
* `MODEL`: The model name served by the Mock LLM. This is returned by the `/v1/models` endpoint.
* `UNSAFE_PROBABILITY`: Probability of an unsafe response. This must be in the range [0, 1].
* `UNSAFE_TEXT`: String returned as an unsafe response.
* `SAFE_TEXT`: String returned as a safe response.
* `LATENCY_MIN_SECONDS`: Minimum latency in seconds.
* `LATENCY_MAX_SECONDS`: Maximum latency in seconds.
* `LATENCY_MEAN_SECONDS`: Mean of the normal distribution from which latency is sampled.
* `LATENCY_STD_SECONDS`: Standard deviation of the normal distribution from which latency is sampled.
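
For illustration, a Mock LLM `.env` file combining these fields might look like the example below. The values are invented for this sketch; the files under `configs/mock_configs/` are the ones actually referenced by the Procfile.

```
# Illustrative Mock LLM configuration (example values only, not the checked-in files)
MODEL="meta/llama-3.3-70b-instruct"
SAFE_TEXT="I can help with a wide range of topics."
UNSAFE_TEXT="I can't assist with that request."
UNSAFE_PROBABILITY=0.1
LATENCY_MIN_SECONDS=0.1
LATENCY_MAX_SECONDS=2.0
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.2
```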