Hello and welcome to Benchmark Suite! This is a modern, header-only C++20 benchmarking library with cross-platform hardware performance counter integration, providing precise measurements of cycles, instructions, branches, cache behavior, and throughput with minimal overhead.
The following operating systems and compilers are officially supported:
Minimum Requirements:
- C++20 compliant compiler
- GCC 13+ | Clang 16+ | MSVC 2022+
- CUDA 11.0+ (for GPU benchmarking)
This guide will walk you through setting up and running benchmarks using benchmarksuite.
- Installation
- Basic Example
- Creating Benchmarks
- Adaptive Benchmarking
- Statistical Analysis
- CPU vs GPU Benchmarking
- Advanced Benchmark Methods
- Running Benchmarks
- Output and Results
- Features
- API Conventions
- Migrating from v1.0.0
Step 1: Add to vcpkg.json
Create or update your vcpkg.json in your project root:
{
"name": "your-project-name",
"version": "1.0.0",
"dependencies": [
"rtc-benchmarksuite"
]
}
Step 2: Configure CMake
In your CMakeLists.txt:
cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
find_package(benchmarksuite CONFIG REQUIRED)
add_executable(your_benchmark main.cpp)
target_link_libraries(your_benchmark PRIVATE benchmarksuite::benchmarksuite)
set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
Step 3: Configure with vcpkg toolchain
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake
cmake --build build --config Release
Step 4: Include in your code
#include <bnch_swt/index.hpp>
int main() {
return 0;
}
If not using vcpkg, you can include benchmarksuite as a header-only library:
Step 1: Clone the repository
git clone https://github.com/RealTimeChris/benchmarksuite.git
Step 2: Add to CMake
add_subdirectory(path/to/benchmarksuite)
target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)
Step 3: Include headers
#include <bnch_swt/index.hpp>
To use benchmarksuite, ensure you have a C++20 (or later) compliant compiler.
For CPU Benchmarking:
- MSVC 2022 or later
- GCC 13 or later
- Clang 16 or later
For GPU/CUDA Benchmarking:
- NVIDIA CUDA Toolkit 11.0 or later
- NVCC compiler
- CUDA-capable GPU
Windows:
- Use Visual Studio 2022 or later
- For CUDA: Install CUDA Toolkit from NVIDIA
Linux:
- Install build essentials:
sudo apt-get install build-essential
- For CUDA: Install CUDA Toolkit via package manager or NVIDIA installer
macOS:
- Install Xcode Command Line Tools
- CUDA support not available on Apple Silicon (M1/M2/M3)
Verify your installation with a simple test:
#include <bnch_swt/index.hpp>
#include <iostream>
int main() {
std::cout << "benchmarksuite successfully installed!" << std::endl;
return 0;
}
The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:
struct glz_to_chars_benchmark {
BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
std::vector<std::string>& test_values_00,
std::vector<std::string>& test_values_01) {
uint64_t bytes_processed = 0;
char newer_string[30]{};
for (uint64_t x = 0; x < test_values.size(); ++x) {
std::memset(newer_string, '\0', sizeof(newer_string));
auto new_ptr = glz::to_chars(newer_string, test_values[x]);
bytes_processed += test_values_00[x].size();
test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
}
return bytes_processed;
}
};
struct jsonifier_to_chars_benchmark {
BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
std::vector<std::string>& test_values_00,
std::vector<std::string>& test_values_01) {
uint64_t bytes_processed = 0;
char newer_string[30]{};
for (uint64_t x = 0; x < test_values.size(); ++x) {
std::memset(newer_string, '\0', sizeof(newer_string));
auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
bytes_processed += test_values_00[x].size();
test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
}
return bytes_processed;
}
};
int main() {
constexpr bnch_swt::stage_config config{
.max_execution_count = 200,
.measured_iteration_count = 25,
.benchmark_type = bnch_swt::benchmark_types::cpu,
.desired_percentage_deviation = 1.0,
.max_time_seconds = 5.5
};
constexpr uint64_t count = 512;
std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
std::vector<std::string> test_values_00;
std::vector<std::string> test_values_01(count);
for (uint64_t x = 0; x < count; ++x) {
test_values_00.emplace_back(std::to_string(test_values[x]));
}
using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", config>;
benchmark::run_benchmark<"conversion-test", "glz::to_chars", glz_to_chars_benchmark>(
test_values, test_values_00, test_values_01);
benchmark::run_benchmark<"conversion-test", "jsonifier::to_chars", jsonifier_to_chars_benchmark>(
test_values, test_values_00, test_values_01);
benchmark::print_results();
return 0;
}
To create a benchmark:
- Define your benchmark functions as structs with a static impl() method that returns uint64_t (bytes processed)
- Use bnch_swt::benchmark_stage with stage_config for configuration
- Call run_benchmark with test name, subject name, benchmark struct, and arguments
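The struct-with-static-impl() convention can be illustrated without the library. The sketch below is not benchmarksuite code: time_once and timed_result are hypothetical names, and the timing uses std::chrono rather than hardware counters. It only shows how a runner can accept a benchmark struct as a template argument, forward arguments to impl(), and read back the bytes processed.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>
#include <vector>

// Benchmark subject: a struct whose static impl() does the work and
// returns the number of bytes it processed.
struct sum_benchmark {
    static uint64_t impl(const std::vector<int>& data, int64_t& sink) {
        int64_t sum = 0;
        for (int value : data) {
            sum += value;
        }
        sink = sum; // publish the result so the loop cannot be discarded
        return data.size() * sizeof(int);
    }
};

// Hypothetical result type (an assumption, not a library type).
struct timed_result {
    uint64_t bytes_processed{};
    double nanoseconds{};
};

// Hypothetical stand-in for a runner: calls function_type::impl() once
// with the forwarded arguments and records the elapsed wall-clock time.
template<typename function_type, typename... arg_types>
timed_result time_once(arg_types&&... args) {
    const auto start = std::chrono::steady_clock::now();
    const uint64_t bytes = function_type::impl(std::forward<arg_types>(args)...);
    const auto stop = std::chrono::steady_clock::now();
    return { bytes, std::chrono::duration<double, std::nano>(stop - start).count() };
}
```

Calling time_once<sum_benchmark>(data, sink) with a 1000-element vector returns 1000 * sizeof(int) as bytes_processed, which is the figure a throughput calculation would divide by the elapsed time.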
The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:
template<bnch_swt::string_literal stage_name,
bnch_swt::stage_config stage_config_new = bnch_swt::stage_config{},
bnch_swt::string_literal metric_name = bnch_swt::string_literal<1>{}
>
struct benchmark_stage;
// Default configuration
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">;
// Custom configuration
constexpr bnch_swt::stage_config gpu_config{
.max_execution_count = 100,
.measured_iteration_count = 10,
.benchmark_type = bnch_swt::benchmark_types::cuda,
.clear_cpu_cache_between_each_iteration = false,
.clear_cpu_cache_before_all_iterations = true,
.desired_percentage_deviation = 0.5,
.max_time_seconds = 3.0
};
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", gpu_config>;
// Custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression", bnch_swt::stage_config{}, "compression-ratio">;
The stage_config struct controls benchmark behavior:
struct stage_config {
uint64_t max_execution_count{ 200 }; // Maximum iterations (including warmup)
uint64_t measured_iteration_count{ 10 }; // Number of iterations to measure
benchmark_types benchmark_type{ benchmark_types::cpu }; // CPU or CUDA
bool clear_cpu_cache_between_each_iteration{ false }; // Clear cache between iterations
bool clear_cpu_cache_before_all_iterations{ true }; // Clear cache before starting
double desired_percentage_deviation{ 1.0 }; // Target stability threshold (%)
double max_time_seconds{ 5.5 }; // Maximum runtime limit
};
run_benchmark() executes the benchmark using a struct with a static impl() method. The benchmark automatically scales iterations until statistical stability is reached or a configured limit (max_execution_count, max_time_seconds) is hit.
Parameters:
- test_name: String literal grouping related benchmarks together
- subject_name: String literal identifying this specific implementation
- function_type: Struct type with a static impl() method
- args...: Arguments forwarded to the impl() method
Returns: Reference to performance_metrics<benchmark_type> object
Example:
struct my_benchmark {
BNCH_SWT_HOST static uint64_t impl(std::vector<int>& data) {
uint64_t sum = 0;
for (auto& val : data) {
sum += val;
}
return data.size() * sizeof(int);
}
};
constexpr bnch_swt::stage_config config{ .max_execution_count = 500, .measured_iteration_count = 50 };
using bench = bnch_swt::benchmark_stage<"test", config>;
std::vector<int> data(1000);
bench::run_benchmark<"math-test", "my-implementation", my_benchmark>(data);
run_benchmark() can also execute a function or lambda directly (passed as a non-type template parameter).
Parameters:
- test_name: String literal grouping related benchmarks
- subject_name: String literal identifying this specific implementation
- function: Function or lambda to benchmark (as non-type template parameter)
- args...: Arguments forwarded to the function
Example:
constexpr auto my_lambda = [](std::vector<int>& data) -> uint64_t {
uint64_t sum = 0;
for (auto& val : data) {
sum += val;
}
return data.size() * sizeof(int);
};
constexpr bnch_swt::stage_config config{ .max_execution_count = 500, .measured_iteration_count = 50 };
using bench = bnch_swt::benchmark_stage<"test", config>;
std::vector<int> data(1000);
bench::run_benchmark<"math-test", "my-implementation", my_lambda>(data);
run_benchmark_from_host() executes CUDA benchmarks launched from host code.
Parameters:
- test_name: String literal grouping related benchmarks
- subject_name: String literal identifying this specific implementation
- function_type: Function type to benchmark
- bytes_processed: Number of bytes processed per iteration
- args...: Arguments forwarded to the function
Example:
struct cuda_host_launcher {
static void impl(float* gpu_data, uint64_t size) {
dim3 grid{256};
dim3 block{256};
my_kernel<<<grid, block>>>(gpu_data, size);
cudaDeviceSynchronize();
}
};
// Designated initializers must follow the member declaration order of stage_config.
constexpr bnch_swt::stage_config config{
.measured_iteration_count = 10,
.benchmark_type = bnch_swt::benchmark_types::cuda
};
using bench = bnch_swt::benchmark_stage<"cuda-test", config>;
float* gpu_data;
cudaMalloc(&gpu_data, 1024 * sizeof(float));
bench::run_benchmark_from_host<"kernel-test", "my-kernel", cuda_host_launcher>(
1024 * sizeof(float), gpu_data, 1024);
run_benchmark_cooperative() executes CUDA cooperative group kernels requiring grid-wide synchronization.
Parameters:
- test_name: String literal grouping related benchmarks
- subject_name: String literal identifying this specific implementation
- function: Function to benchmark (as non-type template parameter)
- args...: Arguments forwarded to the function
print_results() displays performance metrics with statistical analysis, rankings, and confidence intervals.
Parameters:
- metrics_presence: Template parameter controlling which metrics to display
- show_metrics: Whether to show detailed hardware counter metrics
Example:
benchmark::print_results(); // Default metrics
// Custom metric selection
bnch_swt::performance_metrics_presence<bnch_swt::benchmark_types::cpu> custom_metrics{};
custom_metrics.throughput_mb_per_sec = true;
custom_metrics.cycles_per_byte = true;
custom_metrics.instructions_per_cycle = true;
benchmark::print_results<custom_metrics>(true);
generate_markdown() generates a formatted Markdown report of all benchmark results.
Parameters:
- title: Title for the report
- file_path: Optional directory path to save the report (auto-named with OS and compiler)
Returns: std::string containing the Markdown report
Example:
auto report = benchmark::generate_markdown("Performance Analysis", "./results/");
std::cout << report << std::endl;
get_all_results() returns all results organized by test name.
Returns: std::vector<stage_results<stage_name, benchmark_type>::test_results>
get_test_results() returns results for a specific test name.
Returns: std::unordered_map<std::string_view, performance_metrics<benchmark_type>>
clear_all_results() resets all collected results for the stage.
Benchmark functions must be defined as structs with a static impl() method:
For CPU benchmarks:
struct my_cpu_benchmark {
BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
uint64_t bytes_processed = /* calculate bytes */;
return bytes_processed;
}
};
For CUDA benchmarks:
struct my_cuda_benchmark {
BNCH_SWT_DEVICE static void impl(/* your parameters */) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// kernel code here
}
};
Key differences:
- CPU: impl() returns uint64_t (bytes processed) and uses BNCH_SWT_HOST
- CUDA: impl() returns void, uses BNCH_SWT_DEVICE, and contains kernel code
- CUDA: Bytes processed is passed as a parameter to run_benchmark_from_host()
As of v1.0.2, benchmarksuite features adaptive iteration scaling that automatically determines the optimal number of iterations for statistical stability.
- Starts with small iteration count: Begins with measured_iteration_count * 2 iterations
- Sliding window analysis: Evaluates all consecutive windows of measured_iteration_count iterations
- Stability detection: Continues until throughput deviation ≤ desired_percentage_deviation
- Iteration doubling: Doubles iteration count each round until stability or limits reached
- Time protection: Automatically stops after max_time_seconds to prevent excessively long runs
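The steps above can be sketched as a small standalone loop. This is an illustration under stated assumptions, not the library's implementation: "deviation" is assumed to mean the standard deviation as a percentage of the mean, each round is assumed to collect a fresh set of samples, and the max_time_seconds check is left out to keep the example deterministic. The names percentage_deviation, best_window_deviation, adaptive_outcome, and run_adaptive are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed stability metric: standard deviation as a percentage of the mean.
inline double percentage_deviation(const double* first, std::size_t count) {
    double mean = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        mean += first[i];
    }
    mean /= static_cast<double>(count);
    double variance = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        variance += (first[i] - mean) * (first[i] - mean);
    }
    variance /= static_cast<double>(count);
    return mean == 0.0 ? 0.0 : std::sqrt(variance) / mean * 100.0;
}

// Lowest deviation over every consecutive window of `window` samples.
inline double best_window_deviation(const std::vector<double>& samples, std::size_t window) {
    assert(window > 0 && samples.size() >= window);
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t start = 0; start + window <= samples.size(); ++start) {
        best = std::min(best, percentage_deviation(samples.data() + start, window));
    }
    return best;
}

struct adaptive_outcome {
    uint64_t iterations;
    double deviation;
    bool stable;
};

// measure() returns one throughput sample per call. Starts at window * 2
// iterations and doubles each round until stable or max_iterations is hit.
template<typename measure_type>
adaptive_outcome run_adaptive(measure_type&& measure, uint64_t window, uint64_t max_iterations, double target_deviation) {
    assert(window > 0 && max_iterations >= window);
    uint64_t iterations = window * 2;
    while (true) {
        iterations = std::min(iterations, max_iterations);
        std::vector<double> samples;
        samples.reserve(iterations);
        for (uint64_t i = 0; i < iterations; ++i) {
            samples.push_back(measure());
        }
        adaptive_outcome outcome{ iterations, best_window_deviation(samples, window), false };
        outcome.stable = outcome.deviation <= target_deviation;
        if (outcome.stable || iterations >= max_iterations) {
            return outcome;
        }
        iterations *= 2;
    }
}
```

Because the best window is taken over all consecutive windows, a few noisy samples at the start of a round do not prevent the round from being judged stable, as long as one run of `window` consecutive samples stays within the target.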
constexpr bnch_swt::stage_config precise_config{
.max_execution_count = 10000, // Upper limit
.measured_iteration_count = 50, // Window size for analysis
.desired_percentage_deviation = 0.5, // Target: 0.5% stability
.max_time_seconds = 10.0 // Max 10 seconds per benchmark
};
using bench = bnch_swt::benchmark_stage<"precise-benchmark", precise_config>;
- No more guessing: No need to manually tune iteration counts
- Comparable results: All benchmarks achieve similar statistical confidence
- Time-efficient: Stops early for stable code, continues longer for noisy measurements
- Reproducible: Same configuration produces consistent stability across runs
Benchmark results now include 95% confidence interval analysis with automatic tie detection and ranking.
- Confidence intervals: Calculated from throughput deviation percentages
- Statistical tie detection: Identifies when implementations are statistically indistinguishable
- Automated ranking: Orders results with proper tie handling
- Win/loss/tie tracking: Summary statistics across multiple tests
- Markdown export: Professional reports for documentation
=== STATISTICAL SUMMARY FOR int-to-string-comparison ===
(95% confidence intervals, statistical ties don't count as wins)
jsonifier::to_chars: 1 wins
glz::to_chars: 0 wins (1 second place)
=== STATISTICAL TIES (no clear winner) ===
fast_float: 2 tests where statistically tied for first
When two implementations have overlapping confidence intervals, they are considered statistically tied - neither is significantly faster than the other. This is reported clearly to prevent over-interpretation of small performance differences.
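The overlap rule can be written down in a few lines. The sketch below assumes the interval is simply throughput ± the reported deviation percentage; the document only says the intervals are "calculated from throughput deviation percentages", so the exact formula used by the library may differ. The names measured_throughput, statistically_tied, and clearly_faster are hypothetical.

```cpp
#include <cassert>
#include <string>

// One result line: throughput and its reported +/- deviation in percent.
struct measured_throughput {
    std::string name;
    double mb_per_sec;
    double deviation_percent;

    // Assumed interval bounds: throughput scaled by (1 -/+ deviation).
    double lower() const {
        return mb_per_sec * (1.0 - deviation_percent / 100.0);
    }
    double upper() const {
        return mb_per_sec * (1.0 + deviation_percent / 100.0);
    }
};

// Two results tie when their intervals overlap.
inline bool statistically_tied(const measured_throughput& a, const measured_throughput& b) {
    return a.lower() <= b.upper() && b.lower() <= a.upper();
}

// a wins only when its whole interval lies above b's interval.
inline bool clearly_faster(const measured_throughput& a, const measured_throughput& b) {
    return a.lower() > b.upper();
}
```

Under this assumption, 84.58 MB/s ± 1.23% gives roughly [83.54, 85.62] and 75.95 MB/s ± 2.17% gives roughly [74.30, 77.60]. The intervals do not overlap, so the first result counts as a win rather than a tie.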
Benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_type enum in stage_config.
struct cpu_computation_benchmark {
BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
for (size_t i = 0; i < input.size(); ++i) {
output[i] = std::sqrt(input[i] * input[i] + 1.0f);
}
return input.size() * sizeof(float);
}
};
// Designated initializers must follow the member declaration order of stage_config.
constexpr bnch_swt::stage_config cpu_config{
.max_execution_count = 200,
.measured_iteration_count = 25,
.benchmark_type = bnch_swt::benchmark_types::cpu
};
using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", cpu_config>;
constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);
cpu_stage::run_benchmark<"math-test", "cpu-impl", cpu_computation_benchmark>(input, output);
cpu_stage::print_results();
struct cuda_kernel_benchmark {
BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
data[idx] = data[idx] * 2.0f;
}
}
};
// Designated initializers must follow the member declaration order of stage_config.
constexpr bnch_swt::stage_config gpu_config{
.max_execution_count = 100,
.measured_iteration_count = 10,
.benchmark_type = bnch_swt::benchmark_types::cuda,
.max_time_seconds = 5.0
};
using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", gpu_config>;
constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));
dim3 grid{256, 1, 1};
dim3 block{256, 1, 1};
uint64_t bytes_processed = data_size * sizeof(float);
cuda_stage::run_benchmark_from_host<"kernel-test", "gpu-impl", cuda_kernel_benchmark>(
bytes_processed, grid, block, 0, gpu_data, data_size);
cuda_stage::print_results();
cudaFree(gpu_data);
Compare CPU and GPU implementations side-by-side:
constexpr uint64_t data_size = 1024 * 1024;
// CPU implementation
struct cpu_process {
BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
for (size_t i = 0; i < cpu_data.size(); ++i) {
cpu_data[i] = cpu_data[i] * 2.0f;
}
return cpu_data.size() * sizeof(float);
}
};
// GPU implementation
struct gpu_process {
BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
gpu_data[idx] = gpu_data[idx] * 2.0f;
}
}
};
// Run both
constexpr bnch_swt::stage_config config{ .max_execution_count = 100, .measured_iteration_count = 10 };
using stage = bnch_swt::benchmark_stage<"cpu-vs-gpu", config>;
std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));
stage::run_benchmark<"vector-multiply", "cpu-version", cpu_process>(cpu_data);
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
stage::run_benchmark_from_host<"vector-multiply", "gpu-version", gpu_process>(
data_size * sizeof(float), grid, block, 0, gpu_data, data_size);
stage::print_results();
cudaFree(gpu_data);
For accurate cold-cache CPU benchmarks:
// Designated initializers must follow the member declaration order of stage_config.
constexpr bnch_swt::stage_config cold_cache_config{
.max_execution_count = 200,
.measured_iteration_count = 25,
.clear_cpu_cache_between_each_iteration = true,
.clear_cpu_cache_before_all_iterations = true
};
using cold_bench = bnch_swt::benchmark_stage<"cache-test", cold_cache_config>;
Specify custom metric names for specialized benchmarks:
constexpr bnch_swt::stage_config config{ .max_execution_count = 200, .measured_iteration_count = 25 };
using compression_bench = bnch_swt::benchmark_stage<"compression-test", config, "compression-ratio">;
struct compress_benchmark {
BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
auto compressed = compress_data(input);
return (input.size() * 1000) / compressed.size(); // Ratio * 1000
}
};
compression_bench::run_benchmark<"compression", "my-compressor", compress_benchmark>(input_data);
compression_bench::print_results(); // Shows "compression-ratio" instead of MB/s
Use run_benchmark_from_host() for custom CUDA kernel configurations:
struct custom_launcher {
static void impl(float* data, uint64_t size, int shared_bytes) {
dim3 grid{static_cast<unsigned int>((size + 255) / 256)};
dim3 block{256};
my_kernel<<<grid, block, shared_bytes>>>(data, size);
cudaDeviceSynchronize();
}
};
constexpr bnch_swt::stage_config config{ .benchmark_type = bnch_swt::benchmark_types::cuda };
using bench = bnch_swt::benchmark_stage<"custom-kernel", config>;
bench::run_benchmark_from_host<"launch-test", "custom", custom_launcher>(
data_size * sizeof(float), gpu_data, data_size, 4096);
Use run_benchmark_cooperative() for kernels requiring grid-wide sync:
constexpr auto cooperative_reduce = [](float* data, float* result, uint64_t size) -> uint64_t {
cooperative_groups::grid_group grid = cooperative_groups::this_grid();
// reduction logic here
grid.sync();
return size * sizeof(float);
};
bench::run_benchmark_cooperative<"reduce-test", "grid-reduce", cooperative_reduce>(
grid, block, shared_mem, stream, bytes_processed, gpu_data, gpu_result, size);
With vcpkg + CMake (recommended):
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark                # Linux/macOS
.\build\Release\your_benchmark.exe    # Windows
Manual CMake build:
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark
For CUDA benchmarks, specify target architecture:
cmake -B build -S . \
-DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release
- -DCMAKE_BUILD_TYPE=Release: Build optimized release version
- -DCMAKE_CUDA_ARCHITECTURES=86: Target specific CUDA compute capability (e.g., 86 for RTX 30xx, 89 for RTX 40xx)
- -DCMAKE_CXX_COMPILER=clang++: Specify C++ compiler
- -DCMAKE_CUDA_COMPILER=nvcc: Specify CUDA compiler
Project structure:
my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
├── cpu_benchmark.hpp
└── gpu_benchmark.cuh
CMakeLists.txt:
cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
find_package(benchmarksuite CONFIG REQUIRED)
add_executable(my_benchmark
main.cpp
benchmarks/cpu_benchmark.hpp
benchmarks/gpu_benchmark.cuh
)
target_link_libraries(my_benchmark PRIVATE
benchmarksuite::benchmarksuite
)
set_target_properties(my_benchmark PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
CUDA_RESOLVE_DEVICE_SYMBOLS ON
)
if(MSVC)
target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()
vcpkg.json:
{
"name": "my-benchmark",
"version": "1.0.0",
"dependencies": [
"rtc-benchmarksuite"
]
}
----------------------------------------
CPU Performance Metrics for Stage: int-to-string-comparison
Running on: AMD Ryzen 9 7950X 16-Core Processor
OS: Linux 6.8.0
Compiler: GNU 13.2.0
----------------------------------------
Test: conversion-test
----------------------------------------
1. jsonifier::to_chars (84.58 MB/s +/-1.23%) | ~11.36% faster than glz::to_chars
----------------------------------------
Metrics for: jsonifier::to_chars
Total Iterations to Stabilize: 394
Measured Iterations: 20
Bytes Processed: 512.00
Nanoseconds per Execution: 5785.25
Frequency (GHz): 4.83
Throughput (MB/s): 84.58
Throughput Percentage Deviation (+/-%): 1.23
Cycles per Execution: 27921.20
Cycles per Byte: 54.53
Instructions per Execution: 52026.00
Instructions per Cycle: 1.86
Instructions per Byte: 101.61
----------------------------------------
2. glz::to_chars (75.95 MB/s +/-2.17%)
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize: 421
Measured Iterations: 20
Bytes Processed: 512.00
Nanoseconds per Execution: 6480.30
Throughput (MB/s): 75.95
Throughput Percentage Deviation (+/-%): 2.17
Cycles per Execution: 30314.40
Cycles per Byte: 59.21
Instructions per Execution: 51513.00
Instructions per Cycle: 1.70
----------------------------------------
=== STATISTICAL SUMMARY FOR int-to-string-comparison ===
(95% confidence intervals, statistical ties don't count as wins)
jsonifier::to_chars: 1 wins
glz::to_chars: 0 wins (1 second place)
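Most of the per-subject figures in the sample output follow from four raw values: bytes processed, nanoseconds, cycles, and instructions per execution. The sketch below reproduces those relationships as plain arithmetic. raw_counters, derived_metrics, and derive are hypothetical names, and the throughput unit is assumed to be MiB/s (bytes divided by 1024 * 1024), which the output does not state explicitly.

```cpp
#include <cassert>
#include <cmath>

// Raw per-execution values, named after the lines in the sample output.
struct raw_counters {
    double bytes_processed;
    double nanoseconds;
    double cycles;
    double instructions;
};

struct derived_metrics {
    double cycles_per_byte;
    double instructions_per_cycle;
    double instructions_per_byte;
    double frequency_ghz;
    double throughput_mib_per_sec;
};

inline derived_metrics derive(const raw_counters& raw) {
    assert(raw.bytes_processed > 0 && raw.nanoseconds > 0 && raw.cycles > 0);
    derived_metrics out{};
    out.cycles_per_byte = raw.cycles / raw.bytes_processed;
    out.instructions_per_cycle = raw.instructions / raw.cycles;
    out.instructions_per_byte = raw.instructions / raw.bytes_processed;
    out.frequency_ghz = raw.cycles / raw.nanoseconds; // cycles per ns equals GHz
    // Assumed unit: bytes per second divided by 1024 * 1024.
    out.throughput_mib_per_sec = raw.bytes_processed / (raw.nanoseconds * 1e-9) / (1024.0 * 1024.0);
    return out;
}
```

Feeding in the jsonifier::to_chars values (512 bytes, 5785.25 ns, 27921.20 cycles, 52026 instructions) gives about 54.53 cycles per byte, 1.86 instructions per cycle, 101.61 instructions per byte, and 4.83 GHz, matching the listing. The throughput comes out near 84.40, close to but not identical to the reported 84.58; the gap may come from how per-iteration samples are averaged, which the output does not show.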
auto report = benchmark::generate_markdown("Performance Analysis", "./results/");
Generates formatted Markdown files with complete statistical analysis, perfect for CI/CD documentation.
- CPU Benchmarking: Traditional CPU performance measurement with hardware counters
- GPU/CUDA Benchmarking: Native CUDA kernel benchmarking with grid/block configuration
- Mixed Workloads: Compare CPU vs GPU implementations side-by-side
- Automatic Device Selection: Choose benchmark type via stage_config
- Automatic iteration scaling: Dynamically increases iterations until statistical stability
- Sliding window analysis: Finds optimal consecutive block of iterations
- Percentage deviation targeting: Configurable stability threshold (default: 1%)
- Time-based termination: Maximum runtime protection (default: 5.5 seconds)
- Confidence intervals: 95% confidence intervals for throughput comparisons
- Statistical tie detection: Automatically identifies indistinguishable implementations
- Automated ranking: Orders results with proper tie handling
- Win/loss/tie tracking: Summary statistics across multiple tests
- Markdown export: Professional reports for documentation
- Standard Benchmarking: Default run_benchmark() with adaptive iteration scaling
- Host-Launched Kernels: run_benchmark_from_host() for custom kernel launch configurations
- Cooperative Groups: run_benchmark_cooperative() for grid-wide synchronization
- Function or Struct: Support for both function-based and struct-based benchmarks
- Cache Clearing: Optional cache eviction between iterations for cold-cache benchmarks
- Custom Metrics: Define custom metric names for specialized benchmarks
- Configurable Iterations: Control over warmup iterations and measured iterations via stage_config
- Programmatic Access: Retrieve raw performance metrics via get_test_results()
- Selective Metric Display: Customize which metrics are shown in output
- CPU Properties: Comprehensive CPU detection and automatic reporting
- GPU Properties: CUDA device detection and reporting
- Compiler Info: Automatic compiler ID and version capture
- OS Detection: Operating system name and version in results
- Cross-platform CPU counters: Windows, Linux, macOS, Apple ARM
- CUDA performance events: GPU-specific performance monitoring
As of v1.0.0, all APIs follow snake_case naming convention:
- Functions: do_not_optimize_away(), generate_random_integers(), print_results()
- Types: size_type, string_literal, stage_config
- Variables: bytes_processed, test_values
If you're upgrading from v1.0.0 to v1.0.2:
Old:
bnch_swt::benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>
New:
constexpr bnch_swt::stage_config config{
.max_execution_count = 200,
.measured_iteration_count = 25,
.benchmark_type = bnch_swt::benchmark_types::cpu
};
bnch_swt::benchmark_stage<"test", config>
Old:
benchmark_stage::run_benchmark<"subject_name", function_type>(args...)
New:
benchmark_stage::run_benchmark<"test_name", "subject_name", function_type>(args...)
Old:
benchmark_stage::run_from_host<"subject_name", function_type>(bytes_processed, args...)
New:
benchmark_stage::run_benchmark_from_host<"test_name", "subject_name", function_type>(
bytes_processed, args...)
Old:
benchmark_stage::print_results(true, true)
New:
benchmark_stage::print_results(true) // show_metrics only
Old:
auto results = benchmark_stage::get_results(); // vector<performance_metrics>
New:
auto all = benchmark_stage::get_all_results(); // vector<test_results>
auto test = benchmark_stage::get_test_results("test_name"); // unordered_map<string_view, metrics>
benchmark_stage::clear_all_results(); // New method
constexpr bnch_swt::stage_config adaptive_config{
.max_execution_count = 10000,
.measured_iteration_count = 50,
.desired_percentage_deviation = 0.5, // New: target stability
.max_time_seconds = 10.0 // New: time limit
};
Happy benchmarking with benchmarksuite v1.0.2! 🚀
For issues, feature requests, or contributions, please visit the GitHub repository.