Benchmark Suite

Benchmark Suite is a modern, header-only C++20 benchmarking library with cross-platform hardware performance counter integration. It measures cycles, instructions, branches, cache behavior, and throughput with minimal overhead.

The following operating systems and compilers are officially supported:

Compiler Support

Minimum Requirements:

C++20 compliant compiler
GCC 13+ | Clang 16+ | MSVC 2022+
CUDA 11.0+ (for GPU benchmarking)

Operating System Support

Quickstart Guide for benchmarksuite v1.0.2

This guide will walk you through setting up and running benchmarks using benchmarksuite.

Installation

Method 1: vcpkg + CMake (Recommended)

Step 1: Add to vcpkg.json

Create or update your vcpkg.json in your project root:

{
  "name": "your-project-name",
  "version": "1.0.0",
  "dependencies": [
    "rtc-benchmarksuite"
  ]
}

Step 2: Configure CMake

In your CMakeLists.txt:

cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

find_package(benchmarksuite CONFIG REQUIRED)

add_executable(your_benchmark main.cpp)

target_link_libraries(your_benchmark PRIVATE benchmarksuite::benchmarksuite)

set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

Step 3: Configure with vcpkg toolchain

cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake

cmake --build build --config Release

Step 4: Include in your code

#include <bnch_swt/index.hpp>

int main() {
    return 0;
}

Method 2: Manual Installation

If not using vcpkg, you can include benchmarksuite as a header-only library:

Step 1: Clone the repository

git clone https://github.com/RealTimeChris/benchmarksuite.git

Step 2: Add to CMake

add_subdirectory(path/to/benchmarksuite)

target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)

Step 3: Include headers

#include <bnch_swt/index.hpp>

Requirements

To use benchmarksuite, ensure you have a C++20 (or later) compliant compiler.

For CPU Benchmarking:

MSVC 2022 or later
GCC 13 or later
Clang 16 or later

For GPU/CUDA Benchmarking:

NVIDIA CUDA Toolkit 11.0 or later
NVCC compiler
CUDA-capable GPU

Platform-Specific Notes

Windows:

Use Visual Studio 2022 or later
For CUDA: Install CUDA Toolkit from NVIDIA

Linux:

Install build essentials (Debian/Ubuntu): sudo apt-get install build-essential
For CUDA: Install CUDA Toolkit via package manager or NVIDIA installer

macOS:

Install Xcode Command Line Tools
CUDA support not available on Apple Silicon (M1/M2/M3)

Verification

Verify your installation with a simple test:

#include <bnch_swt/index.hpp>
#include <iostream>

int main() {
    std::cout << "benchmarksuite successfully installed!" << std::endl;
    return 0;
}

Basic Example

The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:

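#include <bnch_swt/index.hpp>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>
// Also include the Glaze and Jsonifier headers that provide glz::to_chars
// and jsonifier_internal::to_chars.

struct glz_to_chars_benchmark {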
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values, 
                                        std::vector<std::string>& test_values_00,
                                        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = glz::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
        }
        return bytes_processed;
    }
};

struct jsonifier_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
                                        std::vector<std::string>& test_values_00,
                                        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
        }
        return bytes_processed;
    }
};

int main() {
    constexpr bnch_swt::stage_config config{
        .max_execution_count = 200,
        .measured_iteration_count = 25,
        .benchmark_type = bnch_swt::benchmark_types::cpu,
        .desired_percentage_deviation = 1.0,
        .max_time_seconds = 5.5
    };
    
    constexpr uint64_t count = 512;
    
    std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
    std::vector<std::string> test_values_00;
    std::vector<std::string> test_values_01(count);
    
    for (uint64_t x = 0; x < count; ++x) {
        test_values_00.emplace_back(std::to_string(test_values[x]));
    }
    
    using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", config>;
    
    benchmark::run_benchmark<"conversion-test", "glz::to_chars", glz_to_chars_benchmark>(
        test_values, test_values_00, test_values_01);
    benchmark::run_benchmark<"conversion-test", "jsonifier::to_chars", jsonifier_to_chars_benchmark>(
        test_values, test_values_00, test_values_01);
    
    benchmark::print_results();
    
    return 0;
}

Creating Benchmarks

To create a benchmark:

Define your benchmark functions as structs with a static impl() method that returns uint64_t (bytes processed)
Use bnch_swt::benchmark_stage with stage_config for configuration
Call run_benchmark with test name, subject name, benchmark struct, and arguments
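
To make the static impl() convention concrete, here is a minimal sketch of how a harness could call such a struct generically and time one execution. It is not benchmarksuite's implementation: time_once, timed_result, and sum_benchmark are hypothetical names, and std::chrono stands in for the hardware counters the library uses.

```cpp
#include <chrono>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical benchmark struct following the static impl() convention.
struct sum_benchmark {
    static uint64_t impl(const std::vector<int>& data, long long& out) {
        out = std::accumulate(data.begin(), data.end(), 0LL);
        return data.size() * sizeof(int);  // bytes processed
    }
};

// Hypothetical result type; not part of benchmarksuite.
struct timed_result {
    uint64_t bytes_processed;
    double nanoseconds;
};

// Calls function_type::impl(args...) once and measures the wall-clock time.
template<typename function_type, typename... arg_types>
timed_result time_once(arg_types&&... args) {
    const auto start = std::chrono::steady_clock::now();
    const uint64_t bytes = function_type::impl(std::forward<arg_types>(args)...);
    const auto stop = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return { bytes, ns };
}
```

Because the struct is passed as a type, the call to impl() is resolved at compile time, and the returned byte count is what lets a harness turn a time into a throughput figure.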

Benchmark Stage

The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:

template<bnch_swt::string_literal stage_name,
         bnch_swt::stage_config stage_config_new = bnch_swt::stage_config{},
         bnch_swt::string_literal metric_name = bnch_swt::string_literal<1>{}
>
struct benchmark_stage;

// Default configuration
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">;

// Custom configuration
constexpr bnch_swt::stage_config gpu_config{
    .max_execution_count = 100,
    .measured_iteration_count = 10,
    .benchmark_type = bnch_swt::benchmark_types::cuda,
    .clear_cpu_cache_between_each_iteration = false,
    .clear_cpu_cache_before_all_iterations = true,
    .desired_percentage_deviation = 0.5,
    .max_time_seconds = 3.0
};
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", gpu_config>;

// Custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression", bnch_swt::stage_config{}, "compression-ratio">;

Stage Configuration

The stage_config struct controls benchmark behavior:

struct stage_config {
    uint64_t max_execution_count{ 200 };                    // Maximum iterations (including warmup)
    uint64_t measured_iteration_count{ 10 };                // Number of iterations to measure
    benchmark_types benchmark_type{ benchmark_types::cpu }; // CPU or CUDA
    bool clear_cpu_cache_between_each_iteration{ false };   // Clear cache between iterations
    bool clear_cpu_cache_before_all_iterations{ true };     // Clear cache before starting
    double desired_percentage_deviation{ 1.0 };              // Target stability threshold (%)
    double max_time_seconds{ 5.5 };                         // Maximum runtime limit
};
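
One C++20 detail matters when writing these configurations: designated initializers may skip fields, but the fields that are named must appear in declaration order. The sketch below uses a local stand-in (stage_config_sketch, a hypothetical name) that mirrors the field order listed above, so it compiles without the library.

```cpp
#include <cstdint>

// Local stand-ins that mirror the documented field order; the real types are
// bnch_swt::benchmark_types and bnch_swt::stage_config.
enum class benchmark_types { cpu, cuda };

struct stage_config_sketch {
    uint64_t max_execution_count{ 200 };
    uint64_t measured_iteration_count{ 10 };
    benchmark_types benchmark_type{ benchmark_types::cpu };
    bool clear_cpu_cache_between_each_iteration{ false };
    bool clear_cpu_cache_before_all_iterations{ true };
    double desired_percentage_deviation{ 1.0 };
    double max_time_seconds{ 5.5 };
};

// Skipped fields keep their defaults; named fields follow declaration order.
constexpr stage_config_sketch cuda_config{
    .measured_iteration_count = 25,
    .benchmark_type = benchmark_types::cuda,
    .max_time_seconds = 3.0
};
```

Writing .benchmark_type before .measured_iteration_count would be rejected by a conforming C++20 compiler.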

Methods

`run_benchmark<test_name, subject_name, function_type>(args...)`

Executes the benchmark using a struct with a static impl() method. The benchmark automatically scales iterations until statistical stability is reached.

Parameters:

test_name: String literal grouping related benchmarks together
subject_name: String literal identifying this specific implementation
function_type: Struct type with a static impl() method
args...: Arguments forwarded to the impl() method

Returns: Reference to performance_metrics<benchmark_type> object

Example:

struct my_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int>& data) {
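        // Note: sum is never read, so an optimizer may drop the loop; make the
        // result observable (for example with do_not_optimize_away()) in real benchmarks.
        uint64_t sum = 0;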
        for (auto& val : data) {
            sum += val;
        }
        return data.size() * sizeof(int);
    }
};

constexpr bnch_swt::stage_config config{ .max_execution_count = 500, .measured_iteration_count = 50 };
using bench = bnch_swt::benchmark_stage<"test", config>;
std::vector<int> data(1000);
bench::run_benchmark<"math-test", "my-implementation", my_benchmark>(data);

`run_benchmark<test_name, subject_name, function>(args...)`

Executes the benchmark using a function or lambda directly (passed as non-type template parameter).

Parameters:

test_name: String literal grouping related benchmarks
subject_name: String literal identifying this specific implementation
function: Function or lambda to benchmark (as non-type template parameter)
args...: Arguments forwarded to the function

Example:

constexpr auto my_lambda = [](std::vector<int>& data) -> uint64_t {
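    // Note: sum is never read, so an optimizer may drop the loop; make the
    // result observable (for example with do_not_optimize_away()) in real benchmarks.
    uint64_t sum = 0;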
    for (auto& val : data) {
        sum += val;
    }
    return data.size() * sizeof(int);
};

constexpr bnch_swt::stage_config config{ .max_execution_count = 500, .measured_iteration_count = 50 };
using bench = bnch_swt::benchmark_stage<"test", config>;
std::vector<int> data(1000);
bench::run_benchmark<"math-test", "my-implementation", my_lambda>(data);

`run_benchmark_from_host<test_name, subject_name, function_type>(bytes_processed, args...)`

Executes CUDA benchmarks launched from host code.

Parameters:

test_name: String literal grouping related benchmarks
subject_name: String literal identifying this specific implementation
function_type: Function type to benchmark
bytes_processed: Number of bytes processed per iteration
args...: Arguments forwarded to the function

Example:

struct cuda_host_launcher {
    static void impl(float* gpu_data, uint64_t size) {
        dim3 grid{256};
        dim3 block{256};
        my_kernel<<<grid, block>>>(gpu_data, size);
        cudaDeviceSynchronize();
    }
};

constexpr bnch_swt::stage_config config{
    .measured_iteration_count = 10,
    .benchmark_type = bnch_swt::benchmark_types::cuda
};
using bench = bnch_swt::benchmark_stage<"cuda-test", config>;
float* gpu_data;
cudaMalloc(&gpu_data, 1024 * sizeof(float));
bench::run_benchmark_from_host<"kernel-test", "my-kernel", cuda_host_launcher>(
    1024 * sizeof(float), gpu_data, 1024);

`run_benchmark_cooperative<test_name, subject_name, function>(args...)`

Executes CUDA cooperative group kernels requiring grid-wide synchronization.

Parameters:

test_name: String literal grouping related benchmarks
subject_name: String literal identifying this specific implementation
function: Function to benchmark (as non-type template parameter)
args...: Arguments forwarded to the function

`print_results<metrics_presence>(show_metrics)`

Displays performance metrics with statistical analysis, rankings, and confidence intervals.

Parameters:

metrics_presence: Template parameter controlling which metrics to display
show_metrics: Whether to show detailed hardware counter metrics

Example:

benchmark::print_results();  // Default metrics

// Custom metric selection
bnch_swt::performance_metrics_presence<bnch_swt::benchmark_types::cpu> custom_metrics{};
custom_metrics.throughput_mb_per_sec = true;
custom_metrics.cycles_per_byte = true;
custom_metrics.instructions_per_cycle = true;
benchmark::print_results<custom_metrics>(true);

`generate_markdown(title, file_path)`

Generates a formatted Markdown report of all benchmark results.

Parameters:

title: Title for the report
file_path: Optional directory path to save the report (auto-named with OS and compiler)

Returns: std::string containing the Markdown report

Example:

auto report = benchmark::generate_markdown("Performance Analysis", "./results/");
std::cout << report << std::endl;

`get_all_results()`

Returns all results organized by test name.

Returns: std::vector<stage_results<stage_name, benchmark_type>::test_results>

`get_test_results(test_name)`

Returns results for a specific test name.

Returns: std::unordered_map<std::string_view, performance_metrics<benchmark_type>>

`clear_all_results()`

Resets all collected results for the stage.

Benchmark Function Requirements

Benchmark functions must be defined as structs with a static impl() method:

For CPU benchmarks:

struct my_cpu_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
        uint64_t bytes_processed = /* calculate bytes */;
        return bytes_processed;
    }
};

For CUDA benchmarks:

struct my_cuda_benchmark {
    BNCH_SWT_DEVICE static void impl(/* your parameters */) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // kernel code here
    }
};

Key differences:

CPU: impl() returns uint64_t (bytes processed) and uses BNCH_SWT_HOST
CUDA: impl() returns void, uses BNCH_SWT_DEVICE, and contains kernel code
CUDA: Bytes processed is passed as a parameter to run_benchmark_from_host()

Adaptive Benchmarking

As of v1.0.2, benchmarksuite features adaptive iteration scaling that automatically determines the optimal number of iterations for statistical stability.

How It Works

Starts with small iteration count: Begins with measured_iteration_count * 2 iterations
Sliding window analysis: Evaluates all consecutive windows of measured_iteration_count iterations
Stability detection: Continues until throughput deviation ≤ desired_percentage_deviation
Iteration doubling: Doubles iteration count each round until stability or limits reached
Time protection: Automatically stops after max_time_seconds to prevent excessively long runs
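
The window scan and the doubling step can be sketched in plain C++. This is an illustration of the idea only: the guide does not state how benchmarksuite computes deviation, so the population standard deviation relative to the mean is an assumption here, and window_result, percentage_deviation, most_stable_window, and next_iteration_count are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical result type for this sketch.
struct window_result {
    std::size_t first;  // index of the first sample in the window
    double deviation;   // percentage deviation of that window
};

// Population standard deviation of samples[first, first + count), expressed
// as a percentage of the window mean.
inline double percentage_deviation(const std::vector<double>& samples, std::size_t first, std::size_t count) {
    double mean = 0.0;
    for (std::size_t i = first; i < first + count; ++i) {
        mean += samples[i];
    }
    mean /= static_cast<double>(count);
    if (mean == 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    double variance = 0.0;
    for (std::size_t i = first; i < first + count; ++i) {
        variance += (samples[i] - mean) * (samples[i] - mean);
    }
    variance /= static_cast<double>(count);
    return std::sqrt(variance) / mean * 100.0;
}

// Scans every consecutive window and returns the one with the lowest deviation.
inline window_result most_stable_window(const std::vector<double>& samples, std::size_t window) {
    window_result best{ 0, std::numeric_limits<double>::infinity() };
    if (window == 0 || samples.size() < window) {
        return best;
    }
    for (std::size_t first = 0; first + window <= samples.size(); ++first) {
        const double deviation = percentage_deviation(samples, first, window);
        if (deviation < best.deviation) {
            best = { first, deviation };
        }
    }
    return best;
}

// Doubles the iteration count for the next round, capped at the upper limit.
inline std::size_t next_iteration_count(std::size_t current, std::size_t max_count) {
    return current > max_count / 2 ? max_count : current * 2;
}
```

A driver would repeat the measurement with next_iteration_count() until the best window's deviation is at or below desired_percentage_deviation, or until the iteration or time limit is reached.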

Configuration Example

constexpr bnch_swt::stage_config precise_config{
    .max_execution_count = 10000,           // Upper limit
    .measured_iteration_count = 50,          // Window size for analysis
    .desired_percentage_deviation = 0.5,     // Target: 0.5% stability
    .max_time_seconds = 10.0                // Max 10 seconds per benchmark
};

using bench = bnch_swt::benchmark_stage<"precise-benchmark", precise_config>;

Benefits

No more guessing: No need to manually tune iteration counts
Comparable results: All benchmarks achieve similar statistical confidence
Time-efficient: Stops early for stable code, continues longer for noisy measurements
Reproducible: Same configuration produces consistent stability across runs

Statistical Analysis

Benchmark results now include 95% confidence interval analysis with automatic tie detection and ranking.

Statistical Features

Confidence intervals: Calculated from throughput deviation percentages
Statistical tie detection: Identifies when implementations are statistically indistinguishable
Automated ranking: Orders results with proper tie handling
Win/loss/tie tracking: Summary statistics across multiple tests
Markdown export: Professional reports for documentation

Output Example

=== STATISTICAL SUMMARY FOR int-to-string-comparison ===
(95% confidence intervals, statistical ties don't count as wins)

jsonifier::to_chars: 1 wins
glz::to_chars: 0 wins (1 second place)

=== STATISTICAL TIES (no clear winner) ===
fast_float: 2 tests where statistically tied for first

Understanding Statistical Ties

When two implementations have overlapping confidence intervals, they are considered statistically tied - neither is significantly faster than the other. This is reported clearly to prevent over-interpretation of small performance differences.
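
The overlap test itself is small. The sketch below assumes the reported +/- percentage is the half-width of the interval around the throughput. The guide only says intervals are calculated from throughput deviation percentages, so the exact formula is an assumption, and measurement, interval, to_interval, and statistically_tied are hypothetical names.

```cpp
// Hypothetical types for this sketch; not part of benchmarksuite.
struct measurement {
    double throughput;         // e.g. MB/s
    double deviation_percent;  // reported +/- percentage
};

struct interval {
    double low;
    double high;
};

// Assumption: the interval is throughput +/- (throughput * deviation / 100).
inline interval to_interval(const measurement& m) {
    const double half_width = m.throughput * m.deviation_percent / 100.0;
    return { m.throughput - half_width, m.throughput + half_width };
}

// Two results are tied when their intervals overlap.
inline bool statistically_tied(const measurement& a, const measurement& b) {
    const interval ia = to_interval(a);
    const interval ib = to_interval(b);
    return ia.low <= ib.high && ib.low <= ia.high;
}
```

Under this assumption, 84.58 MB/s +/-1.23% and 75.95 MB/s +/-2.17% give intervals of roughly [83.54, 85.62] and [74.30, 77.60], which do not overlap, so the faster result counts as a win.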

CPU vs GPU Benchmarking

Benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_type enum in stage_config.

CPU Benchmarks

struct cpu_computation_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
        for (size_t i = 0; i < input.size(); ++i) {
            output[i] = std::sqrt(input[i] * input[i] + 1.0f);
        }
        return input.size() * sizeof(float);
    }
};

constexpr bnch_swt::stage_config cpu_config{
    .max_execution_count = 200,
    .measured_iteration_count = 25,
    .benchmark_type = bnch_swt::benchmark_types::cpu
};

using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", cpu_config>;

constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);

cpu_stage::run_benchmark<"math-test", "cpu-impl", cpu_computation_benchmark>(input, output);
cpu_stage::print_results();

GPU/CUDA Benchmarks

struct cuda_kernel_benchmark {
    BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            data[idx] = data[idx] * 2.0f;
        }
    }
};

constexpr bnch_swt::stage_config gpu_config{
    .max_execution_count = 100,
    .measured_iteration_count = 10,
    .benchmark_type = bnch_swt::benchmark_types::cuda,
    .max_time_seconds = 5.0
};

using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", gpu_config>;

constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

dim3 grid{(data_size + 255) / 256, 1, 1};  // one thread per element
dim3 block{256, 1, 1};
uint64_t bytes_processed = data_size * sizeof(float);

cuda_stage::run_benchmark_from_host<"kernel-test", "gpu-impl", cuda_kernel_benchmark>(
    bytes_processed, grid, block, 0, gpu_data, data_size);

cuda_stage::print_results();
cudaFree(gpu_data);
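
Because the kernel above handles one element per thread, the launch needs at least as many threads as there are elements, or the reported bytes processed will overstate the work done. The grid size is a ceiling division, sketched here as host-side arithmetic (blocks_for is a hypothetical helper name):

```cpp
#include <cstdint>

// Number of blocks needed so that blocks * threads_per_block >= element_count.
constexpr uint64_t blocks_for(uint64_t element_count, uint64_t threads_per_block) {
    return (element_count + threads_per_block - 1) / threads_per_block;
}
```

For 1024 * 1024 elements and 256 threads per block this gives 4096 blocks. The idx < size guard in the kernel covers the case where the element count is not a multiple of the block size.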

Mixed CPU/GPU Benchmarking

Compare CPU and GPU implementations side-by-side:

constexpr uint64_t data_size = 1024 * 1024;

// CPU implementation
struct cpu_process {
    BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
        for (size_t i = 0; i < cpu_data.size(); ++i) {
            cpu_data[i] = cpu_data[i] * 2.0f;
        }
        return cpu_data.size() * sizeof(float);
    }
};

// GPU implementation
struct gpu_process {
    BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            gpu_data[idx] = gpu_data[idx] * 2.0f;
        }
    }
};

// Run both
constexpr bnch_swt::stage_config config{ .max_execution_count = 100, .measured_iteration_count = 10 };
using stage = bnch_swt::benchmark_stage<"cpu-vs-gpu", config>;

std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

stage::run_benchmark<"vector-multiply", "cpu-version", cpu_process>(cpu_data);

dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
stage::run_benchmark_from_host<"vector-multiply", "gpu-version", gpu_process>(
    data_size * sizeof(float), grid, block, 0, gpu_data, data_size);

stage::print_results();
cudaFree(gpu_data);

Cache Clearing Option

For accurate cold-cache CPU benchmarks:

constexpr bnch_swt::stage_config cold_cache_config{
    .max_execution_count = 200,
    .measured_iteration_count = 25,
    .clear_cpu_cache_between_each_iteration = true,
    .clear_cpu_cache_before_all_iterations = true
};

using cold_bench = bnch_swt::benchmark_stage<"cache-test", cold_cache_config>;
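
This guide does not say how benchmarksuite evicts the cache. One common generic approach, shown here only as a sketch, is to write to one byte per cache line across a buffer larger than the last-level cache. touch_every_line is a hypothetical name, and the 64-byte line size is an assumption that holds on most x86-64 CPUs but not everywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes to one byte per cache line of the buffer and returns a checksum so
// the compiler cannot discard the loop. Evicting earlier data only works if
// the buffer is larger than the cache being cleared.
inline uint64_t touch_every_line(std::vector<unsigned char>& buffer, std::size_t line_bytes = 64) {
    if (line_bytes == 0) {
        return 0;
    }
    uint64_t checksum = 0;
    for (std::size_t i = 0; i < buffer.size(); i += line_bytes) {
        buffer[i] = static_cast<unsigned char>(buffer[i] + 1);
        checksum += buffer[i];
    }
    return checksum;
}
```

Clearing between every iteration measures cold-cache behavior but makes each iteration slower, so it is worth enabling only when cold-cache numbers are the goal.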

Custom Metrics

Specify custom metric names for specialized benchmarks:

constexpr bnch_swt::stage_config config{ .max_execution_count = 200, .measured_iteration_count = 25 };
using compression_bench = bnch_swt::benchmark_stage<"compression-test", config, "compression-ratio">;

struct compress_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
        auto compressed = compress_data(input);
        return (input.size() * 1000) / compressed.size();  // Ratio * 1000
    }
};

compression_bench::run_benchmark<"compression", "my-compressor", compress_benchmark>(input_data);
compression_bench::print_results();  // Shows "compression-ratio" instead of MB/s
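
Since impl() returns an integer, the example above encodes the ratio as fixed-point with three decimal places. A small sketch of that encoding and its inverse follows. scaled_ratio and to_ratio are hypothetical helpers, and the guard for an empty compressed buffer is an addition that the example does not include.

```cpp
#include <cstdint>

// Encodes input/compressed as an integer with three decimal places.
// Returns 0 when the compressed size is 0 to avoid dividing by zero.
constexpr uint64_t scaled_ratio(uint64_t input_bytes, uint64_t compressed_bytes) {
    return compressed_bytes == 0 ? 0 : (input_bytes * 1000) / compressed_bytes;
}

// Converts the fixed-point value back to a floating-point ratio.
constexpr double to_ratio(uint64_t scaled) {
    return static_cast<double>(scaled) / 1000.0;
}
```

A 4096-byte input compressed to 1024 bytes is reported as 4000, which reads as a ratio of 4.0.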

Advanced Benchmark Methods

Host-Launched Kernels

Use run_benchmark_from_host() for custom CUDA kernel configurations:

struct custom_launcher {
    static void impl(float* data, uint64_t size, int shared_bytes) {
        dim3 grid{static_cast<unsigned int>((size + 255) / 256)};
        dim3 block{256};
        my_kernel<<<grid, block, shared_bytes>>>(data, size);
        cudaDeviceSynchronize();
    }
};

constexpr bnch_swt::stage_config config{ .benchmark_type = bnch_swt::benchmark_types::cuda };
using bench = bnch_swt::benchmark_stage<"custom-kernel", config>;

bench::run_benchmark_from_host<"launch-test", "custom", custom_launcher>(
    data_size * sizeof(float), gpu_data, data_size, 4096);

Cooperative Kernels

Use run_benchmark_cooperative() for kernels requiring grid-wide sync:

constexpr auto cooperative_reduce = [](float* data, float* result, uint64_t size) -> uint64_t {
    cooperative_groups::grid_group grid = cooperative_groups::this_grid();
    // reduction logic here
    grid.sync();
    return size * sizeof(float);
};

bench::run_benchmark_cooperative<"reduce-test", "grid-reduce", cooperative_reduce>(
    grid, block, shared_mem, stream, bytes_processed, gpu_data, gpu_result, size);

Running Benchmarks

With vcpkg + CMake (recommended):

cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release

# Linux/macOS (single-config generators)
./build/your_benchmark
# Windows (Visual Studio generators)
.\build\Release\your_benchmark.exe

Manual CMake build:

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark

For CUDA benchmarks, specify target architecture:

cmake -B build -S . \
  -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86
  
cmake --build build --config Release

Common CMake Options

-DCMAKE_BUILD_TYPE=Release - Build optimized release version
-DCMAKE_CUDA_ARCHITECTURES=86 - Target specific CUDA compute capability (e.g., 86 for RTX 30xx, 89 for RTX 40xx)
-DCMAKE_CXX_COMPILER=clang++ - Specify C++ compiler
-DCMAKE_CUDA_COMPILER=nvcc - Specify CUDA compiler

Complete Project Example

Project structure:

my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
    ├── cpu_benchmark.hpp
    └── gpu_benchmark.cuh

CMakeLists.txt:

cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

find_package(benchmarksuite CONFIG REQUIRED)

add_executable(my_benchmark 
    main.cpp
    benchmarks/cpu_benchmark.hpp
    benchmarks/gpu_benchmark.cuh
)

target_link_libraries(my_benchmark PRIVATE 
    benchmarksuite::benchmarksuite
)

set_target_properties(my_benchmark PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

if(MSVC)
    target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
    target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()

vcpkg.json:

{
  "name": "my-benchmark",
  "version": "1.0.0",
  "dependencies": [
    "rtc-benchmarksuite"
  ]
}

Output and Results

Standard Output

----------------------------------------
CPU Performance Metrics for Stage: int-to-string-comparison
Running on: AMD Ryzen 9 7950X 16-Core Processor
OS: Linux 6.8.0
Compiler: GNU 13.2.0
----------------------------------------
Test: conversion-test
----------------------------------------
1. jsonifier::to_chars (84.58 MB/s +/-1.23%) | ~11.36% faster than glz::to_chars
----------------------------------------
Metrics for: jsonifier::to_chars
Total Iterations to Stabilize: 394
Measured Iterations: 20
Bytes Processed: 512.00
Nanoseconds per Execution: 5785.25
Frequency (GHz): 4.83
Throughput (MB/s): 84.58
Throughput Percentage Deviation (+/-%): 1.23
Cycles per Execution: 27921.20
Cycles per Byte: 54.53
Instructions per Execution: 52026.00
Instructions per Cycle: 1.86
Instructions per Byte: 101.61
----------------------------------------
2. glz::to_chars (75.95 MB/s +/-2.17%)
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize: 421
Measured Iterations: 20
Bytes Processed: 512.00
Nanoseconds per Execution: 6480.30
Throughput (MB/s): 75.95
Throughput Percentage Deviation (+/-%): 2.17
Cycles per Execution: 30314.40
Cycles per Byte: 59.21
Instructions per Execution: 51513.00
Instructions per Cycle: 1.70
----------------------------------------

=== STATISTICAL SUMMARY FOR int-to-string-comparison ===
(95% confidence intervals, statistical ties don't count as wins)

jsonifier::to_chars: 1 wins
glz::to_chars: 0 wins (1 second place)
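
The per-byte and per-cycle figures in this output are consistent with simple ratios of the raw counters. The sketch below reproduces them under that assumption. raw_counters and the three helpers are hypothetical names, not the library's API.

```cpp
// Hypothetical container for the raw per-execution values shown in the output.
struct raw_counters {
    double bytes;
    double cycles;
    double instructions;
};

constexpr double cycles_per_byte(const raw_counters& c) {
    return c.cycles / c.bytes;
}

constexpr double instructions_per_cycle(const raw_counters& c) {
    return c.instructions / c.cycles;
}

constexpr double instructions_per_byte(const raw_counters& c) {
    return c.instructions / c.bytes;
}
```

Feeding in the jsonifier::to_chars values (512 bytes, 27921.20 cycles, 52026 instructions) gives about 54.53 cycles per byte, 1.86 instructions per cycle, and 101.61 instructions per byte, matching the listing.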

Markdown Report Generation

auto report = benchmark::generate_markdown("Performance Analysis", "./results/");

Generates formatted Markdown files with complete statistical analysis, perfect for CI/CD documentation.

Features

Dual Benchmarking Support

CPU Benchmarking: Traditional CPU performance measurement with hardware counters
GPU/CUDA Benchmarking: Native CUDA kernel benchmarking with grid/block configuration
Mixed Workloads: Compare CPU vs GPU implementations side-by-side
Automatic Device Selection: Choose benchmark type via stage_config

Adaptive Benchmarking (v1.0.2+)

Automatic iteration scaling: Dynamically increases iterations until statistical stability
Sliding window analysis: Finds optimal consecutive block of iterations
Percentage deviation targeting: Configurable stability threshold (default: 1%)
Time-based termination: Maximum runtime protection (default: 5.5 seconds)

Statistical Analysis (v1.0.2+)

Confidence intervals: 95% confidence intervals for throughput comparisons
Statistical tie detection: Automatically identifies indistinguishable implementations
Automated ranking: Orders results with proper tie handling
Win/loss/tie tracking: Summary statistics across multiple tests
Markdown export: Professional reports for documentation

Advanced Execution Modes

Standard Benchmarking: Default run_benchmark() with adaptive iteration scaling
Host-Launched Kernels: run_benchmark_from_host() for custom kernel launch configurations
Cooperative Groups: run_benchmark_cooperative() for grid-wide synchronization
Function or Struct: Support for both function-based and struct-based benchmarks

Advanced Options

Cache Clearing: Optional cache eviction between iterations for cold-cache benchmarks
Custom Metrics: Define custom metric names for specialized benchmarks
Configurable Iterations: Control over warmup iterations and measured iterations via stage_config
Programmatic Access: Retrieve raw performance metrics via get_test_results()
Selective Metric Display: Customize which metrics are shown in output

Hardware Introspection

CPU Properties: Comprehensive CPU detection and automatic reporting
GPU Properties: CUDA device detection and reporting
Compiler Info: Automatic compiler ID and version capture
OS Detection: Operating system name and version in results

Performance Counters

Cross-platform CPU counters: Windows, Linux, macOS, Apple ARM
CUDA performance events: GPU-specific performance monitoring

API Conventions

As of v1.0.0, all APIs follow the snake_case naming convention:

Functions: do_not_optimize_away(), generate_random_integers(), print_results()
Types: size_type, string_literal, stage_config
Variables: bytes_processed, test_values

Migrating from v1.0.0

If you're upgrading from v1.0.0 to v1.0.2:

1. Update stage_config usage

Old:

bnch_swt::benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>

New:

constexpr bnch_swt::stage_config config{
    .max_execution_count = 200,
    .measured_iteration_count = 25,
    .benchmark_type = bnch_swt::benchmark_types::cpu
};
bnch_swt::benchmark_stage<"test", config>

2. Add test name parameter to run_benchmark

Old:

benchmark_stage::run_benchmark<"subject_name", function_type>(args...)

New:

benchmark_stage::run_benchmark<"test_name", "subject_name", function_type>(args...)

3. Update run_from_host to run_benchmark_from_host

Old:

benchmark_stage::run_from_host<"subject_name", function_type>(bytes_processed, args...)

New:

benchmark_stage::run_benchmark_from_host<"test_name", "subject_name", function_type>(
    bytes_processed, args...)

4. Remove show_comparison parameter

Old:

benchmark_stage::print_results(true, true)

New:

benchmark_stage::print_results(true)  // show_metrics only

5. Update result access

Old:

auto results = benchmark_stage::get_results();  // vector<performance_metrics>

New:

auto all = benchmark_stage::get_all_results();           // vector<test_results>
auto test = benchmark_stage::get_test_results("test_name");  // unordered_map<string_view, metrics>
benchmark_stage::clear_all_results();  // New method

6. (Optional) Enable adaptive benchmarking features

constexpr bnch_swt::stage_config adaptive_config{
    .max_execution_count = 10000,
    .measured_iteration_count = 50,
    .desired_percentage_deviation = 0.5,  // New: target stability
    .max_time_seconds = 10.0              // New: time limit
};

Happy benchmarking with benchmarksuite v1.0.2! 🚀

For issues, feature requests, or contributions, please visit the GitHub repository.
