Optimise getValueOfBits and insertBits with BMI2 PEXT/PDEP (#717) #796

Open
nez0b wants to merge 2 commits into QuEST-Kit:devel from nez0b:bmi2-bitwise-717

@nez0b nez0b commented Jun 15, 2026

Closes #717.

Summary

This adds guarded x86 BMI2 fast paths for QuEST's hot bit gather/scatter helpers and hoists the loop-invariant work out of the 2^N statevector loops, so each call collapses to a single instruction:

  • insertBitsWithMaskedValues(...) (the gate-application path): I build the position mask once per
    gate and use _pdep_u64, instead of rebuilding it on every amplitude.
  • getValueOfBits(...) (measurement / diagonal-matrix / projector): _pext_u64 when the qubits are
    strictly increasing, falling back to the original loop otherwise.
  • CMakeLists.txt adds a QUEST_ENABLE_BMI2 option (off by default): a default build stays
    portable scalar — no BMI2 in the binary, so it can't SIGILL on a pre-BMI2 CPU — and
    -DQUEST_ENABLE_BMI2=ON wires -mbmi2 (PEXT/PDEP) through the library (which then requires a
    BMI2-capable CPU at runtime).

The earlier PRs rebuilt the mask / re-checked sortedness inside
the per-amplitude function; here those are loop-invariant and computed once.
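
To make the hoisting concrete, here is a minimal sketch of the scatter path. It is not the PR's code: `bitMaskOf`, `insertZeroBitsScalar`, `depositBits` and `insertZeroBitsWithPosMask` are hypothetical names, `qindex` is assumed here to be an unsigned 64-bit integer, and a portable loop stands in for `_pdep_u64` whenever the file is not compiled with `-mbmi2`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

using qindex = std::uint64_t; // assumption: stands in for QuEST's qindex

// OR together one set bit per index; meant to run once per gate, outside the loop
inline qindex bitMaskOf(const std::vector<int>& inds) {
    qindex mask = 0;
    for (int ind : inds) {
        assert(ind >= 0 && ind < 64);
        mask |= qindex{1} << ind;
    }
    return mask;
}

// Scalar reference: open a zero bit at each index (indices strictly increasing)
inline qindex insertZeroBitsScalar(qindex number, const std::vector<int>& sortedInds) {
    for (int ind : sortedInds) {
        qindex left  = ((number >> ind) << ind) << 1;      // bits at or above ind move up by one
        qindex right = number & ((qindex{1} << ind) - 1);  // bits below ind stay in place
        number = left | right;
    }
    return number;
}

// Deposit the low bits of src into the set positions of mask (PDEP semantics)
inline qindex depositBits(qindex src, qindex mask) {
#if defined(__BMI2__)
    return _pdep_u64(src, mask);
#else
    qindex out = 0;
    for (qindex bit = 1; mask != 0; bit <<= 1) {
        if (src & bit) out |= mask & (~mask + 1);  // lowest set bit of mask
        mask &= mask - 1;                          // clear that bit
    }
    return out;
#endif
}

// Scatter into every position NOT named by the precomputed position mask
inline qindex insertZeroBitsWithPosMask(qindex number, qindex posMask) {
    return depositBits(number, ~posMask);
}
```

With indices `{1, 3}` and `number = 7`, both routines return `21` (binary `10101`). In a kernel, `bitMaskOf` would be evaluated once before the amplitude loop, and the helper in this PR additionally ORs the value mask into the result.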

A note on impact. In practice this is a modest optimisation, and I want to be upfront about
that. The per-call speedup is large (6–12× in cache), but non-trivial state-vector simulation is
memory-bandwidth-bound, so most of that win is hidden: end-to-end I measure only ~1.0–1.3×
single-threaded, collapsing toward 1× (occasionally below) once the state is DRAM-bound or the run
saturates memory bandwidth across threads. It is probably not a large win for
big production runs.

Benchmarks

"Per call" means one invocation of the bit gather/scatter helper (getValueOfBits /
insertBitsWithMaskedValues). The plot below isolates that instruction: it is the bit-op run in a tight
loop with the result kept in a register, with no statevector traffic.

Pure bit-op speedup: PDEP/PEXT are about 6–12× faster than the scalar loops in cache (k=6, single thread),
compressing to ~4–7× at the DRAM wall (one thread can't saturate memory):

[Figure: pr_speedup (isolated per-call speedup of PDEP/PEXT over the scalar loops)]

That isolated number is the ceiling, not what a gate sees — inside a real kernel the bit-op only
computes an index, and the strided complex-amplitude load/store it gates dominates the cost (see Thread
scaling below). The three views form one memory-roofline ladder — isolated instruction ~6–12× →
inside a gate ~2.5× → whole circuit ~1.0–1.3×
— each diluted by progressively more memory traffic.

End-to-end on whole circuits (AI-assisted benchmark generation; Xeon 6448H, single thread, all bit-identical to the scalar build):

| circuit | nq=12 | nq=16 |
|---|---|---|
| QFT (controlled-phase ladder) | 1.26× | 1.01× |
| random / supremacy | 1.16× | 1.27× |
| Grover (many-controlled) | 1.17× | 1.27× |
| VQE + measurement | 1.15× | 1.28× |

So the end-to-end win is modest — roughly 1.0–1.3× single-threaded, biggest where index math
dominates (controlled-phase-heavy circuits) on cache-resident states, and it shrinks toward 1× (and
occasionally below — e.g. 0.84× at nq=20 / 32 threads) once the state is DRAM-bound or threads
saturate bandwidth.

Thread scaling — the bit-op inside a real gate kernel (Xeon 6448H, L3-resident q=18). Unlike the
isolated plot above, each call here also reads/writes the strided complex amplitudes it indexes (~64 B of
scattered memory per call vs the plot's single sequential 8-byte read), so memory, not the instruction,
sets the pace:

| threads | 1 | 8 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| PDEP (scatter) | 2.53× | 1.30× | 1.15× | 1.11× | 1.01× |
| PEXT (sorted gather) | 3.89× | 3.95× | 2.52× | 2.02× | 1.22× |

Even at 1 thread this is only 2.5–3.9× — not the ~7–8× the isolated plot shows at q=18 — because the
strided amplitude traffic already dominates; it then erodes toward ~1× by 128 threads as bandwidth
saturates. (The 1–32-thread runs are pinned to one idle-gated socket; 64/128 spill across this shared
node's other sockets, so read those two columns as trend rather than precise figures.)

I also tried VPSHUFBITQMB (AVX-512)

Because getValueOfBits can be handed unsorted qubits, I benchmarked the AVX-512 arbitrary-order
gather (VPSHUFBITQMB, BITALG) as an alternative to PEXT. Per-call, in cache (gather):

| gather, in-cache | ns/call | vs scalar | needs |
|---|---|---|---|
| scalar loop | 4.36 | 1.0× | |
| PEXT (sorted only) | 0.49 | 8.9× | BMI2 |
| VPSHUFBITQMB (any order) | 0.69 | 6.3× | AVX-512 BITALG |
| VPSHUFBITQMB ×8 (any order, batched) | 0.41 | 10.6× | AVX-512 BITALG + loop rewrite |

I didn't use it, for three reasons: (1) it needs AVX-512 BITALG (Ice Lake-SP+ / Zen 4+), so it would
need runtime CPU dispatch rather than a compile-time guard; (2) for the common sorted case PEXT is
already faster per call; (3) the only thing it uniquely buys — the unsorted gather — is a tiny slice
of the work. Instrumenting the four circuit families, scatter (PDEP) outweighs gather by ~58:1,
and the unsorted part is ~0.5% of bitwise work. Staying BMI2-only keeps the diff small and portable
while still capturing the whole win.

Tests

  • tests/unit/bitwise.cpp — checks the new mask-accepting helpers are bit-identical to the original
    getValueOfBits / insertBitsWithMaskedValues over exhaustive-small and randomised inputs, plus
    deterministic boundary cases at bits 31/32/61/62/63 (incl. the int64 sign bit), and
    isStrictlyIncreasing. Built with -DQUEST_ENABLE_BMI2=ON it's compiled with -mbmi2, so it
    exercises the real PEXT/PDEP path; 41,849 assertions pass (and pass equally on the scalar fallback).
  • examples/automated/benchmark_bitwise_bmi2.cpp: a quick benchmark that prints the timings and
    which path was compiled in, so a CI runner that is not x86 or not built with -mbmi2 just prints
    the scalar timings rather than hitting SIGILL. Runs in well under a second.
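
As a rough illustration of how a "which path was compiled in" report can avoid SIGILL, the sketch below selects the path entirely at preprocessing time. The exact guard used in this PR is not reproduced here: the condition, the `SKETCH_USE_BMI2` macro and the `compiledBitwisePath` / `gatherBits` names are assumptions for illustration, and only `QUEST_BITWISE_FORCE_SCALAR` is taken from the description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed guard: x86-64 host code, BMI2 enabled by the compiler flags,
// not CUDA/HIP device code, and not forced onto the scalar path.
#if defined(__BMI2__) && (defined(__x86_64__) || defined(_M_X64)) \
    && !defined(__CUDA_ARCH__) && !defined(__HIP_DEVICE_COMPILE__) \
    && !defined(QUEST_BITWISE_FORCE_SCALAR)
    #include <immintrin.h>
    #define SKETCH_USE_BMI2 1
#else
    #define SKETCH_USE_BMI2 0
#endif

// Reports the path chosen at compile time; no BMI2 instruction is executed here
inline std::string compiledBitwisePath() {
    return SKETCH_USE_BMI2 ? "bmi2" : "scalar";
}

// Gather the bits of number selected by mask into the low bits of the result
inline std::uint64_t gatherBits(std::uint64_t number, std::uint64_t mask) {
#if SKETCH_USE_BMI2
    return _pext_u64(number, mask);
#else
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (number & mask & (~mask + 1)) out |= bit;  // test lowest set bit of mask
        mask &= mask - 1;                             // clear that bit
    }
    return out;
#endif
}
```

Because the choice is made by the preprocessor, a binary built without `-mbmi2` contains only the scalar loop, and `compiledBitwisePath()` returns `"scalar"` on such a build.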

AI assistance

I used an AI assistant to help survey the candidate instructions (PEXT/PDEP, VPSHUFBITQMB, GFNI), draft
the hoisting and the compile guards, and assemble the benchmarks. I reviewed the diff myself, kept the
arbitrary-order semantics of getValueOfBits, and ran the validation above before opening this.


nez0b and others added 2 commits June 15, 2026 20:25
…#717)

Replace the per-amplitude looped bit gather/scatter in the CPU statevector/
density-matrix kernels with x86 BMI2 PEXT/PDEP, hoisting the loop-invariant
masks out of the 2^N loops so each per-amplitude call becomes one instruction:

- insertBitsWithMaskedValues sites: compute the position mask once per gate and
  use _pdep_u64 (order-invariant scatter; unconditionally correct).
- getValueOfBits sites: _pext_u64 when the qubits are strictly increasing,
  falling back to the original scalar loop otherwise (order is preserved).

Portability: BMI2 is opt-in via a new QUEST_ENABLE_BMI2 CMake option, OFF by
default, so a default build stays portable scalar (no BMI2 in the binary, no
SIGILL on pre-BMI2 CPUs). Enabling it wires -mbmi2 through the library; the
intrinsics are additionally guarded to x86 host TUs (never CUDA/HIP device
code). The scalar fallback is byte-identical, and QUEST_BITWISE_FORCE_SCALAR
forces it on a BMI2-capable host.

Tests/benchmark: tests/unit/bitwise.cpp asserts the new helpers are
bit-identical to the originals over exhaustive-small, randomised, and boundary
inputs (bits 31/32/61/62/63 incl. the int64 sign bit); examples/automated adds
a cross-platform benchmark that prints timings and which path was compiled in.

Bit-identical to the scalar path (verified by unit tests, the QuEST suite for
the touched kernels, and amplitude hashes of QFT/random/Grover/VQE circuits).

Closes QuEST-Kit#717
@TysonRayJones

Member

Wew nice work! 🎉 Great to see thought put into the call sites!

I have tailored the CI to run with the new intrinsic pathway, and will manually test on another Intel machine.

Interestingly, this diff reveals two kinds of callers of (insert|get)Bits() - those which know the qubit list will be unconditionally sorted (and can ergo always make use of the intrinsic when available), and those which have to check (and so can only maybe make use of the intrinsic).

  • Unconditional example (here):
     qindex qubitsPosMask = getBitMask(sortedQubits.data(), numQubitBits);
    
     for (qindex n=0; n<numIts; n++) {
    
         qindex i0 = insertBitsWithMaskedValuesAndPosMask(n, qubitStateMask, qubitsPosMask, sortedQubits.data(), numQubitBits);
  • Conditional example (here)
     qindex qubitsPosMask = getBitMask(qubits.data(), numBits);  // loop-invariant: hoisted out of the per-amplitude loop
     bool qubitsSorted = isStrictlyIncreasing(qubits.data(), numBits);  // likewise loop-invariant (order checked once per gate)
    
     for (qindex n=0; n<numIts; n++) {
         qreal prob = norm(amps[n]);
         qindex i = concatenateBits(qureg.rank, n, qureg.logNumAmpsPerNode);
    
         qindex j = (qubitsSorted ? getValueOfBitsFromSortedPosMask(i, qubitsPosMask, qubits.data(), numBits) : getValueOfBits(i, qubits.data(), numBits));

This is a very important difference. In the unconditional case, everything can be fully inlined, and insertBitsWithMaskedValuesAndPosMask could become a | _pdep_u64(b, ~c) at compile-time. But in the conditional case, we have a tension. We can either...

  • (current design) Decide whether or not to (attemptedly) leverage the intrinsic within each iteration of the loop. This ternary introduces branching in a hot-loop, though since its value is fixed (which we can make more explicit using const), I believe that either...
    • The compiler will unroll the entire loop into two loops, with distinct bodies based on qubitsSorted.
    • Otherwise, branch-prediction will become so simple/quick, that after a few iterations, the ternary will be free.
  • unroll the loop ourselves (very ugly)
  • resolve a function pointer beforehand (disables inlining)
  • use some templating trick
  • unconditionally sort the qubits and change the algorithm logic

I believe the current design is actually simultaneously the cleanest and best performing (when the compiler isn't stoopid, and when the function doesn't have a natural always-sorted formulation). I'm noting the above mostly for my own reference/history!
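
For reference, the first and fourth options above can be put side by side in a small sketch. It is illustrative only: all names (`gatherScalar`, `gatherSorted`, `makePlan`, `sumTernary`, `sumTemplated`) are hypothetical, `qindex` is assumed to be an unsigned 64-bit integer, a portable loop stands in for `_pext_u64`, and the loop body sums gathered values rather than touching amplitudes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using qindex = std::uint64_t; // assumption: stands in for QuEST's qindex

// Order-preserving gather: bit inds[i] of number becomes bit i of the result
inline qindex gatherScalar(qindex number, const int* inds, int numBits) {
    qindex value = 0;
    for (int i = 0; i < numBits; i++)
        value |= ((number >> inds[i]) & 1) << i;
    return value;
}

// PEXT-style gather: only matches gatherScalar when the indices are increasing
inline qindex gatherSorted(qindex number, qindex posMask) {
    qindex out = 0;
    for (qindex bit = 1; posMask != 0; bit <<= 1) {
        if (number & posMask & (~posMask + 1)) out |= bit;
        posMask &= posMask - 1;
    }
    return out;
}

struct GatherPlan { qindex posMask; bool sorted; };

// Loop-invariant work, done once per call rather than once per iteration
inline GatherPlan makePlan(const std::vector<int>& inds) {
    GatherPlan plan{0, true};
    for (std::size_t i = 0; i < inds.size(); i++) {
        plan.posMask |= qindex{1} << inds[i];
        if (i > 0 && inds[i] <= inds[i - 1]) plan.sorted = false;
    }
    return plan;
}

// Option A: a ternary on a const, loop-invariant flag inside the hot loop
inline qindex sumTernary(qindex numIts, const std::vector<int>& inds) {
    const GatherPlan plan = makePlan(inds);
    const int numBits = static_cast<int>(inds.size());
    qindex total = 0;
    for (qindex n = 0; n < numIts; n++)
        total += plan.sorted ? gatherSorted(n, plan.posMask)
                             : gatherScalar(n, inds.data(), numBits);
    return total;
}

// Option B: the flag is a template parameter, so each instantiation has one body
template <bool Sorted>
qindex sumLoop(qindex numIts, qindex posMask, const int* inds, int numBits) {
    qindex total = 0;
    for (qindex n = 0; n < numIts; n++) {
        if (Sorted) total += gatherSorted(n, posMask);   // compile-time constant condition
        else        total += gatherScalar(n, inds, numBits);
    }
    return total;
}

inline qindex sumTemplated(qindex numIts, const std::vector<int>& inds) {
    const GatherPlan plan = makePlan(inds);
    const int numBits = static_cast<int>(inds.size());
    return plan.sorted ? sumLoop<true>(numIts, plan.posMask, inds.data(), numBits)
                       : sumLoop<false>(numIts, plan.posMask, inds.data(), numBits);
}
```

Both variants return the same totals; the templated one guarantees two branch-free loop bodies at the cost of an extra function, whereas the ternary relies on the compiler or the branch predictor.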

I will make some cleanup changes to this PR, and possibly make an extension (e.g. an O(1) fallback alternative to the intrinsic). No more changes are necessary for unitaryHACK however! 🎉 Please comment on issue #717 so I can assign it to you!

// use template param to compile-time unroll loop in insertBits()
SET_VAR_AT_COMPILE_TIME(int, numBits, NumQubits, qubitInds.size());

qindex qubitsPosMask = getBitMask(sortedQubitInds.data(), numBits); // loop-invariant: hoisted out of the per-amplitude loop

Member

TODO: remove // loop-invariant: hoisted out of the per-amplitude loop comments

* reuses the original unrolled scalar routines and stays byte-identical.
*/

INLINE qindex insertBitsWithMaskedValuesAndPosMask(qindex number, qindex valueMask, [[maybe_unused]] qindex posMask, [[maybe_unused]] const int* bitInds, [[maybe_unused]] int numBits) {

Member

TODO: consider renaming (perhaps the existing insertBitsWithMaskedValues should be renamed too for better disambiguation)

#ifdef QUEST_BITWISE_USE_BMI2
return valueMask | (qindex) _pdep_u64((unsigned long long) number, ~ (unsigned long long) posMask);
#else
return valueMask | insertBits(number, bitInds, numBits, 0);

Member

TODO: consider replacing this with O(1) bespoke masked logic to make bitInds,numBits always redundant


// Checked once per gate (loop-invariant), never per amplitude: getValueOfBits is order-sensitive,
// so the PEXT path above is valid only when bitInds are strictly increasing.
INLINE bool isStrictlyIncreasing(const int* bitInds, int numBits) {

Member

TODO: move this to utilities (no need to inline; it's called once per backend function)
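
The body of isStrictlyIncreasing is not shown in this excerpt, so the following is only a guess at what such a check might look like, together with a counterexample showing why it matters. The names `isStrictlyIncreasingSketch`, `gatherInOrder` and `extractByMask` are hypothetical, and a portable loop stands in for `_pext_u64`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: true when every index exceeds its predecessor
inline bool isStrictlyIncreasingSketch(const int* bitInds, int numBits) {
    for (int i = 1; i < numBits; i++)
        if (bitInds[i] <= bitInds[i - 1]) return false;
    return true;
}

// Order-sensitive gather: bit bitInds[i] of number lands at bit i of the result
inline std::uint64_t gatherInOrder(std::uint64_t number, const int* bitInds, int numBits) {
    std::uint64_t value = 0;
    for (int i = 0; i < numBits; i++)
        value |= ((number >> bitInds[i]) & 1) << i;
    return value;
}

// PEXT semantics: selected bits are packed in ascending position order,
// regardless of the order in which the indices were listed
inline std::uint64_t extractByMask(std::uint64_t number, std::uint64_t mask) {
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (number & mask & (~mask + 1)) out |= bit;
        mask &= mask - 1;
    }
    return out;
}
```

For `number = 5` and the unsorted indices `{5, 0, 2}` (mask `0x25`), the ordered gather gives `6` while the mask-based extract gives `3`; with the sorted indices `{0, 2, 5}` both give `3`. That mismatch is the reason the mask-based path is taken only after the sortedness check passes.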


qindex ctrlsPosMask = getBitMask(sortedCtrls.data(), numCtrlBits); // loop-invariant: hoisted out of the per-amplitude loop
qindex targsPosMask = getBitMask(targs.data(), numTargBits); // loop-invariant: hoisted out of the per-amplitude loop
bool targsSorted = isStrictlyIncreasing(targs.data(), numTargBits); // likewise loop-invariant (order checked once per gate)

Member

TODO: Mark targsSorted as const to improve compiler-chance of loop unrolling


// t = value of targeted bits, which may be in the prefix substate
- qindex t = getValueOfBits(i, targs.data(), numTargBits);
+ qindex t = (targsSorted ? getValueOfBitsFromSortedPosMask(i, targsPosMask, targs.data(), numTargBits) : getValueOfBits(i, targs.data(), numTargBits));

Member

TODO: add comment (and to other invocations) that we anticipate the compiler to unroll the loop(branch) into branched(loops), or otherwise that branch prediction will be so good as to remove all hot-loop-branching penalty

Comment thread tests/unit/bitwise.cpp

Member

TODO: bitwise is not an API module, so we probably will not expose a dedicated test file like this. No need when the existing operator tests critically depend upon the bitwise function's correctness

Comment thread tests/unit/CMakeLists.txt
Comment on lines +22 to +36
# When the user opts in via QUEST_ENABLE_BMI2, compile only the issue-#717 bitwise test with -mbmi2 so
# it exercises the actual PEXT/PDEP path; otherwise (the default) it exercises the scalar fallback. The
# assertions hold identically either way. CMP0118: source properties are visible to the target's
# directory, so set it with explicit TARGET_DIRECTORY for the parent-scope 'tests' target.
if (QUEST_ENABLE_BMI2)
    include(CheckCXXCompilerFlag)
    check_cxx_compiler_flag("-mbmi2" QUEST_TEST_SUPPORTS_MBMI2)
    if (QUEST_TEST_SUPPORTS_MBMI2)
        set_source_files_properties(
            bitwise.cpp
            TARGET_DIRECTORY tests
            PROPERTIES COMPILE_OPTIONS "-mbmi2"
        )
    endif()
endif()

Member

TODO: redundant if we remove bitwise.cpp test

Comment thread CMakeLists.txt
# instead supplies their own -march=native still gets the fast path on their own CPU.
if (QUEST_ENABLE_BMI2)
include(CheckCXXCompilerFlag)
check_cxx_compiler_flag("-mbmi2" QUEST_COMPILER_SUPPORTS_MBMI2)

Member

QUEST_COMPILER_SUPPORTS_MBMI2 is a local var, so should become quest_compiler_supports_bmi2

Comment thread CMakeLists.txt
if (QUEST_COMPILER_SUPPORTS_MBMI2)
target_compile_options(QuEST PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-mbmi2>)
else()
message(WARNING "QUEST_ENABLE_BMI2=ON but the compiler does not accept -mbmi2; building the scalar fallback.")

Member

TODO: error instead of warn here, unless we change BMI2 to be ON by default

Comment thread CMakeLists.txt
option(
QUEST_ENABLE_BMI2
"Whether QuEST will accelerate CPU bit gather/scatter with x86 BMI2 (PEXT/PDEP) intrinsics (issue #717). Turned OFF by default; when ON, the resulting binary requires a BMI2-capable CPU at runtime."
OFF

Member

TODO: consider enabling by default when detectedly supported by compiler
