Optimise getValueOfBits and insertBits with BMI2 PEXT/PDEP (#717) by nez0b · Pull Request #796 · QuEST-Kit/QuEST

nez0b · 2026-06-15T13:00:13Z

Closes #717.

Summary

This adds guarded x86 BMI2 fast paths for QuEST's hot bit gather/scatter helpers and hoists the loop-invariant work out of the 2^N statevector loops, so each call collapses to a single instruction:

- insertBitsWithMaskedValues(...) (the gate-application path): I build the position mask once per
  gate and use _pdep_u64, instead of rebuilding it on every amplitude.
- getValueOfBits(...) (measurement / diagonal-matrix / projector): _pext_u64 when the qubits are
  strictly increasing, falling back to the original loop otherwise.
- CMakeLists.txt adds a QUEST_ENABLE_BMI2 option (off by default): a default build stays
  portable scalar (no BMI2 in the binary, so it can't SIGILL on a pre-BMI2 CPU), and
  -DQUEST_ENABLE_BMI2=ON wires -mbmi2 (PEXT/PDEP) through the library (which then requires a
  BMI2-capable CPU at runtime).

The earlier PRs rebuilt the mask / re-checked sortedness inside
the per-amplitude function; here those are loop-invariant and computed once.
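
As a rough illustration of the scatter identity this relies on, the sketch below compares a scalar zero-bit-insertion loop against a single deposit into the complement of a position mask. The names insertZeroBitsScalar, makePosMask, depositBits and insertZeroBitsWithPosMask are hypothetical stand-ins, not QuEST's actual helpers, and the software fallback is there only so the sketch compiles without -mbmi2.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for the scalar path: insert a zero bit at each index.
// Assumes bitInds is strictly increasing and every index is below 63.
inline uint64_t insertZeroBitsScalar(uint64_t number, const int* bitInds, int numBits) {
    for (int k = 0; k < numBits; k++) {
        assert(bitInds[k] >= 0 && bitInds[k] < 63);
        uint64_t low  = number & ((1ULL << bitInds[k]) - 1);
        uint64_t high = (number >> bitInds[k]) << (bitInds[k] + 1);
        number = high | low;
    }
    return number;
}

// Loop-invariant: one set bit per index, computed once per gate.
inline uint64_t makePosMask(const int* bitInds, int numBits) {
    uint64_t mask = 0;
    for (int k = 0; k < numBits; k++)
        mask |= 1ULL << bitInds[k];
    return mask;
}

// PDEP where the compiler targets BMI2, otherwise a portable emulation.
inline uint64_t depositBits(uint64_t src, uint64_t mask) {
#if defined(__BMI2__)
    return _pdep_u64(src, mask);
#else
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   // isolate the lowest set bit of mask
        if (src & bit) out |= lowest;
        mask ^= lowest;
    }
    return out;
#endif
}

// Per-amplitude work: one deposit into the positions NOT occupied by the qubits.
inline uint64_t insertZeroBitsWithPosMask(uint64_t number, uint64_t posMask) {
    return depositBits(number, ~posMask);
}
```

For indices {1, 3} the mask is 0b1010, and both routes map 0b101 to 0b10001. OR-ing a value mask onto the result afterwards would give the masked-values variant.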

A note on impact. In practice this is a modest optimisation, and I want to be upfront about
that. The per-call speedup is large (6–12× in cache), but non-trivial state-vector simulation is
memory-bandwidth-bound, so most of that win is hidden: end-to-end I measure only ~1.0–1.3×
single-threaded, collapsing toward 1× (occasionally below) once the state is DRAM-bound or the run
saturates memory bandwidth across threads. It is probably not a large win for
big production runs.

Benchmarks

"Per call" means one invocation of the bit gather/scatter helper (getValueOfBits /
insertBitsWithMaskedValues). The plot below isolates that instruction: it's the bit-op run in a tight
loop with the result kept in a register and no statevector traffic.

Pure bit-op speedup: PDEP/PEXT are about 6–12× faster than the scalar loops in cache (k=6, single thread),
compressing to ~4–7× at the DRAM wall (one thread can't saturate memory):

That isolated number is the ceiling, not what a gate sees — inside a real kernel the bit-op only
computes an index, and the strided complex-amplitude load/store it gates dominates the cost (see Thread
scaling below). The three views form one memory-roofline ladder — isolated instruction ~6–12× →
inside a gate ~2.5× → whole circuit ~1.0–1.3× — each diluted by progressively more memory traffic.
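
One simple way to read that ladder is through Amdahl's law: if only a fraction f of the runtime is the bit-op and it speeds up s-fold, the overall gain is 1 / ((1 - f) + f / s). The sketch below is purely illustrative. The function names are hypothetical, and the model ignores effects such as bandwidth contention, so it cannot account for the sub-1× results.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction f of runtime is accelerated s-fold.
inline double overallSpeedup(double f, double s) {
    assert(s > 0.0 && f >= 0.0 && f <= 1.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Inverse: the runtime fraction implied by an observed overall speedup,
// given the per-call speedup s of the accelerated part.
inline double impliedFraction(double overall, double s) {
    assert(overall > 0.0 && s > 1.0);
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s);
}
```

Taking s = 8 as an assumed mid-range per-call figure, impliedFraction(2.5, 8.0) is about 0.69 and impliedFraction(1.25, 8.0) is about 0.23. Under this model the bit-op would account for roughly two thirds of an in-cache gate kernel but under a quarter of a whole circuit.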

End-to-end on whole circuits (AI-assisted benchmark generation; Xeon 6448H, single thread, all bit-identical to the scalar build):

circuit	nq=12	nq=16
QFT (controlled-phase ladder)	1.26×	1.01×
random / supremacy	1.16×	1.27×
Grover (many-controlled)	1.17×	1.27×
VQE + measurement	1.15×	1.28×

So the end-to-end win is modest — roughly 1.0–1.3× single-threaded, biggest where index math
dominates (controlled-phase-heavy circuits) on cache-resident states, and it shrinks toward 1× (and
occasionally below — e.g. 0.84× at nq=20 / 32 threads) once the state is DRAM-bound or threads
saturate bandwidth.

Thread scaling — the bit-op inside a real gate kernel (Xeon 6448H, L3-resident q=18). Unlike the
isolated plot above, each call here also reads/writes the strided complex amplitudes it indexes (~64 B of
scattered memory per call vs the plot's single sequential 8-byte read), so memory, not the instruction,
sets the pace:

threads	1	8	32	64	128
PDEP (scatter)	2.53×	1.30×	1.15×	1.11×	1.01×
PEXT (sorted gather)	3.89×	3.95×	2.52×	2.02×	1.22×

Even at 1 thread this is only 2.5–3.9× — not the ~7–8× the isolated plot shows at q=18 — because the
strided amplitude traffic already dominates; it then erodes toward ~1× by 128 threads as bandwidth
saturates. (The 1–32-thread runs are pinned to one idle-gated socket; 64/128 spill across this shared
node's other sockets, so read those two columns as trend rather than precise figures.)

I also tried VPSHUFBITQMB (AVX-512)

Because getValueOfBits can be handed unsorted qubits, I benchmarked the AVX-512 arbitrary-order
gather (VPSHUFBITQMB, BITALG) as an alternative to PEXT. Per-call, in cache (gather):

gather, in-cache	ns/call	vs scalar	needs
scalar loop	4.36	1.0×	—
PEXT (sorted only)	0.49	8.9×	BMI2
VPSHUFBITQMB (any order)	0.69	6.3×	AVX-512 BITALG
VPSHUFBITQMB ×8 (any order, batched)	0.41	10.6×	AVX-512 BITALG + loop rewrite

I didn't use it, for three reasons: (1) it needs AVX-512 BITALG (Ice Lake-SP+ / Zen 4+), so it would
need runtime CPU dispatch rather than a compile-time guard; (2) for the common sorted case PEXT is
already faster per call; (3) the only thing it uniquely buys — the unsorted gather — is a tiny slice
of the work. Instrumenting the four circuit families, scatter (PDEP) outweighs gather by ~58:1,
and the unsorted part is ~0.5% of bitwise work. BMI2-only keeps the diff small and portable for the
whole win.
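
To make the sorted-only restriction on PEXT concrete, here is a minimal sketch contrasting an order-preserving scalar gather with a mask-based extract. It assumes bit k of the gathered value comes from bit bitInds[k] of the input. gatherBitsScalar and extractBits are hypothetical names, and the non-BMI2 branch is a portable emulation so the sketch builds anywhere.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Order-preserving gather: bit k of the result is bit bitInds[k] of number,
// in whatever order bitInds lists the positions.
inline uint64_t gatherBitsScalar(uint64_t number, const int* bitInds, int numBits) {
    uint64_t value = 0;
    for (int k = 0; k < numBits; k++) {
        assert(bitInds[k] >= 0 && bitInds[k] < 64);
        value |= ((number >> bitInds[k]) & 1ULL) << k;
    }
    return value;
}

// PEXT (or its emulation) always packs the masked bits in ascending position order;
// the mask carries no information about the caller's preferred ordering.
inline uint64_t extractBits(uint64_t src, uint64_t mask) {
#if defined(__BMI2__)
    return _pext_u64(src, mask);
#else
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   // isolate the lowest set bit of mask
        if (src & lowest) out |= bit;
        mask ^= lowest;
    }
    return out;
#endif
}
```

With input 0b100 and positions {0, 2}, both routes return 2. With positions {2, 0} the scalar gather returns 1, while the extract over the same mask 0b101 still returns 2, which is why the unsorted case has to keep the loop.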

Tests

tests/unit/bitwise.cpp — checks the new mask-accepting helpers are bit-identical to the original
getValueOfBits / insertBitsWithMaskedValues over exhaustive-small and randomised inputs, plus
deterministic boundary cases at bits 31/32/61/62/63 (incl. the int64 sign bit), and
isStrictlyIncreasing. Built with -DQUEST_ENABLE_BMI2=ON it's compiled with -mbmi2, so it
exercises the real PEXT/PDEP path; 41,849 assertions pass (and pass equally on the scalar fallback).
examples/automated/benchmark_bitwise_bmi2.cpp — a quick benchmark that prints the timings and
which path was compiled in, so a non-x86 / non--mbmi2 CI runner just prints the scalar timings
rather than hitting SIGILL. Runs in well under a second.
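
For reference, reporting the compiled-in path could look roughly like the sketch below. The guard condition is an assumption loosely modelled on the macros named in this PR (QuEST's real condition may differ), and SKETCH_USE_BMI2 and compiledBitwisePath are hypothetical names.

```cpp
#include <string>

// Hypothetical guard: BMI2 enabled by the compiler, x86-64 host code only,
// and not overridden by the force-scalar switch.
#if defined(__BMI2__) && (defined(__x86_64__) || defined(_M_X64)) \
    && !defined(__CUDA_ARCH__) && !defined(QUEST_BITWISE_FORCE_SCALAR)
    #define SKETCH_USE_BMI2 1
#else
    #define SKETCH_USE_BMI2 0
#endif

// Reports which path this translation unit was compiled with.
inline std::string compiledBitwisePath() {
    return SKETCH_USE_BMI2 ? "BMI2 (PEXT/PDEP)" : "scalar";
}
```

Because the decision is made by the preprocessor, a build without -mbmi2 contains no PEXT/PDEP instructions at all. It simply reports "scalar".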

AI assistance

I used an AI assistant to help survey the candidate instructions (PEXT/PDEP, VPSHUFBITQMB, GFNI), draft
the hoisting and the compile guards, and assemble the benchmarks. I reviewed the diff myself, kept the
arbitrary-order semantics of getValueOfBits, and ran the validation above before opening this.

Closes #717

…#717)

Replace the per-amplitude looped bit gather/scatter in the CPU statevector/density-matrix kernels with x86 BMI2 PEXT/PDEP, hoisting the loop-invariant masks out of the 2^N loops so each per-amplitude call becomes one instruction:
- insertBitsWithMaskedValues sites: compute the position mask once per gate and use _pdep_u64 (order-invariant scatter; unconditionally correct).
- getValueOfBits sites: _pext_u64 when the qubits are strictly increasing, falling back to the original scalar loop otherwise (order is preserved).

Portability: BMI2 is opt-in via a new QUEST_ENABLE_BMI2 CMake option, OFF by default, so a default build stays portable scalar (no BMI2 in the binary, no SIGILL on pre-BMI2 CPUs). Enabling it wires -mbmi2 through the library; the intrinsics are additionally guarded to x86 host TUs (never CUDA/HIP device code). The scalar fallback is byte-identical, and QUEST_BITWISE_FORCE_SCALAR forces it on a BMI2-capable host.

Tests/benchmark: tests/unit/bitwise.cpp asserts the new helpers are bit-identical to the originals over exhaustive-small, randomised, and boundary inputs (bits 31/32/61/62/63 incl. the int64 sign bit); examples/automated adds a cross-platform benchmark that prints timings and which path was compiled in.

Bit-identical to the scalar path (verified by unit tests, the QuEST suite for the touched kernels, and amplitude hashes of QFT/random/Grover/VQE circuits).

Closes QuEST-Kit#717

TysonRayJones · 2026-06-15T19:48:42Z

Wew nice work! 🎉 Great to see thought put into the call sites!

I have tailored the CI to run with the new intrinsic pathway, and will manually test on another Intel machine.

Interestingly, this diff reveals two kinds of callers of (insert|get)Bits(): those which know the qubit list will be unconditionally sorted (and can ergo always make use of the intrinsic when available), and those which have to check (and so can only maybe make use of the intrinsic).

Unconditional example (here):

 qindex qubitsPosMask = getBitMask(sortedQubits.data(), numQubitBits);

 for (qindex n=0; n<numIts; n++) {

     qindex i0 = insertBitsWithMaskedValuesAndPosMask(n, qubitStateMask, qubitsPosMask, sortedQubits.data(), numQubitBits);

Conditional example (here)

 qindex qubitsPosMask = getBitMask(qubits.data(), numBits);  // loop-invariant: hoisted out of the per-amplitude loop
 bool qubitsSorted = isStrictlyIncreasing(qubits.data(), numBits);  // likewise loop-invariant (order checked once per gate)

 for (qindex n=0; n<numIts; n++) {
     qreal prob = norm(amps[n]);
     qindex i = concatenateBits(qureg.rank, n, qureg.logNumAmpsPerNode);

     qindex j = (qubitsSorted ? getValueOfBitsFromSortedPosMask(i, qubitsPosMask, qubits.data(), numBits) : getValueOfBits(i, qubits.data(), numBits));

This is a very important difference. In the unconditional case, everything can be fully inlined, and insertBitsWithMaskedValuesAndPosMask could become a | _pdep_u64(b, ~c) at compile-time. But in the conditional case, we have a tension. We can either...

1. (current design) Decide whether or not to (attempt to) leverage the intrinsic within each iteration of the loop. This ternary introduces branching in a hot loop, though since its value is fixed (which we can make more explicit using const), I believe that either:
   - The compiler will split the loop into two loops with distinct bodies based on qubitsSorted (loop unswitching).
   - Otherwise, branch prediction will become so simple/quick that after a few iterations, the ternary will be free.
2. unroll the loop ourselves (very ugly)
3. resolve a function pointer beforehand (disables inlining)
4. use some templating trick
5. unconditionally sort the qubits and change the algorithm logic

I believe the current design is actually simultaneously the cleanest and best performing (when the compiler isn't stoopid, and when the function doesn't have a natural always-sorted formulation). I'm noting the above mostly for my own reference/history!
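
For the record, the "templating trick" alternative might look something like the sketch below: the sortedness check runs once, then dispatches to one of two instantiations whose loop bodies contain no runtime branch. This is not what the PR does. All names are hypothetical, the gather is a portable emulation rather than the real intrinsic, and the accumulator merely stands in for the per-amplitude work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Order-preserving gather (stand-in for the scalar fallback).
inline uint64_t gatherInOrder(uint64_t number, const int* bitInds, int numBits) {
    uint64_t value = 0;
    for (int k = 0; k < numBits; k++)
        value |= ((number >> bitInds[k]) & 1ULL) << k;
    return value;
}

// Emulated PEXT-style gather; matches gatherInOrder only for strictly increasing indices.
inline uint64_t gatherViaMask(uint64_t number, uint64_t posMask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; posMask != 0; bit <<= 1) {
        uint64_t lowest = posMask & (~posMask + 1);
        if (number & lowest) out |= bit;
        posMask ^= lowest;
    }
    return out;
}

// Sorted is a template parameter, so each instantiation has a branch-free loop body.
template <bool Sorted>
uint64_t foldGathered(uint64_t numIts, uint64_t posMask, const std::vector<int>& qubits) {
    uint64_t acc = 0;
    for (uint64_t n = 0; n < numIts; n++) {
        uint64_t j;
        if constexpr (Sorted)
            j = gatherViaMask(n, posMask);
        else
            j = gatherInOrder(n, qubits.data(), (int) qubits.size());
        acc = acc * 31 + j;   // placeholder for the real per-amplitude work
    }
    return acc;
}

// The runtime decision is made once, outside the hot loop.
inline uint64_t foldGatheredDispatch(uint64_t numIts, const std::vector<int>& qubits) {
    uint64_t posMask = 0;
    bool sorted = true;
    for (std::size_t k = 0; k < qubits.size(); k++) {
        assert(qubits[k] >= 0 && qubits[k] < 64);
        posMask |= 1ULL << qubits[k];
        if (k > 0 && qubits[k] <= qubits[k - 1]) sorted = false;
    }
    return sorted ? foldGathered<true>(numIts, posMask, qubits)
                  : foldGathered<false>(numIts, posMask, qubits);
}
```

The cost is two copies of every such kernel in the binary, which is much the same code the compiler would emit if it unswitched the ternary on its own.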

I will make some cleanup changes to this PR, and possibly make an extension (e.g. an O(1) fallback alternative to the intrinsic). No more changes are necessary for unitaryHACK however! 🎉 Please comment on issue #717 so I can assign it to you!

TysonRayJones · 2026-06-15T19:49:56Z

    // use template param to compile-time unroll loop in insertBits()
    SET_VAR_AT_COMPILE_TIME(int, numBits, NumQubits, qubitInds.size());

+    qindex qubitsPosMask = getBitMask(sortedQubitInds.data(), numBits);  // loop-invariant: hoisted out of the per-amplitude loop


TODO: remove // loop-invariant: hoisted out of the per-amplitude loop comments

TysonRayJones · 2026-06-15T19:51:44Z

+ * reuses the original unrolled scalar routines and stays byte-identical.
+ */
+
+INLINE qindex insertBitsWithMaskedValuesAndPosMask(qindex number, qindex valueMask, [[maybe_unused]] qindex posMask, [[maybe_unused]] const int* bitInds, [[maybe_unused]] int numBits) {


TODO: consider renaming (perhaps the existing insertBitsWithMaskedValues should be renamed too for better disambiguation)

TysonRayJones · 2026-06-15T19:52:43Z

+#ifdef QUEST_BITWISE_USE_BMI2
+    return valueMask | (qindex) _pdep_u64((unsigned long long) number, ~ (unsigned long long) posMask);
+#else
+    return valueMask | insertBits(number, bitInds, numBits, 0);


TODO: consider replacing this with O(1) bespoke masked logic to make bitInds,numBits always redundant

TysonRayJones · 2026-06-15T19:53:22Z

+
+// Checked once per gate (loop-invariant), never per amplitude: getValueOfBits is order-sensitive,
+// so the PEXT path above is valid only when bitInds are strictly increasing.
+INLINE bool isStrictlyIncreasing(const int* bitInds, int numBits) {


TODO: move this to utilities (no need to inline; it's called once per backend function)
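
The body of the check is cut off in the excerpt above; a minimal sketch of what such a once-per-gate test could look like follows. The semantics are assumed (each index must exceed its predecessor, so repeats are rejected too), and the name isStrictlyIncreasingSketch is hypothetical.

```cpp
#include <cassert>

// Assumed semantics: true when every index is strictly greater than the one before it.
// Empty and single-element lists count as sorted.
inline bool isStrictlyIncreasingSketch(const int* bitInds, int numBits) {
    assert(numBits >= 0);
    for (int k = 1; k < numBits; k++)
        if (bitInds[k] <= bitInds[k - 1])
            return false;
    return true;
}
```

It is O(numBits) and runs once per backend call, so whether it is inlined should make no measurable difference.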

TysonRayJones · 2026-06-15T19:54:00Z


+    qindex ctrlsPosMask = getBitMask(sortedCtrls.data(), numCtrlBits);  // loop-invariant: hoisted out of the per-amplitude loop
+    qindex targsPosMask = getBitMask(targs.data(), numTargBits);  // loop-invariant: hoisted out of the per-amplitude loop
+    bool targsSorted = isStrictlyIncreasing(targs.data(), numTargBits);  // likewise loop-invariant (order checked once per gate)


TODO: Mark targsSorted as const to improve compiler-chance of loop unrolling

TysonRayJones · 2026-06-15T19:54:59Z


        // t = value of targeted bits, which may be in the prefix substate
-        qindex t = getValueOfBits(i, targs.data(), numTargBits);
+        qindex t = (targsSorted ? getValueOfBitsFromSortedPosMask(i, targsPosMask, targs.data(), numTargBits) : getValueOfBits(i, targs.data(), numTargBits));


TODO: add a comment (here and at the other invocations) that we anticipate the compiler will turn the loop-containing-a-branch into a branch-containing-loops (loop unswitching), or otherwise that branch prediction will be so good as to remove all hot-loop branching penalty

TysonRayJones · 2026-06-15T19:56:01Z

TODO: bitwise is not an API module, so we probably will not expose a dedicated test file like this. No need when the existing operator tests critically depend upon the bitwise function's correctness

TysonRayJones · 2026-06-15T19:56:37Z

+# When the user opts in via QUEST_ENABLE_BMI2, compile only the issue-#717 bitwise test with -mbmi2 so
+# it exercises the actual PEXT/PDEP path; otherwise (the default) it exercises the scalar fallback. The
+# assertions hold identically either way. CMP0118: source properties are visible to the target's
+# directory, so set it with explicit TARGET_DIRECTORY for the parent-scope 'tests' target.
+if (QUEST_ENABLE_BMI2)
+  include(CheckCXXCompilerFlag)
+  check_cxx_compiler_flag("-mbmi2" QUEST_TEST_SUPPORTS_MBMI2)
+  if (QUEST_TEST_SUPPORTS_MBMI2)
+    set_source_files_properties(
+      bitwise.cpp
+      TARGET_DIRECTORY tests
+      PROPERTIES COMPILE_OPTIONS "-mbmi2"
+    )
+  endif()
+endif()


TODO: redundant if we remove bitwise.cpp test

TysonRayJones · 2026-06-15T19:58:20Z

+# instead supplies their own -march=native still gets the fast path on their own CPU.
+if (QUEST_ENABLE_BMI2)
+  include(CheckCXXCompilerFlag)
+  check_cxx_compiler_flag("-mbmi2" QUEST_COMPILER_SUPPORTS_MBMI2)


QUEST_COMPILER_SUPPORTS_MBMI2 is a local var, so should become quest_compiler_supports_bmi2

TysonRayJones · 2026-06-15T19:59:40Z

+  if (QUEST_COMPILER_SUPPORTS_MBMI2)
+    target_compile_options(QuEST PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-mbmi2>)
+  else()
+    message(WARNING "QUEST_ENABLE_BMI2=ON but the compiler does not accept -mbmi2; building the scalar fallback.")


TODO: error instead of warn here, unless we change BMI2 to be ON by default

TysonRayJones · 2026-06-15T19:59:55Z

+option(
+  QUEST_ENABLE_BMI2
+  "Whether QuEST will accelerate CPU bit gather/scatter with x86 BMI2 (PEXT/PDEP) intrinsics (issue #717). Turned OFF by default; when ON, the resulting binary requires a BMI2-capable CPU at runtime."
+  OFF


TODO: consider enabling by default when detectedly supported by compiler

nez0b and others added 2 commits June 15, 2026 20:25

Tailor CI

ce7c3ae

nez0b mentioned this pull request Jun 15, 2026

Optimise getValueOfBits and insertBits #717

Closed

nez0b wants to merge 2 commits into QuEST-Kit:devel from nez0b:bmi2-bitwise-717
