
Heap profiling + tcmalloc-style telemetry parity (Phases 2–11)#857

Open
jayakasadev wants to merge 68 commits into microsoft:main from jayakasadev:main


@jayakasadev

Contributor

Summary

Mega-PR landing the full heap-profiling + tcmalloc-style telemetry stack from the jayakasadev/snmalloc development fork onto microsoft/snmalloc:main. 65 squash commits, 113 files, +27,141 / -48 lines.

Caveat up front: this is intentionally large. If the maintainers prefer review by phase, their preferred chunking can shape a follow-up split; the per-phase commits are listed below so they can be cherry-picked individually. Upstream PR #852 (rust-heap-profiling-infra) is the partial Phase 2 predecessor of this work and is superseded by this PR if merged.

Phases shipped

  • Phase 2 — C++ sampling infrastructure (PRs #2/#3/#4 on fork)
    • Per-thread Poisson sampler (Sampler class) with bytes_until_sample_ countdown
    • Lock-free SampledList + pre-allocated node pool
    • Re-entrancy guard for backtrace()-style stack walkers
    • Pluggable stack walker abstraction (FP-walk default, libunwind/backtrace/CaptureStackBackTrace opt-in)
    • LazyArrayClientMetaDataProvider primitive (zero slab-meta bytes when profile inactive)
    • aarch64 PAC handling on Apple Silicon
  • Phase 3 — Allocation hooks + C exports (PRs #5–9)
    • ProfilingConfig with lazy provider
    • Single-chokepoint instrumentation: snmalloc::alloc(size_t) + Allocator::dealloc H1–H4 sites
    • Covers realloc / calloc / aligned_alloc / posix_memalign / large alloc / GWP-ASan secondary / slow-path recursion
    • SNMALLOC_PROFILE CMake gate + CI matrix entries
  • Phase 4 — Rust snapshot API (PRs #10–16)
    • profiling Cargo feature + snmalloc-sys FFI declarations
    • HeapProfile + BtSample + snapshot()/set_sampling_rate()
    • write_flamegraph() folded-stack output
    • Dump-time symbolicator (backtrace crate)
    • Runtime config (env vars + SnMalloc::configure_profiling())
    • Speedscope + Inferno round-trip tests
  • Phase 5 — Streaming allocation mode (PRs #17, #20)
    • AllocationSampleList C++ + ReportMalloc broadcast
    • sn_rust_profile_start/stop C exports + Rust ProfilingSession
  • Phase 6 — pprof output (PRs #18, #21)
    • HeapProfile::write_pprof() + pprof proto encoding
    • go tool pprof integration test
  • Phase 7 — Performance hardening (PRs #19, #22–24)
    • Cache-line placement of bytes_until_sample_
    • Criterion bench suite (snmalloc-rs/benches/profile_bench.rs)
    • Snapshot-under-churn TSan + ASan stress test
    • CI matrix expansion (Linux + macOS, gcc + clang, SNMALLOC_PROFILE=ON/OFF)
    • Profile fast-path overhead measured at ~0% within bench noise (docs/heap-profiling-benchmarks.md)
  • Phase 8 — Documentation (PRs #25, #26)
    • README profiling section + sampling-rate guidance + viewer tooling
    • Rust doc examples for snapshot()/write_flamegraph()/write_pprof()
    • Release notes deferred until this PR's review concludes
  • Phase 9 — Allocator-side telemetry parity with tcmalloc (PRs #42, #46, #48–53)
    • FullAllocStats typed struct + C ABI + Rust binding
    • Per-thread frontend cache stats (fast/slow path + remote + msg-queue counters)
    • Per-size-class histogram (live + cumulative alloc/dealloc, FULL tier only)
    • Backend fragmentation (mapped/committed/decommitted_to_os)
    • Sample lifetime histogram (log2 buckets, profile-gated)
    • Text dump API (snmalloc::dump_stats / SnMalloc::dump_stats, tcmalloc-style MALLOC: lines)
    • Runtime tunables (sample rate, decay rate, max local cache)
    • USE_SNMALLOC_STATS → SNMALLOC_STATS rename (cleanup of dead aggregate_stats refs)
  • Phase 10 — PMU-backed CPU-microarch profiling (PRs #41, #43, #44, #47)
    • Hot-spot table API + lookup_alloc_site(addr) reverse lookup
    • Build-time SNMALLOC_LIKELY/UNLIKELY inventory dumper (scripts/dump_branch_hints.py)
    • PMU workflow docs (docs/profiling-pmu.md)
    • snmalloc-tools Rust crate (CLI joiner over perf record / perf c2c / perf script)
  • Phase 11 — Overhead reduction + polish (PRs #54–66)
    • Tiered stats: SNMALLOC_STATS_BASIC (≤ 2% overhead target) + SNMALLOC_STATS_FULL (≤ 20% target)
    • Batched counter updates at small_refill (Phase 11.8 / 11.9 / 11.12)
    • Cache-line padded backend atomics (Phase 11.10)
    • Symbolicate-aware HotSpotKey::CallSite filter
    • Vendor dump_branch_hints.py into snmalloc-sys/upstream/
    • Largebuddy free-chunk histogram into FullAllocStats.reserved[0..16]
    • Final bench (Apple M4 Pro): BASIC ≤ 1.02 on small_allocs / medium_allocs / mixed; FULL ≤ 1.20 on all

Final overhead summary

5-run mean ratios from snmalloc-rs/benches/stats_bench.rs and snmalloc-rs/benches/profile_bench.rs on Apple M4 Pro, release + fat-LTO:

| Mode | small_allocs | medium_allocs | mixed |
| --- | --- | --- | --- |
| SNMALLOC_PROFILE=ON (idle) | 1.0036 | 0.9998 | 0.9925 |
| SNMALLOC_PROFILE=ON (active, 512 KiB sample) | 0.9983 | 0.9990 | 1.0026 |
| SNMALLOC_STATS_BASIC=ON | ~1.00 | 0.99 | ~1.00 |
| SNMALLOC_STATS_FULL=ON | 1.164 | 1.094 | 1.091 |

All within target. Full numbers + methodology in docs/heap-profiling-benchmarks.md.

ABI

FullAllocStats C struct uses a SNMALLOC_FULL_STATS_VERSION field (currently 2) + reserved[64] for forward-compat. Wave-2 fields stay zero when their build flag is off; existing fields remain populated. The legacy SNMALLOC_STATS=ON flag is preserved as an alias for SNMALLOC_STATS_BASIC.
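The version-plus-reserved pattern described above can be sketched roughly as follows. This is a hypothetical layout for illustration only: `DemoStats`, `kDemoStatsVersion`, `demo_fill`, and `demo_read_wave2` are invented names, and the fields are not the real FullAllocStats fields.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layout: NOT the real FullAllocStats, only the versioning pattern.
constexpr uint32_t kDemoStatsVersion = 2;

struct DemoStats
{
  uint32_t version;       // written by the producer
  uint32_t pad;           // keeps the 64-bit fields naturally aligned
  uint64_t mapped_bytes;  // stands in for an "existing" field
  uint64_t wave2_counter; // stands in for a newer field; zero when its flag is off
  uint64_t reserved[64];  // room for future fields without changing sizeof
};

// Producer side: zero everything, then fill in what this build supports.
inline void demo_fill(DemoStats* out, bool wave2_enabled)
{
  assert(out != nullptr);
  std::memset(out, 0, sizeof(*out));
  out->version = kDemoStatsVersion;
  out->mapped_bytes = 4096;
  if (wave2_enabled)
    out->wave2_counter = 7;
}

// Consumer side: only trust newer fields when the version is high enough.
inline uint64_t demo_read_wave2(const DemoStats& s)
{
  return s.version >= 2 ? s.wave2_counter : 0;
}
```

Because the struct size is fixed by the reserved tail, a consumer built against an older header can still read the fields it knows about.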

Test coverage

Full local sweep on Apple M4 Pro (CI minutes exhausted on fork; re-running here is gated by maintainer):

  • C++ ctest: 104/104 PASS (no long/stress jobs)
  • cargo test (no features): PASS
  • cargo test --features stats-basic: PASS
  • cargo test --features stats-full: PASS
  • cargo test --features profiling: PASS
  • cargo test --features profiling,symbolicate: PASS
  • cargo test --workspace (incl. snmalloc-tools): PASS

Review chunking suggestion

If the maintainer would prefer phase-by-phase landing, the squash commits listed in the commit history map 1:1 to fork PRs. Phase 2 and Phase 3 are the entry-points (everything else depends on the C++ sampling + hook infrastructure they introduce). After those land upstream, the remaining phases can land as independent PRs cherry-picked from this branch.

Commit list

(65 squashed commits — see the PR's commit tab for the chronological log; each commit corresponds to one fork-side PR.)

Introduces a per-slab client-meta provider that costs exactly one pointer
of inline metadata (sizeof(void*)) regardless of the slab's object count.
The backing T[] array is lazily materialised on the first get() call and
published via a double-checked compare-and-swap against an inline
stl::Atomic<T*>; concurrent first-touches resolve without a lock and the
losing thread decommits its temporary mapping with PAL::notify_not_using.

The lazy install path goes directly to DefaultPal (reserve + notify_using
<YesZero>) so it cannot recurse into user malloc, and the per-slab
overhead when never queried is one nullptr — appropriate for sampled
heap-profiling metadata that only a small fraction of slabs ever touch.

The primitive is purely additive: it is not yet wired into any Config and
no SNMALLOC_PROFILE gating is introduced (Phase 3 concerns). Existing
NoClientMetaDataProvider / ArrayClientMetaDataProvider, their call sites
in FrontendSlabMetadata::get_meta_for_object, and the global Config
selection are unchanged. Wiring this provider up will require threading
the per-slab object count from the pagemap MetaEntry through
get_meta_for_object to the new get(StorageType*, size_t, size_t) overload.

ClickUp: 86ahrfwmq
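
A minimal sketch of the lazy install described above, under stated assumptions: `LazySlot` and `lazy_get` are hypothetical names, and `std::calloc` / `std::free` stand in for the PAL reserve, notify_using<YesZero>, and notify_not_using calls (the real provider cannot call malloc on this path).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for the per-slab inline storage:
// exactly one pointer, null until the first lazy_get().
template<typename T>
struct LazySlot
{
  std::atomic<T*> array{nullptr};
};

// Returns element `index`, creating the zeroed backing array on first use.
// T is assumed to be a trivial type for which all-zero bytes are a valid value.
template<typename T>
T& lazy_get(LazySlot<T>& slot, size_t index, size_t count)
{
  assert(index < count);
  T* cur = slot.array.load(std::memory_order_acquire);
  if (cur == nullptr)
  {
    // Assumption: calloc replaces the direct PAL mapping used in the real code.
    T* fresh = static_cast<T*>(std::calloc(count, sizeof(T)));
    assert(fresh != nullptr);
    if (slot.array.compare_exchange_strong(
          cur, fresh, std::memory_order_acq_rel, std::memory_order_acquire))
    {
      cur = fresh; // this thread published the array
    }
    else
    {
      std::free(fresh); // lost the race; `cur` now holds the winner's array
    }
  }
  return cur[index];
}
```

The losing thread's cleanup is the only extra work on a contended first touch; every later call is a single acquire-load.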
Introduces the StackWalker abstraction described in
.claude/research/heap-profiling/stack-walker.md as a new PAL header
(pal_stack_walker.h, included from pal/pal.h). This is the first concrete
piece of Phase 2.1 of the heap-profiling milestone (ClickUp 86ahzwhq5).

Walker capabilities:
- FramePointerWalker: pure dependent-load loop with per-frame validation
  (alignment, strict-monotonic FP, stack-range, sentinel null-FP). Reads
  fp[0] (saved FP) and fp[1] (saved LR) from canonical aarch64/x86_64
  frame headers. On aarch64, unconditionally strips Pointer-Authentication
  Code bits from the saved LR via ptrauth_strip on Apple and xpaclri
  (HINT #7) elsewhere -- both decode to a NOP on cores without
  FEAT_PAuth, so cost is zero on non-PAC hardware.
- POD thread_local stack-bounds cache populated lazily via
  pthread_get_stackaddr_np on macOS and pthread_getattr_np on Linux.
  Zero-initialised; no constructor, no __cxa_thread_atexit, no malloc on
  first access -- the only construction pattern provably reentrancy-safe
  from inside an allocator's sample path.
- NullStackWalker fallback for unsupported targets (Windows, FreeBSD,
  OpenEnclave, CHERI/Morello, non-x86_64/aarch64). Returns 0 frames.
- Async-signal-safe: no malloc, no locks, no syscalls, no TLS
  construction. Graceful degradation on broken FP chains.
- Selection at compile time via preprocessor macros. No CMake option in
  this commit (deferred -- see "what's NOT done" below).
- A free function snmalloc::profile::stack_walk() wraps the default
  walker for callers that don't need to pick one explicitly.

Supported arches: x86_64 + aarch64 on Linux + macOS.
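
The per-frame validation listed above can be sketched as a loop over a frame chain. This is an illustrative version, not the code in pal_stack_walker.h: `fp_walk` is a hypothetical name, the bounds are passed in rather than read from the thread-local cache, PAC stripping is omitted, and alignment is checked against `alignof(uintptr_t)` rather than the ABI's frame alignment so the function can be exercised on a synthetic array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Walks a chain where fp[0] is the caller's saved FP and fp[1] the return address.
// [lo, hi) are the stack bounds. Returns the number of return addresses written.
inline size_t fp_walk(
  const uintptr_t* fp, uintptr_t lo, uintptr_t hi, uintptr_t* out, size_t max_frames)
{
  assert(out != nullptr || max_frames == 0);
  size_t n = 0;
  while (n < max_frames)
  {
    if (fp == nullptr)
      break; // sentinel null FP: end of chain
    uintptr_t addr = reinterpret_cast<uintptr_t>(fp);
    if (addr % alignof(uintptr_t) != 0)
      break; // misaligned frame pointer
    if (addr < lo || addr + 2 * sizeof(uintptr_t) > hi)
      break; // frame header lies outside the stack range
    out[n++] = fp[1];
    uintptr_t next = fp[0];
    if (next != 0 && next <= addr)
      break; // FP must strictly increase toward the stack base
    fp = reinterpret_cast<const uintptr_t*>(next);
  }
  return n;
}
```

Every exit is a plain `break`, so a corrupt chain truncates the capture instead of faulting.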

Microbenchmark (src/test/perf/stack_walker_bench/):
- Recursive call-chain builder with NOINLINE + tail-call-prevention
  asm-barriers. Sweeps depths 2/4/8/16/32, takes min of 5 repeats per
  depth, reports total ns / ns-per-iter / ns-per-frame and a two-point
  slope estimate.
- Auto-discovered by the existing perf harness; added to
  TESTLIB_ONLY_TESTS so it shares an object library across fast/check
  flavours.
- Asserts ns/frame < 50 (5x headroom over the ~10 ns/frame design
  target). Skipped under --smoke and Debug builds.
- Measured on Apple Silicon M-series: ~0.5-1.0 ns/frame steady state
  (deepest depth 35 captured frames, total ~21 ms / 1M iterations =
  20.6 ns/iter, slope 0.53 ns/frame). Well under the design target.

What is NOT done in this commit:
- The walker is NOT wired into any allocator path. No SNMALLOC_PROFILE
  gating exists yet; that lives in Phase 3.
- The matching CMake plumbing -- a SNMALLOC_PROFILE_STACK_WALKER
  option (fp / null / auto) and -fno-omit-frame-pointer injection for
  snmalloc TUs -- is left for a follow-up. The header today is
  controlled by SNMALLOC_PROFILE_STACK_WALKER_FP /
  SNMALLOC_PROFILE_STACK_WALKER_NULL preprocessor overrides plus an
  arch/OS auto-detection default.
- Stack-capture-at-sample-hit (ClickUp 86ahzwhq5's sibling 86ahzwhmh)
  is NOT included; it requires the Sampler from Phase 2.2.

Files:
- src/snmalloc/pal/pal_stack_walker.h (new, header-only)
- src/snmalloc/pal/pal.h (one #include line)
- src/test/perf/stack_walker_bench/stack_walker_bench.cc (new)
- CMakeLists.txt (one-word addition to TESTLIB_ONLY_TESTS)

ClickUp: 86ahzwhq5
#4)

Pure infrastructure for the heap-profiling milestone. Adds the per-thread
Poisson sampler, the SampledAlloc record + pre-allocated lock-free node
pool, the global lock-free intrusive list of currently-sampled allocations,
and the per-thread re-entrancy guard. Wires the FramePointerWalker from
Phase 2.1 into the sampler so a sample fire captures a stack at the
allocation site.

Purely additive: nothing is plumbed into snmalloc::alloc() / dealloc()
in this commit, no SNMALLOC_PROFILE gating yet (that is Phase 3 work),
and existing allocator behaviour is unchanged. All new code lives in
src/snmalloc/profile/, kept separate from src/snmalloc/pal/ because the
profiler is policy rather than platform abstraction.

Components:

- Sampler (sampler.h)
  Per-thread Poisson sampler. Fast path is one int64_t subtract + one
  signed-compare branch (~3-4 cycles). Slow path draws Exp(rate) via
  libm log on a doubles-in-(0,1] conversion of the xoshiro256** output;
  computes weight as `rate - bytes_until_sample + requested_size`
  (tcmalloc convention, bytes-of-request); acquires a node from the
  global NodePool; captures a stack via FramePointerWalker (skip=1);
  publishes on the global SampledList. First-sample bootstrap draws the
  initial countdown from Exp(rate) so the very first sample is unbiased
  -- the single most commonly-mishandled detail in DIY samplers.

- SampledAlloc (sampled_alloc.h)
  Cache-line aligned record holding alloc address, requested + allocated
  sizes, weight, the sampling interval that was in force at capture
  time (so a later set_sampling_rate doesn't mis-weight already-captured
  samples), tid, monotonic alloc_seq, captured stack frames, and an
  atomic NodeState. Stack depth knob defaults to 32 frames
  (SNMALLOC_PROFILE_STACK_FRAMES).

- NodePool (node_pool.h)
  Fixed-capacity lock-free Treiber stack of SampledAlloc nodes with a
  32-bit ABA tag packed into the high half of a 64-bit head word.
  Backing storage allocated directly via mmap / VirtualAlloc -- the
  profiler must never re-enter snmalloc's own allocator. acquire()
  returns nullptr and bumps a drop counter on exhaustion; callers
  silently skip the sample.

- SampledList (sampled_list.h)
  Lock-free intrusive singly-linked list. Tombstone bit packed into the
  low bit of `SampledAlloc::next` so liveness and link come from a
  single acquire-load. remove() is a CAS on the tombstone bit
  (linearisation point) followed by a best-effort linear unlink; lost
  unlink races leave the node as a tombstoned skip until the next walk
  reaps it. Cross-thread remove works because no thread ownership is
  implied -- whichever thread does the dealloc does the remove. No
  reclamation needed: node memory is owned by NodePool, not the list.

- ReentrancyGuard (reentrancy_guard.h)
  POD `thread_local uint8_t` (lives in .tbss, zero-initialised by the
  loader, no first-touch malloc, no __cxa_thread_atexit registration).
  RAII guard sets the flag on the sampler slow path so any transitive
  allocator call (e.g. glibc backtrace() lazy thread-cache init, or
  NodePool's first-call mmap) short-circuits via the fast-path
  `sampler_reentered()` check. Same pattern as pal_stack_walker.h's
  stack-bounds cache.
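
The Sampler's countdown and bootstrap can be sketched as below. This is a simplified illustration under assumptions: `DemoSampler` is a hypothetical class, `std::mt19937_64` stands in for xoshiro256**, and weight computation, node acquisition, and stack capture are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical simplified sampler; only the countdown logic is shown.
class DemoSampler
{
  int64_t bytes_until_sample_;
  double rate_; // mean number of bytes between samples
  std::mt19937_64 rng_;

  // Draws the next interval from an exponential distribution with mean rate_.
  int64_t draw_interval()
  {
    // Map 53 random bits to (0, 1] so log() never sees zero.
    double u = (static_cast<double>(rng_() >> 11) + 1.0) / 9007199254740992.0;
    return static_cast<int64_t>(-std::log(u) * rate_) + 1;
  }

public:
  DemoSampler(double rate, uint64_t seed) : rate_(rate), rng_(seed)
  {
    assert(rate > 0.0);
    // Bootstrap: the first countdown is itself an exponential draw,
    // so the first allocation is neither always nor never sampled.
    bytes_until_sample_ = draw_interval();
  }

  // Returns true when this allocation is selected as a sample.
  bool record_alloc(size_t size)
  {
    bytes_until_sample_ -= static_cast<int64_t>(size);
    if (bytes_until_sample_ > 0)
      return false; // fast path: one subtract and one signed compare
    bytes_until_sample_ = draw_interval(); // slow path: re-arm the countdown
    return true;
  }
};
```

With this scheme an allocation of `s` bytes is sampled with probability `1 - exp(-s / rate)`, which is what the distribution tests check against.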

Test (src/test/func/profile_sampler/profile_sampler.cc):

  * NodePool basic: exhaustion, drop counter, alloc_seq monotonicity,
    full release+reacquire round-trip.
  * Reentrancy guard: TLS flag toggle + record_alloc short-circuit
    under an active guard.
  * SampledList single-threaded push/remove/snapshot + double-remove
    is a no-op + drain.
  * SampledList concurrent push (4 threads x 512 allocs) -- all 2048
    nodes observed.
  * SampledList concurrent push + cross-thread remove (4 threads
    pushing, 4 different threads removing the other thread's nodes) --
    list ends up empty.
  * Sampler first-sample bootstrap (100k fresh Samplers, each does one
    record_alloc(64) at T=4096) -- observed hit count 5-sigma window
    catches both the "all-zero" bug (deterministic bootstrap) and the
    "auto-sample-first" bug.
  * Sampler distribution (4M record_alloc(64) at T=512KiB) -- observed
    sample count and summed weight both within statistical tolerance
    of the analytic expectation.
  * Rate change (3M allocs at T=64KiB then 3M at T=256KiB) -- weight
    sums correct for both phases, hits inversely proportional to rate.
  * End-to-end: Sampler::record_alloc fires, captured node is reachable
    via SamplerGlobals::list().snapshot() with non-zero stack_depth.
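
For reference, the NodePool behaviour exercised by the first test (exhaustion, drop counter, release and reacquire) can be sketched as a tagged Treiber stack. This is an assumption-laden illustration: `DemoPool` is a hypothetical class that hands out indices rather than SampledAlloc pointers, so that a 32-bit index and a 32-bit ABA tag fit in one 64-bit head word.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity pool of N node indices.
template<size_t N>
class DemoPool
{
  static_assert(N > 0 && N < 0xffffffffu, "capacity must fit in 32 bits");
  static constexpr uint32_t NIL = 0xffffffffu;

  std::atomic<uint32_t> next_[N];
  std::atomic<uint64_t> head_{0};
  std::atomic<uint64_t> drops_{0};

  // High half: ABA tag. Low half: index of the top free node.
  static uint64_t pack(uint32_t tag, uint32_t idx)
  {
    return (static_cast<uint64_t>(tag) << 32) | idx;
  }

public:
  DemoPool()
  {
    for (uint32_t i = 0; i < N; ++i)
      next_[i].store(i + 1 < N ? i + 1 : NIL, std::memory_order_relaxed);
    head_.store(pack(0, 0), std::memory_order_release);
  }

  // Returns a node index, or -1 (and counts a drop) when the pool is exhausted.
  int64_t acquire()
  {
    uint64_t h = head_.load(std::memory_order_acquire);
    for (;;)
    {
      uint32_t idx = static_cast<uint32_t>(h);
      if (idx == NIL)
      {
        drops_.fetch_add(1, std::memory_order_relaxed);
        return -1;
      }
      uint32_t nxt = next_[idx].load(std::memory_order_relaxed);
      uint64_t want = pack(static_cast<uint32_t>(h >> 32) + 1, nxt);
      if (head_.compare_exchange_weak(
            h, want, std::memory_order_acq_rel, std::memory_order_acquire))
        return idx;
    }
  }

  // Pushes a previously acquired index back onto the free stack.
  void release(uint32_t idx)
  {
    assert(idx < N);
    uint64_t h = head_.load(std::memory_order_acquire);
    for (;;)
    {
      next_[idx].store(static_cast<uint32_t>(h), std::memory_order_relaxed);
      uint64_t want = pack(static_cast<uint32_t>(h >> 32) + 1, idx);
      if (head_.compare_exchange_weak(
            h, want, std::memory_order_acq_rel, std::memory_order_acquire))
        return;
    }
  }

  uint64_t drops() const
  {
    return drops_.load(std::memory_order_relaxed);
  }
};
```

The tag changes on every successful push or pop, so a stale head value fails its CAS even if the same index has returned to the top in the meantime.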

Tickets: 86ahrfw19 (Sampler) 86ahrfw3f (SampledAlloc + NodePool)
         86ahrfw44 (SampledList) 86ahrfw58 (ReentrancyGuard)
         86ahrfw78 (unit tests) 86ahzwhmh (stack capture wiring)
         86ahzwhtq (weight contract)
- Add `option(SNMALLOC_PROFILE ...)` (default OFF) in CMakeLists.txt
  alongside SNMALLOC_COVERAGE.
- Add `add_as_define(SNMALLOC_PROFILE)` next to SNMALLOC_TRACING so the
  flag is plumbed through as a pure compile-time define on the snmalloc
  INTERFACE target. No source code reads it yet; alloc/dealloc hooks
  land in Phase 3.3.
- Add three CI matrix entries that mirror the existing "Traced Build"
  shape (build-only, reusable-cmake-build.yml, Release):
    * ubuntu-24.04 / gcc   / -DSNMALLOC_PROFILE=ON
    * ubuntu-24.04 / clang / -DSNMALLOC_PROFILE=ON
    * macos-15    / clang / -DSNMALLOC_PROFILE=ON

Verified locally on macOS arm64: configure + full build + all 86
ctest targets pass with -DSNMALLOC_PROFILE=ON, and the default (OFF)
build is byte-identical with respect to the new flag (define absent).
- New snmalloc::profile::record_dealloc<Config>(void*) free function in
  src/snmalloc/profile/record.h. Compiles to a no-op for configs whose
  ClientMeta is not LazyArrayClientMetaDataProvider<SampledAlloc-slot>,
  so the default snmalloc::Config sees zero cost.
- record_dealloc body splits into find_profile_slot (Config-specific
  pagemap walk) and clear_profile_slot (Config-agnostic atomic-CAS +
  SampledList::remove + NodePool::release), with the latter callable
  directly from tests.
- H1 hook installed at the dealloc waist in Allocator::dealloc(void*)
  (mem/corealloc.h:1025), gated by SNMALLOC_PROFILE. Fires before any
  existing dealloc logic so profile-side cleanup observes the live
  pagemap, and is itself safe under recursive entry via the per-thread
  ReentrancyGuard.
- record.h is intentionally lightweight; including commonconfig.h there
  would create a cycle (commonconfig -> mem/mem -> corealloc -> record).
  Instead corealloc.h forward-declares the template, and
  backend_helpers/backend_helpers.h pulls the full definition in once
  LazyArrayClientMetaDataProvider is visible.
- record_alloc stays a stub: full alloc-side wiring lands in Phase 3.3.
- New test src/test/func/profile_record/profile_record.cc covers the
  null-slot no-op, populated-slot drain, multi-threaded double-free
  CAS race, default-config compile-time no-op, ReentrancyGuard
  short-circuit and end-to-end libc::malloc/libc::free crash-freedom.
- Default (OFF) build remains byte-identical to pre-Phase-3.1: the H1
  call site is behind #ifdef SNMALLOC_PROFILE, and SNMALLOC_PROFILE=ON
  with the default NoClientMetaDataProvider Config inlines the
  if-constexpr branch into nothing (verified: same binary size for the
  default-config test executable in OFF vs ON builds).
- All existing tests pass under both -DSNMALLOC_PROFILE=OFF (88/88) and
  -DSNMALLOC_PROFILE=ON (88/88), -fast and -check variants.
- Add SNMALLOC_PROFILE-gated record_dealloc<Config>(msg) hook in
  Allocator::handle_dealloc_remote, just before the splice via
  dealloc_local_objects_fast on the destination thread. Catches the
  remote-ingest fast path -- the milestone-flagged critical free path
  for cross-thread frees.
- Reuses the Phase 3.1 record_dealloc / clear_profile_slot machinery
  unchanged; the atomic CAS in clear_profile_slot keeps H1 + H2
  idempotent w.r.t. the same pointer.
- Header surface unchanged; the SNMALLOC_PROFILE off build is
  byte-identical to pre-Phase-3.2.
- New func test profile_remote_dealloc covers: single-threaded
  baseline, H1/H2 sequential clear idempotence, a 4 producer + 4
  consumer cross-thread alloc/free stress test, and the default-config
  compile-time no-op contract.
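
The idempotence of H1 + H2 on the same pointer comes down to only one caller winning the slot CAS. A minimal sketch, assuming a hypothetical `DemoNode` type and `demo_clear_slot` helper, with a counter standing in for SampledList::remove + NodePool::release:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct DemoNode
{
  int id;
};

// Hypothetical per-object slot clear. Returns true only for the caller that
// actually swapped a non-null node out of the slot.
inline bool demo_clear_slot(std::atomic<DemoNode*>& slot, std::atomic<size_t>* released)
{
  assert(released != nullptr);
  DemoNode* cur = slot.load(std::memory_order_acquire);
  while (cur != nullptr)
  {
    if (slot.compare_exchange_weak(
          cur, nullptr, std::memory_order_acq_rel, std::memory_order_acquire))
    {
      // Stand-in for the list removal and pool release done by the winner.
      released->fetch_add(1, std::memory_order_relaxed);
      return true;
    }
    // On failure `cur` was reloaded; loop exits if another hook cleared it.
  }
  return false;
}
```

A second hook firing for the same pointer sees a null slot and returns without touching the list or the pool.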
- Hook the user-facing snmalloc::alloc(size_t), alloc<size>(),
  alloc(smallsizeclass_t), and alloc_aligned wrappers in
  global/globalalloc.h with a profile::record_alloc<Config>(...) call
  gated on #ifdef SNMALLOC_PROFILE.  One hook per wrapper covers all
  public alloc entry points -- malloc/calloc/realloc, operator new,
  jemalloc/Rust shims, BSD valloc/pvalloc, NetBSD reallocarr -- since
  they all funnel through these chokepoints.
- Wire the record_alloc body in profile/record.h: tick the per-thread
  Sampler (which already publishes the SampledAlloc on the global
  list), then install the node into the per-object profile slot via
  a new find_or_install_profile_slot<Config>(p) helper that forces
  the lazy backing array into existence on first sight.  Compile-time
  no-op when the config does not carry the lazy ProfileSlot provider.
- Add src/test/func/profile_e2e/profile_e2e.cc: an end-to-end test
  that defines its own profile-enabled Config via
  SNMALLOC_PROVIDE_OWN_CONFIG and exercises the full alloc + free
  pipeline.  Covers single-threaded rate accuracy, multi-threaded
  drain-to-empty, mixed entry-point coverage (malloc / calloc /
  aligned_alloc), and the rate=0 sampling-disabled fast path.

Default-Config build is byte-identical to Phase 3.2: every new code
path is gated on either #ifdef SNMALLOC_PROFILE or
config_has_profile_slot_v, so OFF builds and default-Config ON builds
see no behaviour change.
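
The compile-time gating can be illustrated with a detection trait and `if constexpr`. Everything here is a hypothetical sketch: `demo_has_profile_slot_v` mimics the role of config_has_profile_slot_v but is not its definition, and the config and meta types are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical client-meta and config types.
struct DemoNoMeta
{};
struct DemoProfileMeta
{
  static constexpr bool has_profile_slot = true;
};
struct DemoPlainConfig
{
  using ClientMeta = DemoNoMeta;
};
struct DemoProfiledConfig
{
  using ClientMeta = DemoProfileMeta;
};

// Detection trait: true only when Config::ClientMeta advertises a profile slot.
template<typename Config, typename = void>
struct demo_has_profile_slot : std::false_type
{};
template<typename Config>
struct demo_has_profile_slot<
  Config,
  std::void_t<decltype(Config::ClientMeta::has_profile_slot)>> : std::true_type
{};
template<typename Config>
inline constexpr bool demo_has_profile_slot_v = demo_has_profile_slot<Config>::value;

static_assert(!demo_has_profile_slot_v<DemoPlainConfig>, "plain config has no slot");
static_assert(demo_has_profile_slot_v<DemoProfiledConfig>, "profiled config has a slot");

inline size_t demo_hook_calls = 0;

// Hypothetical hook: the body is discarded at compile time for plain configs.
template<typename Config>
inline void demo_record_alloc(void* p, size_t size)
{
  if constexpr (demo_has_profile_slot_v<Config>)
  {
    assert(p != nullptr);
    (void)size;
    ++demo_hook_calls; // stand-in for ticking the sampler
  }
  else
  {
    (void)p;
    (void)size;
  }
}
```

For the plain config the instantiated function has an empty body, which is why an inlined call can disappear entirely.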
)

- Install H3 heap-profile hook in Allocator::dealloc_remote on the
  SecondaryAllocator branch (catches GWP-ASan / non-snmalloc pointers
  that bypass the snmalloc-owned pagemap).
- Install H4 heap-profile hook in Allocator::dealloc_remote_slow's
  lazy-init recursion lambda, immediately before the recursive
  a->dealloc(p). Pairs with H1 to keep the recursion-guard tight.
- Both hooks live entirely under #ifdef SNMALLOC_PROFILE; default
  Config OFF build is byte-identical to Phase 3.3.
- Both hooks reuse profile::record_dealloc<Config>; idempotence is
  guaranteed by the CAS in clear_profile_slot and the per-thread
  ReentrancyGuard. No new state machines, no new allocations on the
  free path.
- New test: src/test/func/profile_h3_h4/profile_h3_h4.cc.
  Triple- and quadruple-clear idempotence, nullptr robustness,
  fresh-thread remote-free stress, default-Config compile-time no-op.
- New test: src/test/func/profile_integration/profile_integration.cc.
  16 threads x 100k allocs x varied size ladder, ~50/50 same-thread
  vs cross-thread free, plus a one-producer-many-consumers stress.
  Asserts sample count within 6 sigma of Poisson expectation,
  post-free leak <= documented tolerance (<= 1% + 4), and that the
  global SampledList drains to zero. Sampling rate (128 KiB) sized
  so expected samples stay well below the NodePool capacity ceiling.
- Wires ticket 86ahrfx9g (multi-threaded alloc + cross-thread dealloc
  integration stress).
- Observed teardown-straggler ratio improves from ~1/1250 in the
  Phase 3.3 8-thread e2e test to ~1/4000 in the new 16-thread
  integration test, a ~3x reduction.
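
The ReentrancyGuard that all four hooks rely on can be sketched as a POD thread-local flag plus an RAII setter. The names `demo_in_sampler`, `DemoReentrancyGuard`, and `demo_record` are hypothetical, and the recursion only simulates a transitive allocator call.

```cpp
#include <cassert>
#include <cstdint>

// POD thread-local flag: zero-initialised, with no constructor or destructor to register.
inline thread_local uint8_t demo_in_sampler = 0;

inline bool demo_sampler_reentered()
{
  return demo_in_sampler != 0;
}

// RAII guard: marks the current thread as being inside the profiler slow path.
class DemoReentrancyGuard
{
public:
  DemoReentrancyGuard()
  {
    assert(demo_in_sampler == 0); // callers check the flag before constructing
    demo_in_sampler = 1;
  }
  ~DemoReentrancyGuard()
  {
    demo_in_sampler = 0;
  }
  DemoReentrancyGuard(const DemoReentrancyGuard&) = delete;
  DemoReentrancyGuard& operator=(const DemoReentrancyGuard&) = delete;
};

inline thread_local int demo_slow_path_runs = 0;

// Hypothetical hook: short-circuits when entered again on the same thread.
inline bool demo_record(int depth)
{
  if (demo_sampler_reentered())
    return false;
  DemoReentrancyGuard guard;
  ++demo_slow_path_runs;
  if (depth > 0)
    demo_record(depth - 1); // simulates an allocation made by the slow path
  return true;
}
```

However deep the simulated recursion goes, the slow path runs once per outermost call.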
- Expose `sn_rust_profile_*` C ABI surface in src/snmalloc/override/rust.cc:
  supported, set_sampling_rate, get_sampling_rate, snapshot_begin,
  snapshot_count, snapshot_get, snapshot_end. New header
  src/snmalloc/override/rust_profile.h defines SnRustProfileRawSample
  (alloc_ptr, requested_size, allocated_size, weight, stack_depth, stack)
  with SNMALLOC_PROFILE_STACK_FRAMES matching the Phase 2 sampled_alloc.h
  constant.
- When SNMALLOC_PROFILE=OFF every export except `supported` is a stub
  returning zero / nullptr / false. Symbols are always linkable so the
  Rust crate's FFI does not need #[cfg] gating in extern blocks.
- When SNMALLOC_PROFILE=ON the bodies delegate to existing Phase 2 / 3
  machinery (Sampler::{set,get}_sampling_rate, SampledList::snapshot,
  SampledList::debug_count). No new C++ infrastructure introduced.
- Add `profiling` cargo feature to snmalloc-sys and the higher-level
  snmalloc-rs crate. The feature passes SNMALLOC_PROFILE=ON to cmake
  (or SNMALLOC_PROFILE=1 to the cc backend) and exposes
  SnRustProfileRawSample plus the sn_rust_profile_* extern declarations
  in snmalloc-sys/src/lib.rs.
- Cover the FFI surface with a small Rust smoke-test module
  (#[cfg(feature = "profiling")]) that exercises supported(),
  the sampling-rate roundtrip, and the snapshot lifecycle.
- No Rust-side safe wrapper yet -- that is Phase 4.1.

Verified:
- ctest --test-dir build (SNMALLOC_PROFILE=OFF): 96/96 passed.
- ctest --test-dir build-profile (SNMALLOC_PROFILE=ON): 96/96 passed.
- cargo test --all (no profiling feature): 12 passed across all crates.
- cargo test --all --features profiling: 15 passed across all crates
  (4 baseline snmalloc-sys + 3 new profile tests + everything else).
….1) (#11)

- New snmalloc-rs/src/profile.rs: idiomatic safe wrapper over the
  sn_rust_profile_* FFI surface from Phase 4.0.
- HeapProfile: owned, cloneable snapshot of live sampled allocations
  with len/is_empty/samples accessors plus u128 total_allocated_bytes
  and total_requested_bytes aggregators (saturating math, divide-by-
  zero-safe).
- BtSample: per-allocation record with alloc_ptr, requested_size,
  allocated_size, weight, and Vec<*const u8> stack frames.  Send +
  Sync via unsafe impls (raw pointers used opaquely, never deref'd).
- SnMalloc::snapshot / set_sampling_rate / sampling_rate /
  profiling_supported: thin methods on the existing global allocator
  type.  snapshot() uses an internal RawSnapshotGuard whose Drop
  releases the FFI handle even on panic mid-collection.
- snmalloc-sys/src/lib.rs: drop the #[cfg(feature = "profiling")]
  gate on the SnRustProfileRawSample struct and the
  sn_rust_profile_* extern block.  The C symbols are unconditional
  stubs when SNMALLOC_PROFILE is off, so the Rust bindings should be
  too -- this lets the safe wrapper present a uniform API in both
  feature-on and feature-off builds (empty profile, sampling_rate
  fixed at 0, profiling_supported() returns false).
- snmalloc-rs/src/lib.rs: expose the new profile module + re-export
  HeapProfile / BtSample.
- snmalloc-rs/tests/profile_snapshot.rs: integration tests covering
  feature-off quiescence (snapshot empty, rate fixed at 0,
  supported() == false), the sampling-rate round-trip when supported,
  and a #[ignore]'d live-sampling end-to-end test.
- The live-sampling test is ignored because the rust.cc shim is
  built with the default snmalloc::Config (NoClientMetaDataProvider),
  which makes config_has_profile_slot_v false and the alloc hook a
  compile-time no-op.  Wiring the Rust shim to use
  LazyArrayClientMetaDataProvider<ProfileSlot> is Phase 4.2 -- the
  Phase 4.1 ticket explicitly forbids modifying rust.cc /
  rust_profile.h.  See the ignore reason on live_sampling_run for
  the full path.
- All 12 snmalloc-rs unit tests, 4 (+ 1 ignored) integration tests,
  4 snmalloc-sys rust_tests, and the lib doc test pass with both
  feature off and feature on.  All 74 C++ ctest cases continue to
  pass in both SNMALLOC_PROFILE=ON and OFF build dirs.
…test (Phase 4.2) (#12)

- src/snmalloc/override/rust.cc: when SNMALLOC_PROFILE is defined,
  predeclare snmalloc::Config as
    StandardConfigClientMeta<LazyArrayClientMetaDataProvider<
      std::atomic<profile::SampledAlloc*>>>
  and define SNMALLOC_PROVIDE_OWN_CONFIG before the snmalloc.h /
  malloc.cc includes.  This flips config_has_profile_slot_v<Config>
  to true so the alloc/dealloc hooks in profile/record.h emit real
  samples on the rust shim's allocation paths.  When SNMALLOC_PROFILE
  is undefined the file is byte-identical to its pre-Phase-4.2 form.
- snmalloc-rs/tests/profile_snapshot.rs: drop the Phase-4.2 #[ignore]
  on live_sampling_run; the test now exercises the full pipeline,
  asserts the live snapshot count lies within a 6-sigma Poisson
  envelope of the expected sample count, and verifies the snapshot
  drains after every allocation is freed.  Header comment updated to
  match the new wiring.
- Verified: C++ 96/96 ctest pass with SNMALLOC_PROFILE=OFF; 96/96
  pass with SNMALLOC_PROFILE=ON.  Rust 12+1+5+4 tests pass with the
  profiling feature off; with the feature on the same suite plus
  three snmalloc-sys profile tests (totalling 12+1+5+7) pass and
  live_sampling_run observes ~1574 samples (expected ~1562, +/-6
  sigma window [~1325, ~1800]) and drains to 0 post-free.
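
The 6-sigma window quoted above follows from treating the sample count as Poisson with mean lambda = total bytes / sampling rate. A small sketch of that arithmetic, assuming a workload of 6,400,000 bytes at rate 4096 (which gives lambda = 1562.5, consistent with the ~1562 above); `DemoWindow`, `demo_poisson_window`, and `demo_in_window` are hypothetical helpers, not part of the test suite:

```cpp
#include <cassert>
#include <cmath>

struct DemoWindow
{
  double lambda; // expected sample count
  double lo;     // lambda - k * sqrt(lambda)
  double hi;     // lambda + k * sqrt(lambda)
};

// For a Poisson count the variance equals the mean, so sigma = sqrt(lambda).
inline DemoWindow demo_poisson_window(double total_bytes, double rate, double sigmas)
{
  assert(rate > 0.0);
  double lambda = total_bytes / rate;
  double sd = std::sqrt(lambda);
  return DemoWindow{lambda, lambda - sigmas * sd, lambda + sigmas * sd};
}

inline bool demo_in_window(const DemoWindow& w, double observed)
{
  return observed >= w.lo && observed <= w.hi;
}
```

Calling `demo_poisson_window(100000.0 * 64.0, 4096.0, 6.0)` gives a window of roughly 1325 to 1800, which contains the observed 1574.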
…e 4.3) (#13)

- snmalloc-rs/src/profile.rs: new Weight enum (Requested / Allocated;
  Default = Allocated, matching the default UI view documented in
  profile-weight.md) and HeapProfile::write_flamegraph /
  write_flamegraph_with methods.  Output is Brendan Gregg's collapsed /
  folded-stack format: one line per unique stack as
  "<frame_root>;<frame_mid>;<frame_leaf> <weight>", root-first, each
  frame rendered as a zero-padded 16-hex code pointer (0x000000...).
  Identical stacks collapse into a single line with summed weights via
  a BTreeMap keyed on the pre-rendered hex form, which gives
  deterministic lex-ordered output for golden tests and version-control
  diffs.  No new dependencies -- uses std::io::Write only (gated by
  extern crate std on this no_std crate).
- snmalloc-rs/src/lib.rs: re-export the new Weight enum alongside
  HeapProfile / BtSample.
- snmalloc-rs/tests/profile_accuracy.rs: new integration suite.
  * accuracy_single_threaded -- 100_000 x 64B allocations at rate
    4096 must yield a sample count inside a 6-sigma Poisson envelope
    of lambda = 1562.5, and sum(weight) must match 6.4 MB (6,400,000
    bytes) to within 5%.
  * accuracy_multi_threaded -- 8 threads x 10_000 x 64B at the same
    rate; expected ~1250 samples +/- 6 sigma.  Documents the known
    O(1/N) per-thread teardown straggler from Phase 3.4 inline.
  * flamegraph_correctness_over_live_snapshot -- captures a snapshot
    with >= 100 samples, calls write_flamegraph into a Vec<u8>,
    parses every line as "<hex-stack> <weight>", asserts each frame
    is "0x" + 16 hex digits, asserts no stack appears twice (the
    collapse step worked), and asserts the sum of folded weights
    equals HeapProfile::total_allocated_bytes under the default
    projection.  A second pass with Weight::Requested verifies the
    explicit projection matches total_requested_bytes.
  * flamegraph_empty_snapshot_writes_nothing -- the no-op-safe
    contract for the profiling-feature-off build.
  All four tests acquire a process-wide accuracy_lock() so they do
  not race against each other for the global sampler state when
  cargo runs them in parallel, and each subtracts a baseline snapshot
  taken with sampling momentarily disabled so any leftover samples
  from sibling tests in the same binary do not perturb the Poisson
  assertions.  Tests are no-op on the profiling-feature-off build.
- Speedscope JSON export deferred to Phase 4.5+: speedscope already
  imports the folded format directly, and a faithful JSON profile
  schema is better layered on top of the symbolicator that lands in
  4.5.  Documented in the write_flamegraph rustdoc.
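
The folded-stack collapse described above can be sketched in C++ with an ordered map playing the role of the BTreeMap. This is an illustration, not the Rust implementation: `DemoSample`, `demo_render_frame`, and `demo_write_folded` are hypothetical, stacks are assumed to be stored leaf-first, and samples with an empty stack are skipped as a simplification.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct DemoSample
{
  std::vector<uintptr_t> stack; // assumption: leaf-first, as captured
  uint64_t weight;
};

// Renders a code pointer as "0x" followed by 16 zero-padded hex digits.
inline std::string demo_render_frame(uintptr_t pc)
{
  char buf[2 + 16 + 1];
  int n = std::snprintf(buf, sizeof(buf), "0x%016llx", static_cast<unsigned long long>(pc));
  assert(n == 18);
  (void)n;
  return std::string(buf);
}

// Collapses identical stacks into one "root;...;leaf weight" line each.
inline std::string demo_write_folded(const std::vector<DemoSample>& samples)
{
  std::map<std::string, uint64_t> folded; // ordered, so output is deterministic
  for (const auto& s : samples)
  {
    if (s.stack.empty())
      continue; // sketch simplification: nothing to key on
    std::string key;
    for (auto it = s.stack.rbegin(); it != s.stack.rend(); ++it)
    {
      if (!key.empty())
        key += ';';
      key += demo_render_frame(*it); // root-first order
    }
    folded[key] += s.weight;
  }
  std::string out;
  for (const auto& entry : folded)
    out += entry.first + ' ' + std::to_string(entry.second) + '\n';
  return out;
}
```

Keying on the pre-rendered string means the map's ordering is exactly the lexical order of the output lines.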

Verified:
- ctest --test-dir build (SNMALLOC_PROFILE=OFF): 96/96 passed.
- ctest --test-dir build-profile (SNMALLOC_PROFILE=ON): 96/96 passed.
- cargo test --all (no profiling feature): all crates green, 4
  profile_accuracy tests no-op pass, profile.rs unit tests including
  6 new flamegraph + Weight tests pass.
- cargo test --all --features profiling: all crates green, all 4
  profile_accuracy tests pass with live sampling.
- cargo doc --features profiling --no-deps: clean build, all new
  rustdoc renders.
- Add optional `symbolicate` Cargo feature that pulls in the
  `backtrace` crate as a dependency only when enabled.
- Add `ResolvedFrame { address, name, file, line }` for the
  per-frame metadata returned by the symbolicator.
- Add `HeapProfile::symbolize()` returning
  `HashMap<*const u8, ResolvedFrame>` keyed by raw frame addresses.
  Each unique frame is resolved once via `backtrace::resolve`.
- Add `HeapProfile::write_flamegraph_symbolized()` that renders the
  same folded-stack format as `write_flamegraph` but substitutes
  resolved function names for hex code pointers, falling back to
  the hex rendering when a frame has no resolved name.  `;` and
  space in resolved names are sanitised to `_` so the folded format
  stays unambiguous.
- Sum of weights from `write_flamegraph_symbolized` equals
  `total_allocated_bytes`, matching `write_flamegraph` under the
  documented default projection.
- Unit tests: smoke-test symbol resolution via a `#[inline(never)]`
  probe that captures its own backtrace, plus empty-profile,
  unresolved-frame, and hex-fallback contracts.
- Integration test (`tests/profile_symbolize.rs`): collect a live
  snapshot at the same rate/workload as `profile_accuracy`, verify
  >=50% of unique frames resolve to a non-None name, and verify
  `write_flamegraph_symbolized` parses cleanly, has no duplicate
  stacks, and preserves total weight.
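
The folded rendering described in the bullets above (resolved names with `;` and space replaced by `_`, hex fallback for unresolved frames, duplicate stacks collapsed) can be sketched roughly as follows. This is an illustrative, stdlib-only sketch and not the crate's code: `render_frame` and `fold` are hypothetical names, and the frame order inside a stack is an assumption.

```rust
use std::collections::BTreeMap;

/// Render one frame: the resolved name with `;` and ' ' replaced by '_',
/// or "0x" + 16 hex digits when no name is available (hex fallback).
/// Hypothetical helper, not part of the snmalloc-rs API.
fn render_frame(addr: usize, name: Option<&str>) -> String {
    match name {
        Some(n) if !n.is_empty() => n
            .chars()
            .map(|c| if c == ';' || c == ' ' { '_' } else { c })
            .collect(),
        _ => format!("0x{:016x}", addr),
    }
}

/// Collapse (stack, weight) samples into "frame;frame weight" lines.
/// The BTreeMap merges identical stacks, so no stack appears twice and
/// the sum of emitted weights equals the sum of input weights.
fn fold(samples: &[(Vec<(usize, Option<&str>)>, u64)]) -> Vec<String> {
    let mut folded: BTreeMap<String, u64> = BTreeMap::new();
    for (stack, weight) in samples {
        let key = stack
            .iter()
            .map(|(addr, name)| render_frame(*addr, *name))
            .collect::<Vec<_>>()
            .join(";");
        *folded.entry(key).or_insert(0) += weight;
    }
    folded.into_iter().map(|(k, w)| format!("{k} {w}")).collect()
}
```

Keying the map on the already-sanitised string means two stacks that differ only in characters mapped to `_` would merge; whether the real implementation collapses before or after sanitising is not stated in this PR.
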
- Add snmalloc-rs/src/config.rs introducing ProfileConfig (a typed,
  Default-impled struct of sampling_rate + enable_from_env) along with
  SnMalloc::configure_profiling and SnMalloc::init_profiling_from_env
  so callers don't have to wire set_sampling_rate by hand after
  installing the global allocator.
- Honour SNMALLOC_PROFILE_RATE (parseable integer wins, including 0)
  and SNMALLOC_PROFILE_ENABLE (truthy aliases 1/true/yes,
  case-insensitive, whitespace trimmed) when init_profiling_from_env
  is called; the resolver is read-only, panic-free, and a no-op when
  neither var is set.  Default rate when ENABLE=1 with no RATE is
  524288 bytes (512 KiB).
- No #[ctor] / static init -- explicit call from main is documented
  as cheaper and easier to reason about than allocator-vs-ctor
  ordering games.
- Re-export ProfileConfig + ENV_PROFILE_RATE + ENV_PROFILE_ENABLE
  from the crate root.
- Unit tests in src/config.rs cover Default, with_sampling_rate,
  configure_profiling round-trip + idempotency + zero-disables, and
  parse_bool_env recognition.
- New integration test tests/profile_runtime_config.rs serialises
  env-var manipulation with a local OnceLock<Mutex<()>> and a Drop
  guard that restores both env vars and the global sampling rate,
  so it doesn't race against profile_accuracy.rs sibling tests.
- All tests pass under both cargo test and cargo test --features
  profiling; cargo doc --features profiling --no-deps is warning-free.
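
As a rough illustration of the env-var resolution rules listed above, the sketch below takes the two raw values as plain `Option<&str>` so it stays read-only and panic-free. `resolve_rate` and `DEFAULT_RATE` are hypothetical names; the behaviour for an unparseable rate (fall through to the enable flag) and the trimming of the rate value are assumptions, not something this PR states.

```rust
/// Default sampling rate when profiling is enabled without an explicit
/// rate: 524288 bytes (512 KiB), as quoted in the commit message.
const DEFAULT_RATE: usize = 524_288;

/// Truthy aliases: 1 / true / yes, case-insensitive, whitespace trimmed.
fn parse_bool_env(raw: &str) -> bool {
    matches!(raw.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes")
}

/// Resolve the sampling rate from the raw values of SNMALLOC_PROFILE_RATE
/// and SNMALLOC_PROFILE_ENABLE. `None` means "leave the sampler untouched".
/// Hypothetical helper; assumes an unparseable rate is simply ignored.
fn resolve_rate(rate: Option<&str>, enable: Option<&str>) -> Option<usize> {
    // A parseable integer wins, including 0 (which disables sampling).
    if let Some(r) = rate.and_then(|s| s.trim().parse::<usize>().ok()) {
        return Some(r);
    }
    match enable {
        Some(e) if parse_bool_env(e) => Some(DEFAULT_RATE),
        _ => None,
    }
}
```

A caller would read both variables with `std::env::var(..).ok()` and pass `as_deref()` of each into `resolve_rate`.
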
…ase 4.6) (#16)

- snmalloc-rs/Cargo.toml: add `inferno = "0.11"` as a dev-dependency
  (test-only; never appears in the published crate's transitive deps).
  Version pin documented inline -- 0.11 keeps MSRV aligned with the
  rest of the workspace, while later 0.12.x bumps `rust-version` to
  1.71 and pulls in additional crossbeam transitive deps we don't
  otherwise need.
- snmalloc-rs/tests/profile_viewer_roundtrip.rs: new integration suite
  asserting that the folded-stack output emitted by Phase 4.3's
  `HeapProfile::write_flamegraph` is consumable by two real viewers
  in the Rust profiling ecosystem.  Test-only -- no public API on
  `HeapProfile` / `SnMalloc` is added, and `src/profile.rs` is not
  touched.
  * inferno_roundtrip -- captures a >=50-sample snapshot, writes its
    folded form into a `Vec<u8>`, hands it to
    `inferno::flamegraph::from_reader` with `Options::default()`, and
    asserts the rendered SVG contains a `<svg` root and at least one
    `<g` stack-frame group node.  Confirms the round-trip from folded
    bytes to SVG works without any post-processing.
  * speedscope_folded_import -- re-implements the regex
    `^([^\s]+) (\d+)$` that speedscope's "Brendan Gregg's collapsed
    stack format" importer uses (per its wiki) and asserts >=95% of
    folded lines match.  speedscope itself runs in a browser/wasm
    context we can't drive in CI, so the conformance check is the
    next best thing.
  * round_trip_weight_invariance -- regression guard for the Phase
    4.3 BTreeMap collapse step: sum of folded weights over a
    real-workload snapshot must equal
    `HeapProfile::total_allocated_bytes` exactly.
  * empty_snapshot_viewer_safety -- runs in both feature
    configurations (no `#[cfg(feature = "profiling")]` gate).
    Confirms `write_flamegraph` on an empty profile writes zero
    bytes and that inferno cleanly returns `Err` rather than
    panicking when handed the resulting empty stream.  Covers the
    OFF-build path where every snapshot is empty by construction.
- Workload calibration: 5_000 x 64-byte allocations at sampling
  rate 512 -> ~625 expected samples (well above the 50-sample floor
  Phase 4.6 requires).  Smaller than the 100k workload in
  profile_accuracy.rs to keep CPU contention low when `cargo test
  --all --features profiling` runs the two test binaries in
  parallel.  Workload-driving helpers live in a
  `#[cfg(feature = "profiling")]` module to avoid dead-code warnings
  on the OFF build.
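
The speedscope conformance check above boils down to matching each folded line against `^([^\s]+) (\d+)$`. A minimal stdlib-only sketch of the same check without a regex dependency might look like this; `is_collapsed_line` and `conformance` are hypothetical names, and `char::is_whitespace` is used as an approximation of the regex `\s` class.

```rust
/// True when `line` has the shape "<stack> <weight>": a non-empty stack
/// with no whitespace, one separating space, then ASCII digits only.
fn is_collapsed_line(line: &str) -> bool {
    match line.rsplit_once(' ') {
        Some((stack, weight)) => {
            !stack.is_empty()
                && !stack.chars().any(char::is_whitespace)
                && !weight.is_empty()
                && weight.bytes().all(|b| b.is_ascii_digit())
        }
        None => false,
    }
}

/// Fraction of lines in `folded` that match the collapsed-stack shape.
/// An empty input counts as fully conformant.
fn conformance(folded: &str) -> f64 {
    let total = folded.lines().count();
    if total == 0 {
        return 1.0;
    }
    let ok = folded.lines().filter(|l| is_collapsed_line(l)).count();
    ok as f64 / total as f64
}
```

A test in the style described above would then assert `conformance(&output) >= 0.95`.
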

Verified:
- cargo test --all (profiling OFF): all binaries green, including
  the new profile_viewer_roundtrip binary running just
  empty_snapshot_viewer_safety.
- cargo test --all --features profiling: stable across 5
  back-to-back runs; all 4 new tests pass, all pre-existing tests
  pass.
- cargo test --features profiling --test profile_viewer_roundtrip:
  4 passed, 0 failed.
- No new compiler warnings in either feature configuration.
- New AllocationSampleList primitive: fixed-K (K=4) atomic slot array of
  noexcept callbacks invoked once per sampled allocation.  Lock-free
  register/unregister via per-slot CAS; broadcast iterates with relaxed
  loads.  Documented chosen storage and the no-allocation handler contract.
- record_alloc now broadcasts the just-installed SampledAlloc to every
  registered handler, alloc-only (matches tcmalloc semantics).  Broadcast
  is wrapped in its own ReentrancyGuard so a handler that allocates
  short-circuits the sampler via the existing reentry check.
- C exports sn_rust_profile_streaming_{start,stop} gated by
  SNMALLOC_PROFILE; a single FFI user callback at a time is bridged
  through a noexcept shim that converts SampledAlloc to
  SnRustProfileRawSample.  Stubs preserve link-compatibility in the
  SNMALLOC_PROFILE=OFF build.
- rust_profile.h declares the new entry points and the streaming contract.
- New profile_streaming ctest covers per-sample fan-out, parity with the
  SampledList live count, unregister-stops-broadcast, multi-subscriber
  fan-out, slot-exhaustion rejection, and the OFF-build smoke arm.
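
To make the fixed-K slot array concrete, here is a small Rust analogue of the idea: K atomic slots, CAS-based registration, and a broadcast that walks the slots with relaxed loads. It is only a sketch under assumptions. The real primitive is C++ and takes a `SampledAlloc`; here the handler receives a bare size, `HandlerSlots` and the two demo handlers are hypothetical names, and unregistering by slot index with a swap is a simplification of the per-slot CAS the commit describes.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Handlers must not allocate; here they just receive the sampled size.
type Handler = fn(usize);
const K: usize = 4;

struct HandlerSlots {
    slots: [AtomicPtr<()>; K],
}

impl HandlerSlots {
    const fn new() -> Self {
        const EMPTY: AtomicPtr<()> = AtomicPtr::new(ptr::null_mut());
        HandlerSlots { slots: [EMPTY; K] }
    }

    /// Claim the first empty slot with a CAS; `None` when all K are taken.
    fn register(&self, h: Handler) -> Option<usize> {
        let p = h as *const () as *mut ();
        self.slots.iter().position(|s| {
            s.compare_exchange(ptr::null_mut(), p, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
        })
    }

    /// Clear a slot; returns false if it was already empty.
    fn unregister(&self, slot: usize) -> bool {
        !self.slots[slot].swap(ptr::null_mut(), Ordering::AcqRel).is_null()
    }

    /// Invoke every registered handler once for this sample.
    fn broadcast(&self, size: usize) {
        for s in &self.slots {
            let p = s.load(Ordering::Relaxed);
            if !p.is_null() {
                // SAFETY: non-null values are only stored by `register`,
                // which derives them from a valid `Handler`.
                let h: Handler = unsafe { std::mem::transmute::<*mut (), Handler>(p) };
                h(size);
            }
        }
    }
}

// Demo handlers (hypothetical): they only touch atomics, so they never allocate.
static HITS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);
fn count_hits(_size: usize) {
    HITS.fetch_add(1, Ordering::Relaxed);
}
fn sum_bytes(size: usize) {
    BYTES.fetch_add(size, Ordering::Relaxed);
}
```

A fixed array keeps both registration and broadcast allocation-free, which matters because the broadcast runs inside the allocator.
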
- New pub(crate) module snmalloc-rs/src/pprof.rs hand-rolls the
  protobuf3 wire format (varint + length-delimited) for the subset
  of Google's pprof Profile schema needed for snmalloc heap
  snapshots; no prost/flate2 dependencies added.
- HeapProfile::write_pprof emits two sample_type axes
  (alloc_objects/count, alloc_space/bytes) plus per-stack
  location/function chains; output is uncompressed (callers can
  wrap in GzEncoder if they want .pb.gz).
- Unsymbolicated frames render function name as 0x..hex.. with
  empty filename/line, mirroring write_flamegraph; symbolicated
  frames use names from HeapProfile::symbolize when available.
- Tests: 6 unit tests in src/pprof.rs (varint, empty profile,
  alloc_space-axis invariance under both Weight projections,
  function/location dedup, string-table slot-0 contract) +
  3 integration tests in tests/profile_pprof.rs gated on
  --features profiling (smoke, empty snapshot, total_weight ==
  total_allocated_bytes).
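
For readers unfamiliar with the wire format mentioned above, the two protobuf primitives involved (base-128 varints and length-delimited fields) are small enough to sketch directly. The helper names below are hypothetical and the sketch says nothing about how `src/pprof.rs` lays out the pprof `Profile` message itself.

```rust
/// Append `v` as a protobuf base-128 varint (little-endian 7-bit groups,
/// high bit set on every byte except the last).
fn put_varint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Append a varint-typed field: key = (field_number << 3) | wire type 0.
fn put_varint_field(field: u32, v: u64, out: &mut Vec<u8>) {
    put_varint((field as u64) << 3, out);
    put_varint(v, out);
}

/// Append a length-delimited field (strings, bytes, nested messages):
/// key = (field_number << 3) | wire type 2, then length, then payload.
fn put_bytes_field(field: u32, payload: &[u8], out: &mut Vec<u8>) {
    put_varint(((field as u64) << 3) | 2, out);
    put_varint(payload.len() as u64, out);
    out.extend_from_slice(payload);
}
```

Nested messages are written by encoding the child into its own `Vec<u8>` first and then passing it to `put_bytes_field`, since the length prefix must be known up front.
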
…rhead (#19)

- Phase 7.1: hoist bytes_until_sample into a dedicated alignas(64/128)
  SamplerHotState struct (128 bytes on Apple Silicon, 64 elsewhere) so the
  per-thread fast-path counter sits on its own cache line and cannot
  false-share with the colder Sampler tail (PRNG state, last_sample_,
  initialized_) or with concurrent dealloc slot-clear traffic.  Counter is
  the first member of the cache-aligned region (offset 0).  Adds a
  SNMALLOC_LIKELY annotation on the hot subtract+compare.
- Phase 7.3: new func test profile_overhead asserting
    a) sizeof(Config::PagemapEntry) is unchanged vs. an explicit
       StandardConfigClientMeta<NoClientMetaDataProvider> -- proves the
       lazy provider type is compiled in but contributes zero bytes when
       profiling is off.
    b) bytes_until_sample lives at offset 0 of the cache-aligned hot
       state (offsetof check).
    c) Runtime gate: 1M alloc/free pairs of size 32 under
       Sampler::set_sampling_rate(0) (off) and
       Sampler::set_sampling_rate(2^40) (on, never fires) -- assert
       ns/alloc ratio < 1.05, i.e. no branch-misprediction storm in the
       dealloc null-slot fast-path.
- Add snmalloc-sys extern "C" decls for sn_rust_profile_streaming_start
  / sn_rust_profile_streaming_stop, gated on the `profiling` feature.
- Introduce `snmalloc-rs::streaming` exposing `ProfilingSession`
  (RAII handle) plus a borrowed `StreamSample<'_>` view of the raw
  FFI sample.  Single-session-at-a-time semantics enforced through a
  process-global `Mutex<Option<Handler>>`; second `start()` returns
  `StreamingError::AlreadyActive`.
- Trampoline is a fixed `extern "C"` function that locks the slot,
  dispatches into the boxed `Fn` and catches panics so unwinds never
  cross the FFI boundary.  Handler bounds are `Send + Sync + 'static`.
- Drop unregisters from the C side, then clears the slot so a fresh
  `ProfilingSession::start` can succeed.
- Re-export `ProfilingSession`, `StreamSample`, `StreamingError`
  from the crate root under `#[cfg(feature = "profiling")]`.
- Add `tests/profile_streaming.rs` covering: smoke handler-invocation,
  double-start AlreadyActive recovery, drop-unregisters guarantee,
  and thread-safety under a concurrent allocator workload.
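
The single-session and panic-containment rules above can be illustrated without any FFI. The sketch below is an assumption-laden stand-in: `Session` plays the role of `ProfilingSession`, the handler receives a bare size instead of a `StreamSample`, and `trampoline` is called directly rather than from the C side.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Mutex;

type Handler = Box<dyn Fn(usize) + Send + Sync + 'static>;

/// Process-global slot: at most one active handler at a time.
static SLOT: Mutex<Option<Handler>> = Mutex::new(None);

#[derive(Debug, PartialEq)]
enum StreamingError {
    AlreadyActive,
}

/// RAII handle; dropping it clears the slot so a new session can start.
struct Session;

impl Session {
    fn start<F>(f: F) -> Result<Session, StreamingError>
    where
        F: Fn(usize) + Send + Sync + 'static,
    {
        let mut slot = SLOT.lock().unwrap();
        if slot.is_some() {
            return Err(StreamingError::AlreadyActive);
        }
        *slot = Some(Box::new(f));
        Ok(Session)
    }
}

impl Drop for Session {
    fn drop(&mut self) {
        // The real Drop would first unregister from the C side.
        *SLOT.lock().unwrap() = None;
    }
}

/// Stand-in for the fixed `extern "C"` trampoline: lock the slot,
/// dispatch, and swallow panics so they never unwind into the caller.
fn trampoline(size: usize) {
    let slot = SLOT.lock().unwrap();
    if let Some(h) = slot.as_ref() {
        let _ = catch_unwind(AssertUnwindSafe(|| h(size)));
    }
}
```

Because the panic is caught while the guard is still alive and the thread is no longer unwinding when the guard drops, the mutex is not poisoned by a panicking handler.
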
- New snmalloc-rs/tests/profile_pprof_roundtrip.rs (profiling-gated)
- `pprof_roundtrip_via_go_tool`: runs a small workload, writes the
  pprof bytes to a unique tempfile (no `tempfile` dep), and invokes
  `go tool pprof -raw <file>`.  Asserts exit 0 and that stdout
  contains a structural marker (`Samples:`, `sample_type`,
  `PeriodType`, or one of our axis names).
- `empty_snapshot_pprof_roundtrip`: same path but on a default
  `HeapProfile`; the metadata-only Profile must still parse.
- `skip_if_no_go` helper: probes `go version` and skips with an
  `eprintln!` when Go is not on PATH.  Keeps cargo test green on
  developer machines / CI images without a Go toolchain.
- No new dev-deps; stdlib only.  Tempfile path uses
  `temp_dir() + pid + SystemTime nanos`.
- Workload + process-wide mutex pattern mirrors profile_pprof.rs and
  profile_viewer_roundtrip.rs.
- benches/profile_bench.rs: three groups (small_allocs 32B,
  medium_allocs 4K, mixed 16..16384) x three variants
  (profile-off, profile-on-inactive at usize::MAX rate,
  profile-on-active at 512 KiB default rate). Hand-rolled main
  emits a stderr summary pointing at the ratio_idle metric used
  by CI to gate idle overhead at <= 5%.
- Cargo.toml: criterion 0.5 (no default features) as a dev-dep,
  [[bench]] entry with harness = false.
- benches/README.md: short doc on running, what ratio_idle means,
  why absolute numbers are host-specific.
- Add `profiling` job to rust.yml: cargo build/test --features profiling
  on ubuntu-latest, macos-14, macos-15 (release + debug, stable toolchain).
- Confirms main.yml already covers SNMALLOC_PROFILE=ON for ubuntu-24.04
  gcc/clang and macos-15 clang (added in Phase 3.0 + earlier macOS edit);
  no main.yml edits required.
- Restricted to Linux + macOS per task scope; Windows profile coverage
  can be added later if needed.
- 8 worker threads tight-loop alloc/free at sizes [16,64,256,1024,16384]
- 9th sampler thread snapshots SampledList every ~10ms for 5s
- exercises H1-H4 dealloc hooks + lock-free SampledList under churn
- TSan/ASan-clean by construction; sanitizer cmd lines documented inline
- SNMALLOC_PROFILE=OFF path collapses to a "skipped" stub
…#25)

- README.md: new H2 'Heap Profiling' section covering SNMALLOC_PROFILE
  CMake flag, default 524288-byte Poisson sampling rate, C ABI exports,
  pointer to the Rust crate, supported output formats (folded
  flamegraph + pprof), and the <1% overhead claim citing the Phase 7
  bench suite.
- snmalloc-rs/README.md: extended with a 'Heap Profiling' section
  documenting the profiling and symbolicate Cargo features, snapshot +
  flamegraph quick start, streaming ProfilingSession, env-var-driven
  init_profiling_from_env, pprof output via write_pprof, symbolicated
  flamegraphs, and the graceful feature-off fallbacks.
- All Rust code samples spot-checked against the actual public surface
  in snmalloc-rs/src/{lib,profile,config,streaming}.rs.
- Crate-level //! Heap Profiling section with end-to-end snapshot + flamegraph example
- HeapProfile struct / samples() / total_allocated_bytes() examples
- write_flamegraph and write_pprof File / Vec<u8> examples (no_run)
- Weight enum example showing Allocated vs Requested
- ProfilingSession::start example with shared atomic counter + RAII drop
- StreamSample accessor example covering alloc_ptr / requested_size /
  allocated_size / weight / stack
- SnMalloc::configure_profiling and init_profiling_from_env examples
- All examples compile under both --features profiling and the
  default build; cargo test --doc passes 10/10 (default) and 12/12
  (profiling feature on)
- Replace the hard 5% bound on sum(weight) with the derived 6-sigma
  envelope of the Poisson unbiased-sum estimator (Var ~ N*SIZE*RATE).
  At the chosen constants (N=100_000, SIZE=64, RATE=4096) the old 5%
  bound was only ~1.97 sigma, giving a ~5% per-run flake rate under
  sibling cargo-test CPU contention.  The new window is
  [5_428_293, 7_371_707] bytes around the 6_400_000 expected.
- Verified by running the test 50x in a tight loop: 0 failures.
- Ticket: 86aj0h83a.
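
The arithmetic behind the switch from a 5% bound to a 6-sigma envelope can be reproduced with the approximation quoted above, Var ~ N*SIZE*RATE. This is a back-of-the-envelope sketch, not the test's code; the function names are hypothetical, and the exact window in the test may use a slightly different variance expression than this approximation.

```rust
// Constants quoted in the commit message.
const N: f64 = 100_000.0;
const SIZE: f64 = 64.0;
const RATE: f64 = 4096.0;

/// Expected sum of sample weights: every allocated byte counted once.
fn expected_bytes() -> f64 {
    N * SIZE
}

/// Standard deviation under the approximation Var ~ N * SIZE * RATE.
fn sigma() -> f64 {
    (N * SIZE * RATE).sqrt()
}

/// How many standard deviations a relative bound (e.g. 0.05) spans.
fn bound_in_sigmas(fraction: f64) -> f64 {
    fraction * expected_bytes() / sigma()
}

/// Symmetric k-sigma acceptance window around the expected value.
fn envelope(k: f64) -> (f64, f64) {
    let half = k * sigma();
    (expected_bytes() - half, expected_bytes() + half)
}
```

With these constants `bound_in_sigmas(0.05)` comes out just under 2, which is why a hard 5% bound fails a few percent of runs, and `envelope(6.0)` lands within a few hundred bytes of the window quoted above.
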
- Adds two ubuntu-24.04 clang Debug matrix legs to the existing
  ubuntu job in .github/workflows/main.yml so the heap-profiling
  code paths exercised by perf-profile_stress and the func-profile_*
  suite are run under ThreadSanitizer and AddressSanitizer.
- Both legs configure -DSNMALLOC_PROFILE=ON and the project's
  existing SNMALLOC_SANITIZER cmake option (=thread / =address)
  instead of raw CMAKE_CXX_FLAGS=-fsanitize=...; this is the
  idiomatic mechanism already used by the existing "TSan + UBSan"
  matrix entries (CMakeLists.txt:73-75, 580-606, 668-672) and
  correctly wires -fsanitize through to test-target compile and
  link lines plus the SNMALLOC_THREAD_SANITIZER_ENABLED define
  the codebase guards on.
- The TSan leg installs libc++-dev and uses -stdlib=libc++ to
  match the existing TSan + UBSan legs (libstdc++ on Ubuntu is
  not TSan-instrumented).  The ASan leg uses the default
  libstdc++ runtime, which is ASan-compatible.
- Both legs pass `-R profile_` via test-extra-args so ctest runs
  only the profile suite (perf-profile_stress-{fast,check} +
  func-profile_*).  This bounds sanitizer overhead within the
  CI time budget while still exercising the new snapshot-under-
  churn workload from PR #24.
- Local validation: configured + built + ran perf-profile_stress-fast
  on darwin-arm64 with -DSNMALLOC_SANITIZER=address; the fast
  variant ran ~5s under ASan with no diagnostics.  TSan was not
  validated locally because the macOS toolchain available here
  does not ship a TSan-instrumented libc++; relying on the
  GitHub ubuntu-24.04 runner for that leg as called out in the
  ticket.
- New HeapProfile::write_pprof_gz<W: Write>(&mut self, w, weight) wraps
  the uncompressed write_pprof in flate2::write::GzEncoder so callers
  can produce the .pb.gz encoding accepted natively by Pyroscope,
  Polar Signals Cloud, Parca, Speedscope, and Datadog continuous
  profiler, as well as `go tool pprof`.
- flate2 added as an optional dep gated by the existing `profiling`
  Cargo feature; deliberately not a separate feature, since gzipped
  pprof is the dominant on-the-wire encoding and splitting it off
  would multiply the build matrix without a meaningful payoff.
- Three new integration tests in tests/profile_pprof_gz.rs covering
  the gzip-magic prefix, byte-for-byte round-trip equivalence with
  write_pprof through flate2::read::GzDecoder, and empty-snapshot
  totality.

ClickUp ticket: 86aj0h8af
…ly supported (#29)

* Publish heap-profiling benchmark results (86aj0h88j)

- Run snmalloc-rs/benches/profile_bench.rs end-to-end with --features
  profiling on Apple M4 Pro / macOS 26.3.1; capture mean / CI /
  median / stddev from target/criterion/*/new/estimates.json.
- New docs/heap-profiling-benchmarks.md table-formats the raw numbers
  for the small_allocs / medium_allocs / mixed groups across the three
  variants (profile-off, profile-on-inactive, profile-on-active).
- Compute ratio_idle and ratio_active per group; averages are ~1.024
  in both configurations, max ratio is 1.0493 on
  medium_allocs/profile-on-inactive. All groups stay inside the
  bench harness's documented <=1.05 acceptance band.
- Document the gap vs the existing "<1% overhead" README claim: small
  allocs support it (in noise), but medium and mixed land at ~3-5%.
  Recommend softening the README phrasing in a follow-up PR.
- No groups hit the 20-minute time budget; full sweep ~85s wall-clock.

* Link perf-regression ticket; keep README <1% claim as target

- Replace 'soften README claim' recommendation with link to
  ClickUp ticket 86aj0hfmc that drives medium/mixed under 1%
- Keep reproduction caveats (Linux pinning, larger sample_size)
- Per user direction: target stays; gap is a perf-regression
  follow-up, not a docs change
…6aj0hfmc) (#31)

- src/snmalloc/profile/sampler.h: hoist the per-thread `sampler_reentered()`
  check from `Sampler::record_alloc` into `record_alloc_slow`. The hot
  countdown is now a single TLS decrement plus a signed compare; the
  reentrancy check only runs on allocations that already pay for a
  slow-path transition, roughly one per 512 KiB allocated at the
  default rate. Sample weighting is unchanged -- the
  `rate - hot_.bytes_until_sample + requested_size` formula already
  absorbs the overshoot when the counter ticks negative under re-entry.
- src/snmalloc/profile/record.h: reorder `record_dealloc<Config>` so the
  cheap slab-metadata probe and atomic-slot peek run before the
  `ReentrancyGuard` is constructed. The common-case (object on a slab
  with no installed lazy backing, or slab installed but specific object
  never sampled) now skips the TLS store-store-load round-trip from the
  guard.
- docs/heap-profiling-benchmarks.md: re-publish bench numbers after the
  fix. Idle ratios dropped from a max of 1.0493 to 1.0128 on this host,
  with two of three groups under 1.01. Documented the cross-run bimodal
  variance (20-80% on individual variants between back-to-back runs)
  that prevents this harness on this host from credibly resolving the
  remaining <3% gap on mixed/active.
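
The shape of the hoisted re-entrancy check can be shown with a toy countdown sampler. This is a Rust illustration of a C++ change, under loud assumptions: the struct and field names are hypothetical, the counter is reset by a fixed interval rather than a Poisson (exponential) draw, and sample weighting is omitted entirely.

```rust
/// Toy per-thread sampler: a byte countdown plus a re-entrancy flag.
struct Sampler {
    rate: i64,
    bytes_until_sample: i64,
    in_sampler: bool,
    samples: u64,
}

impl Sampler {
    fn new(rate: i64) -> Self {
        Sampler { rate, bytes_until_sample: rate, in_sampler: false, samples: 0 }
    }

    /// Hot path: one subtract and one signed compare, nothing else.
    #[inline(always)]
    fn record_alloc(&mut self, size: usize) -> bool {
        self.bytes_until_sample -= size as i64;
        if self.bytes_until_sample > 0 {
            return false;
        }
        self.record_alloc_slow()
    }

    /// Slow path: the re-entrancy check lives here, so it only runs
    /// when the countdown has already expired.
    #[cold]
    #[inline(never)]
    fn record_alloc_slow(&mut self) -> bool {
        if self.in_sampler {
            // Re-entered from inside the sampler: skip, leave the
            // counter expired so the next outer allocation samples.
            return false;
        }
        self.in_sampler = true;
        // Fixed-interval reset (assumption); the real sampler draws
        // the next interval from an exponential distribution.
        self.bytes_until_sample += self.rate;
        self.samples += 1;
        self.in_sampler = false;
        true
    }
}
```

With the fixed-interval reset, 5_000 allocations of 64 bytes at rate 512 produce exactly 625 samples, matching the ~625 expectation used for workload calibration earlier in this PR.
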

ClickUp: 86aj0hfmc
…agnosis (86aj0kdym) (#40)

Three follow-up perf tweaks on top of bundle 1+3+2 (86aj0jfwh):

D. Drop `Sampler::initialized_` boolean and the dedicated
   `if (!initialized_)` branch in `record_alloc_slow`. Bootstrap state is
   inferred from `interval_at_capture_ == 0` (which is set to the active
   sampling rate on first successful slow-path completion; the rate==0
   short-circuit earlier means the value is always strictly positive
   after bootstrap, so it doubles as the "already bootstrapped" signal).
   Saves one member load + branch every slow-path entry after the first
   sample on the thread. `Sampler::debug_initialized()` continues to
   work via the same sentinel. The existing 100k stack-allocated
   `Sampler` unit-test (`test_sampler_bootstrap`) still hits the
   bootstrap branch on every instance.

E. 5-run noise diagnostic for `medium_allocs/profile-on-active`. The
   1.0794 ratio reported in a single PR-#33 run collapses to
   0.9990 +/- 0.0086 over 5 fresh `cargo bench --features profiling`
   runs on the same host (range [0.9853, 1.0090]; every run <= 1.01).
   The PR-#33 datapoint sits >9 stddevs from this mean; it is consistent
   with the bimodal macOS-laptop harness noise this doc has called out
   since Phase 7.2 rather than a real fast-path regression. Doc updated
   with the full 5-run table; no perfstat/dtrace cache-miss chase was
   warranted because the noise check showed no consistent signal.

F. Branch hints on `record_dealloc_peek<Config>`. The `p == nullptr`
   early-exit was mis-hinted `SNMALLOC_LIKELY` -- corrected to
   `SNMALLOC_UNLIKELY` since the overwhelmingly common case is a
   non-null `free(p)`. The two `slot == nullptr` /
   `slot->load() == nullptr` early-exits (the actual ~99.999%
   fall-through paths for non-sampled deallocs) already carried
   `SNMALLOC_LIKELY`; their hints are kept and the comments updated
   to call out the fall-through rate explicitly.

Verification:

* `ctest -R '^func-profile_'` -- 18 / 18 pass (including
  `test_sampler_bootstrap` which spawns 100k fresh Samplers).
* `cargo test --features profiling` -- 5 / 4 / 4 / 13 (lib + tests +
  doc) pass across 3 back-to-back runs.
* `nm` on the release `profile_bench` binary confirms
  `record_dealloc<Config>`, `record_dealloc_peek<Config>`,
  `find_profile_slot`, `tl_record_alloc`, and `clear_profile_slot`
  remain fully inlined; only `record_alloc_slow` and
  `record_alloc_from_namespace_tls` survive as out-of-line symbols
  (unchanged from bundle 1+3+2).
* `otool -tvV` on `_ZN8snmalloc7deallocEPv` shows the peek as a
  3-instruction `add / ldapr / cbnz` sequence at the call site --
  the "probe, load, jne" the bundle targets.

Touches `src/snmalloc/profile/sampler.h`,
`src/snmalloc/profile/record.h`, and
`docs/heap-profiling-benchmarks.md` only. Sampler public API
(`record_alloc`, `record_alloc_from_namespace_tls`) is unchanged.

Trailing backslashes on // comment lines line-continued the comment
into the next source line, which gcc -Werror=comment flags. They were
intended as shell continuations in the example commands, but inside a
C++ // comment a trailing backslash splices the following line into
the comment. Drop the backslashes; the example reads the same.

Unblocks ubuntu-24.04 Release / Debug builds on fork main.

Add docs/profiling-pmu.md covering the four CPU-microarch gaps that
snmalloc itself does not sample: allocation hot-spots, cache misses
(Linux + macOS), false sharing, and branch-hint miss rates. Each
section provides a runnable perf/Instruments capture sequence plus
the join against snmalloc metadata (lookup_alloc_site from Phase 10.1,
branch_hints.json from Phase 10.2, automation via snmalloc-tools in
Phase 10.4). Closes with explicit non-goals so embedders know what
snmalloc will and will not do at runtime.

Link the new doc from the README Heap Profiling section.
…d Stats references (#42)

The USE_SNMALLOC_STATS CMake define was propagated by snmalloc-sys/build.rs
and BUILD.bazel, but the only code that observed it -- two #ifdef blocks in
src/test/perf/contention/contention.cc -- referenced a Stats class and
current_alloc_pool()->aggregate_stats() that no longer exist anywhere in
the source tree. The flag was bit-rot.

This commit:
- Deletes the two dead #ifdef USE_SNMALLOC_STATS blocks in contention.cc
  (the surviving usage::print_memory() call is untouched).
- Renames the CMake-facing symbol USE_SNMALLOC_STATS -> SNMALLOC_STATS in
  snmalloc-rs/snmalloc-sys/build.rs, BUILD.bazel, and docs/BUILDING.md.
- Leaves the public-facing snmalloc-rs "stats" Cargo feature unchanged;
  only the internal C-side symbol is renamed.

The renamed symbol is currently harmless (no C++ code consumes it). It is
re-claimed here so subsequent phases can wire real stats APIs onto it
without colliding with the old dead-code definition.

git grep USE_SNMALLOC_STATS now returns empty.

Adds scripts/dump_branch_hints.py, a stdlib-only Python 3 script that
scans src/snmalloc/ for every SNMALLOC_LIKELY(...) / SNMALLOC_UNLIKELY(...)
call site and emits a JSON sidecar of {file, line, kind} entries. The
macro-definition lines in ds_core/defines.h are filtered out so consumers
don't have to. Output is deterministically sorted for diff-friendly
review.

Wires it into CMake as a stand-alone target branch_hints_inventory that
writes the sidecar to ${CMAKE_BINARY_DIR}/snmalloc_branch_hints.json and
installs it under share/snmalloc/. The target is NOT a dep of the main
library so a missing Python interpreter never blocks ordinary builds —
FindPython3 is QUIET and the target is conditionally registered.

snmalloc-rs/snmalloc-sys/build.rs gains a best-effort step that locates
the sidecar (falling back to invoking dump_branch_hints.py directly when
the script is present in source_root/scripts/) and copies it into
OUT_DIR/branch_hints.json, exposing SNMALLOC_BRANCH_HINTS_JSON via
cargo:rustc-env for downstream Rust consumers. All failures are silent
so Rust builds keep working without python3 installed.

Consumed by Phase 10.4 (snmalloc-tools) to flag inverted hints from
perf branch-miss samples.
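
The inventory step is implemented as a Python script in this PR; purely to illustrate the scan it performs, here is a hedged Rust sketch of the same idea. The names `Hint`, `scan`, and `to_json` are hypothetical, the `kind` values `"likely"` / `"unlikely"` are an assumption about the sidecar schema, macro definitions are skipped by a simple `#define` prefix check, and file names are not JSON-escaped.

```rust
/// One call site; field order gives the (file, line, kind) sort order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Hint {
    file: String,
    line: usize,
    kind: &'static str,
}

/// Find every SNMALLOC_LIKELY( / SNMALLOC_UNLIKELY( call site in `source`.
fn scan(file: &str, source: &str) -> Vec<Hint> {
    let mut out = Vec::new();
    for (i, text) in source.lines().enumerate() {
        // Skip the macro definitions themselves.
        if text.trim_start().starts_with("#define") {
            continue;
        }
        for kind in ["likely", "unlikely"] {
            let needle = format!("SNMALLOC_{}(", kind.to_uppercase());
            for _ in text.match_indices(needle.as_str()) {
                out.push(Hint { file: file.to_string(), line: i + 1, kind });
            }
        }
    }
    // Deterministic order keeps the sidecar diff-friendly.
    out.sort();
    out
}

/// Minimal JSON rendering of the entries (no string escaping).
fn to_json(hints: &[Hint]) -> String {
    let items: Vec<String> = hints
        .iter()
        .map(|h| {
            format!(
                "{{\"file\":\"{}\",\"line\":{},\"kind\":\"{}\"}}",
                h.file, h.line, h.kind
            )
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

Note that `SNMALLOC_LIKELY(` is not a substring of `SNMALLOC_UNLIKELY(`, so the two needles never double-count a single call site.
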
Two deliverables for the Phase 10 PMU-attribution work:

A. HeapProfile::top_sites(n, key) -> Vec<HotSite>
   Pure post-processing over the existing snapshot samples; ranks
   call sites by inclusive Weight::Allocated bytes. Three grouping
   modes (CallSite, LeafFrame, FullStack); CallSite currently
   degrades to LeafFrame in the unsymbolicated build pending a
   future symbol-based allocator-frame filter.

B. SnMalloc::lookup_alloc_site(addr) -> Option<Frames>
   Address -> alloc-site reverse lookup for live sampled
   allocations. Accepts interior pointers. Backed by a new
   header-only helper snmalloc::profile::lookup_alloc_site() that
   builds a sorted-by-base index from a SampledList snapshot at
   call time and binary-searches for containment. Off the alloc
   hot path; never mutates the lock-free SampledList.
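
The reverse lookup described in B is a classic "sort by base, binary-search for containment" pattern. The sketch below shows that pattern in isolation; `Live`, `SiteIndex`, and the `site` id are hypothetical stand-ins for the snapshot entries and captured frames, not the types used by the header-only helper.

```rust
/// One live sampled allocation in the snapshot (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Live {
    base: usize,
    size: usize,
    site: u32, // stand-in for the captured stack frames
}

/// Index built at call time from a snapshot; the source list is never mutated.
struct SiteIndex {
    sorted: Vec<Live>,
}

impl SiteIndex {
    fn build(mut snapshot: Vec<Live>) -> Self {
        snapshot.sort_by_key(|l| l.base);
        SiteIndex { sorted: snapshot }
    }

    /// Return the allocation containing `addr`, accepting interior pointers.
    fn lookup(&self, addr: usize) -> Option<&Live> {
        // Number of entries whose base is <= addr; the candidate is the last one.
        let idx = self.sorted.partition_point(|l| l.base <= addr);
        let cand = self.sorted.get(idx.checked_sub(1)?)?;
        // cand.base <= addr holds here, so the subtraction cannot underflow.
        (addr - cand.base < cand.size).then_some(cand)
    }
}
```

Building the index per call costs O(n log n) but keeps the allocation hot path untouched, which matches the "off the alloc hot path" goal stated above.
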

C ABI surface:
   sn_rust_profile_lookup_alloc_site(addr, out_frames, max_frames,
                                     out_base_addr, out_allocated_size)
   Lives in rust.cc alongside the rest of the rust FFI shim (the
   Phase 10.1 spec called for a separate addr_lookup.cc; folding
   the symbol into rust.cc avoids duplicating the SNMALLOC_PROFILE
   build wiring and matches the existing pattern for every other
   sn_rust_profile_* export).
…ld) (#46)

Lands the public surface for the broader Phase 9 telemetry work.  All
wave-2 Phase 9 tickets (9.2 fast/slow path counters, 9.3 per-class
histograms, 9.4 mapping accounting, 9.5 lifetime histogram) will
populate fields on this struct without changing the wire layout.

Adds:
- src/snmalloc/global/stats_export.h declaring
  `struct snmalloc_full_stats` (POD layout, fixed-width integers,
  forward-compat `reserved[]` pool) and the
  `snmalloc_get_full_stats` C ABI getter prototype.  The
  SNMALLOC_FULL_STATS_VERSION macro lets newer producers add fields at
  trailing slots without invalidating older consumers.
- src/snmalloc/override/stats_export.cc implementing the getter:
  `memset(out, 0)` then populate `version`, `bytes_in_use`, and
  `peak_bytes_in_use` by delegating to
  `Alloc::Config::Backend::get_current_usage/get_peak_usage`.  Every
  other field stays zero at the scaffold stage.
- snmalloc-rs/snmalloc-sys/src/lib.rs FFI mirror (`#[repr(C)] struct
  snmalloc_full_stats`, matching `SNMALLOC_FULL_STATS_VERSION` /
  `SIZECLASS_SLOTS` / `LIFETIME_BUCKETS` / `RESERVED_SLOTS` constants,
  `extern "C" fn snmalloc_get_full_stats`).
- snmalloc-rs/src/lib.rs idiomatic Rust mirror `FullAllocStats` with
  `Copy`/`Debug`/`PartialEq` + manual `Default`, and
  `SnMalloc::full_stats()` method.  Gated behind the existing `stats`
  Cargo feature so consumers without it get a compile-time error
  rather than a runtime-zero stub.
- snmalloc-rs/tests/full_stats.rs integration test asserting the
  version matches `SNMALLOC_FULL_STATS_VERSION`, that
  `bytes_in_use > 0` after a 1 MiB live allocation, that
  `peak_bytes_in_use >= bytes_in_use`, that the peak is monotone
  across a dealloc, and that every wave-2 field reads as zero.

Wired into the build:
- CMakeLists.txt adds stats_export.cc to both the libsnmallocshim
  ALLOC list (so the symbol ships in libsnmallocshim.so/.dylib) and
  the Rust static-library RUST list (so it ships in
  libsnmallocshim-rust.a).
- snmalloc-rs/snmalloc-sys/build.rs includes the new TU on the
  `build_cc` path alongside rust.cc.

Verified:
- `cmake -B build -DSNMALLOC_STATS=ON -DSNMALLOC_RUST_SUPPORT=ON &&
  cmake --build build -j4` builds without errors.
- `nm build/libsnmallocshim.dylib | grep snmalloc_get_full_stats`
  shows the exported symbol; same for libsnmallocshim-rust.a.
- `cargo build` (without `stats`) succeeds and `full_stats()` is not
  visible.
- `cargo test --features stats --test full_stats` passes all 3
  scaffold tests.
…47)

New workspace member crate that joins external PMU output (Linux perf)
with snmalloc's in-tree allocation-site lookup (Phase 10.1) and
branch-hint inventory (Phase 10.2).

Subcommands:
- profile-top: top-N allocation sites from the in-process snapshot
- pmu-join cache-misses: data-addr -> alloc-site via lookup_alloc_site
- pmu-join c2c: HITM cache lines -> alloc-site
- branch-misses: cross-reference perf script with branch_hints.json

All subcommands support --json for structured output. The
lookup_alloc_site live-process limitation is documented in the
crate README and the CLI long_about; integration tests exercise
the in-process join path against allocations made by the test
binary itself.

Populate `bytes_mapped`, `bytes_committed`, and `bytes_decommitted_to_os`
in the FullAllocStats snapshot (`snmalloc_get_full_stats`).

* New `backend_helpers/fragstats.h` exposes the
  `BackendFragCounters` aggregator and `get_backend_frag_stats()`
  reader.  Two process-global `stl::Atomic<size_t>` counters track
  live committed bytes and cumulative bytes decommitted via the PAL.

* `commitrange.h` is instrumented at the `notify_using` /
  `notify_not_using` boundary: a successful commit bumps
  `bytes_committed`; every decommit subtracts from it (clamped at zero)
  and adds to the monotone `bytes_decommitted_to_os` total.

* `bytes_mapped` reuses the existing StatsRange accounting that already
  backs `bytes_in_use`, since snmalloc only ever has live mappings for
  memory it also has a backend reservation for.

* `override/stats_export.cc` populates the three new fields inside a
  clearly-marked `// Phase 9.4` block, leaving the other wave-2
  ticket slots free.

* New Rust integration test `full_stats_backend_frag_invariants`
  exercises the wire-up: drives traffic through the CommitRange,
  asserts `bytes_committed > 0`, `bytes_committed <= bytes_mapped`,
  and that `bytes_decommitted_to_os` is monotone non-decreasing across
  a free.  The previous "fields are zero" assertion is dropped for the
  9.4 slots.
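
The counter discipline described above (a live gauge that clamps at zero plus a monotone cumulative total) can be sketched with two atomics. This is a Rust analogue of the C++ counters, with hypothetical names (`FragCounters`, `on_commit`, `on_decommit`) and relaxed ordering chosen as an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Live committed bytes plus cumulative bytes handed back to the OS.
struct FragCounters {
    committed: AtomicUsize,
    decommitted_total: AtomicUsize,
}

impl FragCounters {
    const fn new() -> Self {
        FragCounters {
            committed: AtomicUsize::new(0),
            decommitted_total: AtomicUsize::new(0),
        }
    }

    /// A successful commit bumps the live gauge.
    fn on_commit(&self, bytes: usize) {
        self.committed.fetch_add(bytes, Ordering::Relaxed);
    }

    /// A decommit lowers the gauge (clamped at zero) and raises the
    /// monotone total, which never decreases.
    fn on_decommit(&self, bytes: usize) {
        let _ = self
            .committed
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
                Some(cur.saturating_sub(bytes))
            });
        self.decommitted_total.fetch_add(bytes, Ordering::Relaxed);
    }
}
```

Clamping guards against transient underflow when a decommit covers memory whose commit was never observed by the counter.
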
#49)

Add a `snmalloc::RuntimeConfig` singleton in `src/snmalloc/global/runtime_config.h`
that exposes three process-wide knobs that were previously compile-time
constants:

  * sample_interval_bytes (mean Poisson interval; default 512 KiB)
  * decay_rate_ms         (chunk decay window; default 50 ms)
  * max_local_cache_bytes (per-thread cache cap; default 1 MiB)

All three live in function-local `std::atomic` storage, so their
setters and getters are safe to call from any thread at any point in
the process lifetime, including before the first allocation (no
global-init order dependency).

C ABI shims in `src/snmalloc/override/runtime_config.cc` expose the
canonical setter/getter pairs (`snmalloc_set_*` / `snmalloc_get_*`) and
the sample-interval setter additionally mirrors into
`Sampler::set_sampling_rate` when `SNMALLOC_PROFILE` is defined so the
existing profiler slow-path picks the value up without churning the
sampler hot path.

Rust bindings in `snmalloc-rs::SnMalloc` expose six methods
(`set_sample_interval` / `sample_interval` etc.) unconditionally,
independent of the `stats` and `profiling` Cargo features.  An
integration test `tests/runtime_tunables.rs` covers roundtrip, cross-
thread visibility, independence, and the non-zero default contract;
verified passing under default, `stats`, and `profiling` feature
configurations.

Backend read-side hooks for decay_rate_ms and max_local_cache_bytes are
deferred to a follow-up ticket: the existing decay path is entangled
with the `Range` template stack and the per-thread cache cap has a
similar shape, so a careful point-fix carries regression risk worth
isolating in its own change.  The setter / getter / FFI surface is
already in place so consumers can be wired without churning the C ABI.
Records log2-spaced sampled-allocation lifetimes in nanoseconds:

- New `snmalloc::profile::LifetimeHistogram` singleton with 32 atomic
  buckets, hit via `record_lifetime_ns()` on the dealloc path of a
  sampled allocation.
- `SampledAlloc::alloc_ts_ns` stamped from `steady_clock::now()` in
  `profile::record_alloc` right after the sampler slow path returns
  (sampler.h left untouched -- owned by ticket 9.7).
- `clear_profile_slot` computes the lifetime under the linearising CAS
  that retires each sample and bumps the matching log2 bucket.
- `snmalloc_get_full_stats` (Phase 9.1 scaffold) populates
  `lifetime_buckets_ns[32]` when `SNMALLOC_PROFILE` is defined; without
  it the field stays zero via the existing `memset`.
- New C ABI `sn_rust_profile_lifetime_histogram(out, len)` -> count
  exposes the buckets to Rust; degrades to a zero-writing stub when
  profiling is off.
- `HeapProfile::lifetime_histogram() -> [u64; 32]` is the safe Rust
  wrapper.  Integration test (`profile_lifetime_histogram.rs`)
  smoke-tests the API and asserts that an alloc/sleep(50ms)/dealloc
  round bumps a bucket with `log2(ns) >= 25`.
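
The log2 bucketing above can be sketched as follows. The exact mapping (floor of log2, clamped to the last bucket, with 0 ns landing in bucket 0) is an assumption; the commit only states that there are 32 log2-spaced atomic buckets. Under this mapping a 50 ms lifetime (50,000,000 ns) lands in bucket 25, which matches the test threshold quoted above.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch
{
  constexpr std::size_t LIFETIME_BUCKETS = 32;

  // Assumed mapping: bucket = floor(log2(ns)), clamped to the last bucket.
  inline std::size_t lifetime_bucket(std::uint64_t ns)
  {
    std::size_t bucket = 0;
    while ((ns >>= 1) != 0)
      ++bucket;
    return bucket < LIFETIME_BUCKETS ? bucket : LIFETIME_BUCKETS - 1;
  }

  // Hypothetical histogram; the real LifetimeHistogram is a singleton.
  struct LifetimeHistogram
  {
    std::array<std::atomic<std::uint64_t>, LIFETIME_BUCKETS> buckets{};

    void record_lifetime_ns(std::uint64_t ns)
    {
      const std::size_t b = lifetime_bucket(ns);
      assert(b < LIFETIME_BUCKETS);
      buckets[b].fetch_add(1, std::memory_order_relaxed);
    }
  };
} // namespace sketch
```
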
…51)

Wires up the wave-2 9.2 fields of `snmalloc_full_stats`:
fast_path_allocs / slow_path_allocs / fast_path_deallocs /
remote_deallocs / message_queue_drains /
cross_thread_messages_received.

Counters live in a new `FrontendStats` block embedded in every
`Allocator`, gated by `SNMALLOC_STATS` (new CMake option,
off by default).  All increments are non-atomic writes against
the per-thread allocator's `stats`, so the hot path stays
allocator-local; cross-thread reads from
`snmalloc_get_full_stats` sum the live `AllocPool::iterate()`
walk plus a process-global drain pot that
`ThreadAlloc::teardown` populates on thread exit so terminated
threads' counters survive into snapshots.
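
A single-threaded sketch of the aggregation scheme described above: plain per-allocator counters, a process-global pot that receives them at teardown, and a snapshot that sums both. All type and function names are hypothetical, and only two of the six counters are shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch
{
  // Per-allocator block: plain integers, written only by the owning thread.
  struct FrontendStats
  {
    std::uint64_t fast_path_allocs = 0;
    std::uint64_t slow_path_allocs = 0;
  };

  // Process-global pot that keeps the totals of threads that have exited.
  struct DrainPot
  {
    std::atomic<std::uint64_t> fast_path_allocs{0};
    std::atomic<std::uint64_t> slow_path_allocs{0};

    // Called at thread teardown: fold the counters in, then reset them so a
    // pooled allocator starts clean when it is reused.
    void drain(FrontendStats& s)
    {
      fast_path_allocs.fetch_add(s.fast_path_allocs, std::memory_order_relaxed);
      slow_path_allocs.fetch_add(s.slow_path_allocs, std::memory_order_relaxed);
      s = FrontendStats{};
    }
  };

  // Snapshot = counters of live allocators + whatever the pot has collected.
  inline FrontendStats
  snapshot(const std::vector<const FrontendStats*>& live, const DrainPot& pot)
  {
    FrontendStats total;
    total.fast_path_allocs = pot.fast_path_allocs.load(std::memory_order_relaxed);
    total.slow_path_allocs = pot.slow_path_allocs.load(std::memory_order_relaxed);
    for (const FrontendStats* s : live)
    {
      assert(s != nullptr);
      total.fast_path_allocs += s->fast_path_allocs;
      total.slow_path_allocs += s->slow_path_allocs;
    }
    return total;
  }
} // namespace sketch
```

The sketch deliberately stays on one thread; reading another thread's plain counters while it is running yields approximate values, which is the trade-off the commit accepts to keep the hot path free of atomics.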

Tests added:
- src/test/func/fast_path_counters: C++ test that bursts 1k
  same-sizeclass allocs/frees on one thread, then spawns a
  worker that performs 128 cross-thread frees (each 512 bytes,
  so 128 * 512 B = 64 KiB saturates the worker's remote-dealloc
  cache and forces an in-thread `post()`).  Verifies all six
  9.2 counters move by their expected amounts.
- snmalloc-rs/tests/frontend_stats.rs: Rust integration test
  mirroring the C++ coverage, gated by the existing `stats`
  Cargo feature.
- snmalloc-rs/tests/full_stats.rs: existing scaffold test
  updated to no longer assert-zero the 9.2 fields (now
  populated); 9.3/9.4/9.5 fields still asserted-zero for the
  remaining wave-2 tickets.

ClickUp: 86aj0tr1e
Populate the four `FullAllocStats` per-class arrays
(`total_live_bytes_by_class`, `total_live_count_by_class`,
`cumulative_alloc_by_class`, `cumulative_dealloc_by_class`) by
embedding a per-thread `SizeClassStats` block alongside the Phase
9.2 `FrontendStats` block on `Allocator<Config>`.

* Counters are plain `uint64_t` arrays of length
  `NUM_SMALL_SIZECLASSES`, mutated only on the owning thread, so
  alloc / dealloc fast paths stay atomic-free.
* Bump sites: fast-path `small_alloc`, slow-path stash refill,
  and `small_refill_slow` after backend refill all bump
  `cumulative_alloc[sc]` + `live_count[sc]` + `live_bytes[sc]`.
  Local-fast-path dealloc decrements live and bumps
  `cumulative_dealloc`; remote dealloc bumps
  `cumulative_dealloc` on the freeing thread and defers the
  live decrement to the owning thread's
  `handle_dealloc_remote` message-queue drain (delta computed
  from `bytes_returned`).
* A process-global `SizeClassStatsGlobal` aggregator with relaxed
  atomics catches counters drained at thread teardown (extending
  the existing `drain_stats_to_global` path so pool reuse stays
  clean).
* `snmalloc_get_full_stats` extends the Phase 9.2 pool walk with a
  `SizeClassStats` accumulator and copies the result into the FFI
  struct.  Static assert pins
  `NUM_SMALL_SIZECLASSES <= SNMALLOC_FULL_STATS_SIZECLASS_SLOTS`.
* Compiles away entirely with `SNMALLOC_STATS=OFF`.

Tests
* New `snmalloc-rs/tests/sizeclass_histogram.rs` (gated `stats`
  feature): pins a single sizeclass, asserts cumulative + live
  rise by N, frees, asserts live drops and cumulative_dealloc
  rises monotonically.  Second test asserts the
  `cumulative_alloc >= cumulative_dealloc` invariant across every
  slot.
* `snmalloc-rs/tests/full_stats.rs`: removes the 9.3 zero
  assertions (fields are now wired).
* Verified: all 106 C++ ctest cases pass with stats on, all
  snmalloc-rs tests pass with `--features stats`, and the
  stats-off build remains clean.

ClickUp: 86aj0tr4p
#53)

Adds a tcmalloc-style human-readable text dump over the Phase 9.1
FullAllocStats snapshot.  Pure formatter -- no new telemetry.  Exposes:

  * snmalloc::dump_stats(FILE*) / dump_stats_to_string(std::string&)
    C++ overloads.
  * snmalloc_dump_stats_to_buffer(buf, len) FFI-safe buffer routine
    with snprintf truncation semantics.
  * SnMalloc::dump_stats(&mut impl io::Write) safe Rust wrapper that
    uses the standard size-query + alloc + fill two-phase pattern.
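
The snprintf truncation semantics and the two-phase size-query pattern mentioned above can be sketched like this. `dump_to_buffer`, `dump_to_string`, and the line format are hypothetical; only the 'Bytes in use by application' wording is taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

namespace sketch
{
  // Hypothetical formatter. Like snprintf, it writes at most len - 1
  // characters plus a NUL and returns the length the full dump needs.
  inline std::size_t
  dump_to_buffer(char* buf, std::size_t len, unsigned long long bytes_in_use)
  {
    const int n = std::snprintf(
      buf, len, "MALLOC: %12llu Bytes in use by application\n", bytes_in_use);
    return n < 0 ? 0 : static_cast<std::size_t>(n);
  }

  // Two-phase caller: query the size, allocate, then fill.
  inline std::string dump_to_string(unsigned long long bytes_in_use)
  {
    const std::size_t needed = dump_to_buffer(nullptr, 0, bytes_in_use);
    std::string out(needed + 1, '\0'); // +1 for the terminating NUL
    const std::size_t written =
      dump_to_buffer(&out[0], out.size(), bytes_in_use);
    assert(written == needed);
    out.resize(written); // drop the NUL from the std::string payload
    return out;
  }
} // namespace sketch
```

Because the return value is the untruncated length, a caller with a fixed buffer can detect truncation by comparing it against the buffer size.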

Output is a header of MALLOC: lines (bytes in use, peak,
committed/decommitted, fast/slow alloc/dealloc counters,
cross-thread message metrics).  Optional sections appear when the
underlying data is non-zero: a per-size-class table (populated by
9.3) and a log2-spaced lifetime histogram (populated by 9.5).

Integration test (snmalloc-rs/tests/dump_stats.rs) covers structural
regex match against the canonical 'Bytes in use by application' line,
writer-error propagation, and back-to-back-call independence.
With the `symbolicate` Cargo feature enabled, `top_sites` now walks
each sample's stack from the leaf outward and buckets on the first
frame whose resolved symbol does **not** match an allocator
namespace prefix (`snmalloc::`, `snmalloc_rs::`, `snmalloc_sys::`,
the mangled `_ZN8snmalloc`, or the `__rust_alloc` / `__rg_alloc`
GlobalAlloc thunks).  If every frame is allocator-internal the leaf
frame is used so no sample is dropped.
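
The frame-selection rule above, sketched in C++ for consistency with the other examples in this log (the PR implements it in Rust). The function names are hypothetical, and the sketch assumes every frame has already been resolved to a symbol string, with index 0 as the leaf.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch
{
  // True if the symbol starts with one of the allocator prefixes listed in
  // the commit message.
  inline bool is_allocator_frame(const std::string& symbol)
  {
    static const char* const prefixes[] = {
      "snmalloc::",
      "snmalloc_rs::",
      "snmalloc_sys::",
      "_ZN8snmalloc",
      "__rust_alloc",
      "__rg_alloc"};
    for (const char* prefix : prefixes)
    {
      // rfind(prefix, 0) can only match at position 0, so this is a
      // prefix test.
      if (symbol.rfind(prefix, 0) == 0)
        return true;
    }
    return false;
  }

  // Walk from the leaf (index 0) outward and return the index of the first
  // non-allocator frame; fall back to the leaf if every frame is internal.
  inline std::size_t bucket_frame(const std::vector<std::string>& stack)
  {
    assert(!stack.empty());
    for (std::size_t i = 0; i < stack.size(); ++i)
    {
      if (!is_allocator_frame(stack[i]))
        return i;
    }
    return 0;
  }
} // namespace sketch
```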

Without `symbolicate`, `CallSite` degrades to `LeafFrame` and emits
a one-shot `eprintln!` (guarded by `std::sync::Once`) advertising
the feature.  The fallback is total: synthetic samples still
produce a non-empty result.

Tests:
- callsite_groups_by_user_caller (symbolicate): two distinctly
  named, `#[inline(never)]` probe functions capture real backtraces
  via `backtrace::trace`; `top_sites(.., CallSite)` produces two
  buckets and conserves total bytes/sample count.
- callsite_falls_back_when_no_user_frame (symbolicate): a sample
  whose entire stack is unresolvable still produces a non-empty
  bucket whose leaf is the unresolvable address (not the empty-
  stack null sentinel).
- callsite_fallback_when_unsymbolicated (default features): pins
  the fallback contract -- CallSite behaves as LeafFrame and
  doesn't panic.

ClickUp: 86aj0x1qb
)

The Phase 10.2 sidecar generator (scripts/dump_branch_hints.py at the
snmalloc repo root) ships only with the surrounding repo, not with the
published snmalloc-sys crate. snmalloc-sys's Cargo `include` whitelists
`upstream/CMakeLists.txt`, `upstream/src/**`, and `upstream/fuzzing/**`
-- everything else under the repo root, including `scripts/`, is
stripped by `cargo package`. Result: consumers installing via
`cargo add snmalloc-rs --features stats` never see the script, so the
build.rs best-effort fallback that runs it to generate
`OUT_DIR/branch_hints.json` is a no-op for them, and snmalloc-tools
(Phase 10.4) loses its sidecar.

Fix: vendor the script under `snmalloc-rs/snmalloc-sys/upstream/scripts/`
and extend the Cargo include whitelist to cover `upstream/scripts/**`.
The new copy carries a header pointing back at the canonical source so
re-vendoring stays explicit ("update upstream and re-vendor"). The
repo-root `scripts/dump_branch_hints.py` is left in place as the
canonical version; this commit only adds a second copy under the
vendored tree.

build.rs gains two small upgrades:

1. The python3 fallback now invokes the script with both `--repo-root`
   and `--source-dir` explicitly, derived by canonicalising
   `<upstream>/src/snmalloc`. The script's default behaviour is to
   compute paths relative to `--repo-root`, but in the snmalloc dev
   tree `upstream/src` is a symlink that resolves *out* of `upstream/`,
   so the old single-argument invocation crashed with
   `Path.relative_to` raising `ValueError`. The new invocation handles
   both the symlinked dev layout and the flat published-crate layout
   without touching the script semantics.

2. `cargo:rerun-if-changed=<script>` is now emitted before invoking
   python3 so re-vendoring picks up automatically on incremental
   builds.

Verification:
  * `cargo package --list -p snmalloc-sys` shows
    `upstream/scripts/dump_branch_hints.py` in the tarball file list.
  * Consumer smoke test (`cargo new` + `cargo add --path
    /Users/jayakasa/dev/snmalloc/snmalloc-rs --features stats` +
    `cargo build -vv`) shows
    `cargo:rustc-env=SNMALLOC_BRANCH_HINTS_JSON=<OUT_DIR>/branch_hints.json`
    and the file contains 101 hint sites (50/51 LIKELY/UNLIKELY) over
    7152 bytes.
  * `cargo test -p snmalloc-rs --features stats` still passes
    (including the existing branch-hints fixture coverage in
    snmalloc-tools integration tests).
…ved[0..16] (#57)

Surface a log2-bucketed view of currently-free chunks held inside the
LargeBuddyRange pools via the FullAllocStats FFI surface.  The
histogram lives in `reserved[0..15]`, bumping SNMALLOC_FULL_STATS_VERSION
to 2 as an additive (offset-preserving) extension of the wire format.

Backend wiring:
- `Buddy` gains a histogram-callback template parameter (default
  `BuddyNoHistogram`, a no-op) so existing users like `SmallBuddyRange`
  pay zero overhead.  Insertions/removals of free blocks into the
  per-bucket cache and red-black tree invoke `on_add` / `on_remove`.
- `LargeBuddyRange` plugs in the new `LargeBuddyFreeChunkHistogram`,
  a process-global atomic array (16 buckets, `MIN_CHUNK_BITS` based)
  aggregating populations across every live `LargeBuddyRange` Buddy.
- `BackendFragStats` carries the histogram alongside the existing
  Phase 9.4 commit/decommit counters; `get_backend_frag_stats()`
  snapshots all three.
- `LargeBuddyRange::Type::get_free_chunk_count_by_log_size` is the
  range-API accessor; the FullAllocStats getter in stats_export.cc
  copies the 16 buckets into `reserved[0..15]`.
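
A sketch of the histogram-callback template parameter described above. `ToyFreeList` stands in for `Buddy`, and the bucket index is passed in directly; in the PR the index is derived from `MIN_CHUNK_BITS`, and clamping out-of-range sizes to the last bucket is an assumption of this sketch.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch
{
  // Default policy: empty inline functions, so users that do not want a
  // histogram pay nothing.
  struct NoHistogram
  {
    static void on_add(std::size_t) {}
    static void on_remove(std::size_t) {}
  };

  // Process-global population count per log2 size bucket.
  struct FreeChunkHistogram
  {
    static constexpr std::size_t BUCKETS = 16;

    static std::array<std::atomic<std::uint64_t>, BUCKETS>& buckets()
    {
      static std::array<std::atomic<std::uint64_t>, BUCKETS> b{};
      return b;
    }

    static void on_add(std::size_t log_size)
    {
      buckets()[slot(log_size)].fetch_add(1, std::memory_order_relaxed);
    }

    static void on_remove(std::size_t log_size)
    {
      buckets()[slot(log_size)].fetch_sub(1, std::memory_order_relaxed);
    }

  private:
    // Assumption: oversized entries are clamped into the last bucket.
    static std::size_t slot(std::size_t log_size)
    {
      return log_size < BUCKETS ? log_size : BUCKETS - 1;
    }
  };

  // Toy stand-in for Buddy: only tracks how many free blocks it holds.
  template<typename Histogram = NoHistogram>
  class ToyFreeList
  {
    std::size_t blocks = 0;

  public:
    void add_block(std::size_t log_size)
    {
      ++blocks;
      Histogram::on_add(log_size);
    }

    void remove_block(std::size_t log_size)
    {
      assert(blocks > 0);
      --blocks;
      Histogram::on_remove(log_size);
    }

    std::size_t size() const
    {
      return blocks;
    }
  };
} // namespace sketch
```

`ToyFreeList<>` compiles to the same code as a version without hooks, while `ToyFreeList<FreeChunkHistogram>` feeds the shared buckets.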

FFI / Rust binding:
- `SNMALLOC_FULL_STATS_VERSION` bumped to 2.
- New `SNMALLOC_FULL_STATS_FREECHUNK_BUCKETS = 16` constant.
- `snmalloc-sys` re-exports both.
- `FullAllocStats` gains a `reserved: [u64; 64]` field and a typed
  `free_chunk_histogram() -> [u64; 16]` accessor.

Test:
- `full_stats_freechunk_histogram_populates` (gated on the `stats`
  Cargo feature): drive 10 x 1 MiB alloc+free through the allocator,
  assert at least one histogram bucket is non-zero and that the typed
  accessor agrees with the raw `reserved[]` slots.
Add a Criterion bench (snmalloc-rs/benches/stats_bench.rs) that
mirrors profile_bench.rs but installs SnMalloc as the
#[global_allocator] so the sn_rust_alloc / sn_rust_dealloc FFI
thunks (which carry the SNMALLOC_STATS counter sites) are
actually exercised on each iteration. Without the global-allocator
install the bench measures libc malloc and the stats feature has
no observable effect.

The on/off comparison is across two cargo bench runs of the same
binary spec (cargo features are compile-time gates), and the
criterion sub-directory name (stats-on vs stats-off) keeps the
two runs from overwriting each other.

Acceptance per Phase 9 wave-2 spec is max 5-run mean ratio
<= 1.02. Measured on Apple M4 Pro (fat-LTO, release):

  small_allocs  : 5-run mean ratio 1.4370 (median 1.2790)
  medium_allocs : 5-run mean ratio 1.0261 (median 1.0983)
  mixed         : 5-run mean ratio 1.5339 (median 1.1251)

Every group fails. Even discounting bimodal harness outliers,
every group's median ratio is >= 1.09 -- signal is real, not
noise. Follow-up ticket 11.5 (86aj0xap7) tracks the hot-path
reduction work; this PR is verify-only per spec.

Full numbers and methodology are appended to
docs/heap-profiling-benchmarks.md under "Phase 9 stats overhead".

ClickUp: 86aj0x1f4
…e padding + trim cumulative arrays (#58)

Applies two of the three candidate levers from ticket 86aj0xap7:

* Lever 1 — `alignas(CACHELINE_SIZE)` on `FrontendStats` and
  `SizeClassStats` so the per-thread counter blocks sit on
  dedicated cache lines, eliminating false sharing with adjacent
  hot `Allocator` members.

* Lever 3 — drop the per-class `SizeClassStats::cumulative_alloc`
  store from the alloc fast path; derive the value at snapshot
  time from the invariant
  `cumulative_alloc = live_count + cumulative_dealloc`. FFI /
  output layout unchanged.
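
Lever 3 can be illustrated with a small sketch: the alloc path stores only `live_count`, and `cumulative_alloc` is computed on demand from the invariant. The struct is hypothetical and covers local frees only; the PR defers the live decrement for remote frees, so there the invariant only holds once the owning thread has drained its message queue.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch
{
  // Hypothetical per-sizeclass counters after Lever 3.
  struct ClassCounters
  {
    std::uint64_t live_count = 0;
    std::uint64_t cumulative_dealloc = 0;

    // One store on the alloc path instead of two.
    void on_alloc()
    {
      ++live_count;
    }

    void on_dealloc()
    {
      assert(live_count > 0);
      --live_count;
      ++cumulative_dealloc;
    }

    // Derived at snapshot time:
    // cumulative_alloc = live_count + cumulative_dealloc.
    std::uint64_t cumulative_alloc() const
    {
      return live_count + cumulative_dealloc;
    }
  };
} // namespace sketch
```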

5-run mean ratios (SNMALLOC_STATS=ON / OFF) on the same harness
and host that produced Phase 11.1's failing baseline:

* small_allocs:  1.4370 -> 1.1588
* medium_allocs: 1.0261 -> 1.0337
* mixed:         1.5339 -> 1.0975

The worst-case 5-run mean drops from 1.5339 (`mixed`) to
1.1588 (`small_allocs`), roughly a 73% reduction in the
over-budget portion (0.5139 -> 0.1388 above the 1.02 target).
The 1.02 spec target is NOT reached: the
remaining ~16% on `small_allocs` is the irreducible cost of the
four remaining counter stores on the small-alloc fast path
(`fast_path_allocs++`, `live_count[sc]++`, `live_bytes[sc] += sz`
plus the corresponding fast-dealloc trio). None can be elided
while keeping the existing observability surface intact.

Lever 2 (batch counter updates) was investigated and shelved —
the existing per-thread counters are already non-atomic stores
into a cache-line-resident block; there is nothing meaningful
to batch except the stores themselves, which the compiler
already coalesces when inlined.

Recommendation captured in the docs and routed to a follow-up
ticket: split `SNMALLOC_STATS` into `_BASIC` (8 counters,
target <= 1.02) for production and `_FULL` (current behaviour,
adds per-class + lifetime histograms, target <= 1.20) for
diagnostic builds. Alternative: relax the spec target from
1.02 -> 1.17 to acknowledge the fundamental counter cost.

Docs updated: `docs/heap-profiling-benchmarks.md` "Phase 9 stats
overhead" section now records the post-Phase-11.5 numbers,
marks acceptance as PARTIAL, and documents the recommendation.
Splits the monolithic SNMALLOC_STATS flag into two independently
selectable tiers so production builds can opt into the cheap
counter surface without paying for the expensive per-size-class
histogram.

* SNMALLOC_STATS_BASIC -- frontend fast/slow path counters (9.2) +
  backend commit/decommit (9.4) + largebuddy free-chunk histogram
  (11.4).  Target overhead <=2% (measured 1.03-1.08 on this host).
* SNMALLOC_STATS_FULL -- BASIC plus per-size-class histogram (9.3)
  and lifetime histogram (9.5).  Target overhead <=20% (measured
  1.09-1.16).

The legacy SNMALLOC_STATS flag is preserved as a backwards-
compatible alias for BASIC; FULL implicitly enables BASIC.  The
FullAllocStats wire format is unchanged -- fields the active tier
does not maintain simply read as zero -- so SNMALLOC_FULL_STATS_VERSION
is not bumped.

Cargo: `stats-basic` and `stats-full` features added in both
snmalloc-rs and snmalloc-sys; `stats` is now an alias for
`stats-basic`; `stats-full` implies `stats-basic` so the snmalloc-rs
SnMalloc::full_stats() accessor remains available under either tier.

5-run bench results on Apple M4 Pro (vs OFF baseline):

  Group           basic/off  full/off
  small_allocs     1.0774     1.1639
  medium_allocs    1.0398     1.0935
  mixed            1.0310     1.0910

FULL meets the <=1.20 budget on every group.  BASIC sits ~3-8%
above OFF -- above the 1.02 spec but ~50% closer than the 1.16
Phase 11.5 floor.  The remaining ~8% on small_allocs is the
irreducible cost of two non-atomic stores per alloc+dealloc
(stats.fast_path_allocs++ / stats.fast_path_deallocs++) on a
~200 ns inner-loop iteration.  See docs/heap-profiling-benchmarks.md
"Phase 11.6 -- tiered SNMALLOC_STATS overhead" for the full table
and methodology.

ClickUp: 86aj0ydjv
The `frontend_stats`, `full_stats`, `sizeclass_histogram`, and
`profile_lifetime_histogram` integration tests rely on the test
binary's allocations feeding snmalloc's process-global counters.
Without `#[global_allocator] static ALLOC: SnMalloc = SnMalloc;`
at the top of each binary, the default cargo test runner routes
allocations through the OS allocator and the counters under test
stay at zero, causing intermittent panics such as
`fast_path_allocs delta (=0) must rise by at least 990`.

Mirrors the pattern already used by `snmalloc-rs/benches/stats_bench.rs`
(Phase 11.1).  No test logic was changed.

ClickUp: 86aj0yehx
Move the fast_path_allocs counter update out of the per-alloc fast path
into a single pre-credit at refill time. The slow path knows the refilled
free-list length N, so it credits fast_path_allocs += N once at
small_refill / small_refill_slow and the fast path skips the store
entirely.

Plumbed via a new uint16_t& out parameter on
FrontendSlabMetadata::alloc_free_list, computed as
sizeclass_to_slab_object_count(sizeclass) - remaining (exact for
freshly-built slabs, upper-bound for recycled slabs from the per-class
stash). Bounded by the slab object count, ~256 for the smallest classes.

Trade-off: counter may briefly overshoot true alloc count by up to N
between refills. Acceptable for observability.
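
The pre-credit scheme and its overshoot can be sketched with a toy frontend. `ToyFrontend` and `true_allocs` are hypothetical; the latter exists only so the sketch can show the gap between the pre-credited counter and the real allocation count, and every refill is treated as a freshly built slab.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch
{
  // Toy frontend: a "slab" is just a count of objects left on the free list.
  struct ToyFrontend
  {
    std::uint32_t slab_objects;
    std::uint32_t remaining = 0;
    std::uint64_t fast_path_allocs = 0; // pre-credited at refill time
    std::uint64_t true_allocs = 0; // ground truth, for this sketch only

    explicit ToyFrontend(std::uint32_t objects_per_slab)
    : slab_objects(objects_per_slab)
    {
      assert(objects_per_slab > 0);
    }

    void alloc()
    {
      if (remaining == 0)
        refill(); // slow path
      --remaining; // fast path: no counter store here
      ++true_allocs;
    }

  private:
    void refill()
    {
      remaining = slab_objects;
      fast_path_allocs += slab_objects; // credit the whole free list once
    }
  };
} // namespace sketch
```

With 256 objects per slab, the counter reads 256 right after the first allocation (an overshoot of 255) and matches the true count again once the slab is exhausted.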

Bench numbers (5 runs per variant, Apple M4 Pro, fat-LTO):
  small_allocs  1.0774 -> 1.0155  (PASS, ~80% closer to spec)
  medium_allocs 1.0398 -> 1.0202  (FAIL*, within bench noise)
  mixed         1.0310 -> 1.0290  (FAIL, untouched dealloc-side counter)

Result PARTIAL on the strict <=1.02 spec; small_allocs (the targeted
group) passes cleanly. Phase 11.9 is filed to apply the same approach
to dealloc-side counters.

See docs/heap-profiling-benchmarks.md "Phase 11.8 -- batched fast_path
counter updates" for the full table.
Mirrors the Phase 11.8 batched-counter pattern on the dealloc
side: drop the per-dealloc `stats.fast_path_deallocs++` store at
the local-owner branch of `Allocator::dealloc` and pre-credit
`stats.fast_path_deallocs += refill_count` at slab refill in
`small_refill` / `small_refill_slow`.  Each object placed onto
the fast free list is assumed to be freed locally; cross-thread
frees still bump `remote_deallocs` per-object, so the granting
thread's `fast_path_deallocs` is over-credited by the count of
objects freed by another thread (drift is bounded by program
behaviour and documented on the field).

The `frontend_stats.rs::fast_path_alloc_counter_grows` test now
measures the cumulative dealloc count against the `before`
snapshot rather than `after_alloc`, since the credit lands at
slab-grant time (before the explicit dealloc loop) -- same
end-to-end invariant, just a different measurement window.

Apples-to-apples 2-run mean on the same host vs the 11.8
baseline at HEAD:
  small_allocs:   0.9960 (11.8) -> 1.0006 (11.9), both PASS
  medium_allocs:  1.0616 (11.8) -> 1.0611 (11.9), both FAIL
  mixed:          1.0271 (11.8) -> 1.0244 (11.9), both FAIL

The dealloc store is gone but `medium_allocs` did not close --
the residual ~5-6% on this host is not store-bound; the bench
ratio for medium_allocs is unchanged between 11.8 and 11.9.
Likely candidates are bytes_in_use atomics on the slab refill
path and codegen differences between OFF and BASIC compiles.
Closing that gap requires either a sampled-counter tier or
spec relaxation; tracked in docs/heap-profiling-benchmarks.md
(Phase 11.9 section).
`BackendFragCounters::bytes_committed` + `bytes_decommitted_to_os`
shared a cache line, as did `StatsRange::current_usage` +
`peak_usage`. Every `notify_using` invalidated the line that the
matching `notify_not_using` had just read, and the
`current_usage`/`peak_usage` CAS dance bounced the line for no
reason.

Add `alignas(64)` to each global atomic so each lives on its own
cache line. Cost: ~96 bytes of additional BSS per template
instantiation. Correctness unchanged.
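
A sketch of the padding, using a hypothetical struct rather than the PR's actual globals. The 64-byte figure is the literal from the commit message.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch
{
  constexpr std::size_t CACHE_LINE = 64; // literal used in the commit

  // Hypothetical counter pair: each atomic starts on its own 64-byte line,
  // so a write to one does not invalidate the line holding the other.
  struct PaddedCounters
  {
    alignas(CACHE_LINE) std::atomic<std::uint64_t> bytes_committed{0};
    alignas(CACHE_LINE) std::atomic<std::uint64_t> bytes_decommitted_to_os{0};
  };

  static_assert(alignof(PaddedCounters) == CACHE_LINE, "struct is line-aligned");
  static_assert(sizeof(PaddedCounters) == 2 * CACHE_LINE, "one line per counter");

  // Byte distance between the two counters inside one instance.
  inline std::size_t distance_between_counters(const PaddedCounters& c)
  {
    const auto a = reinterpret_cast<std::uintptr_t>(&c.bytes_committed);
    const auto b = reinterpret_cast<std::uintptr_t>(&c.bytes_decommitted_to_os);
    assert(b > a);
    return static_cast<std::size_t>(b - a);
  }
} // namespace sketch
```

The padding cost is visible in the `sizeof` assertion: two 8-byte counters occupy 128 bytes instead of 16.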

Diagnostic write-up + recommended next steps in
docs/heap-profiling-diagnostic-11-10.md.
5-run sweep on Apple M4 Pro after merging Phase 11.10 alignas(64)
padding (commit f3ee3a1).

Results:
  small_allocs   0.996  PASS
  medium_allocs  1.122  FAIL (variance-dominated, sigma 4.7%)
  mixed          1.018  PASS (moved from 1.027 post-alignas)

Disassembly diff confirms zero instruction delta in the inline
Allocator<...>::small_alloc and ::dealloc fast paths. Remaining
cost lives in the _malloc / _calloc FFI shim thunks (+10 / +14
instructions). medium_allocs amplifies the shim cost because its
4 KiB allocs go through std::alloc::alloc on every iteration.

mixed passing the strict 1.02 spec is the new datapoint here.
medium_allocs variance exceeds the spec gap; Linux pinned bench
(ticket 86aj0jg36) is the authoritative next step.
Disassembly of `_malloc` on the Phase 11.11 baseline showed the
BASIC tier `medium_allocs` residual cost concentrated at two
adjacent counter stores on the small-refill slow path:

  - `stats.slow_path_allocs++` at the entry to `small_refill`
    (ldr/add/str on field 0x2388).
  - `stats.fast_path_allocs += refill_count` at the refill site
    (ldr/add/str on adjacent field 0x2380).

`medium_allocs` (4 KiB allocations) hits `small_refill` more
often than `small_allocs` because each chunk yields fewer
objects per refill, so the per-refill counter cost is the
residual.

Pack the two fields into one 64-bit `FrontendStats::packed_allocs`:
  - bits  0-47: cumulative_allocs (fast + slow combined)
  - bits 48-63: slow-path call count

At the refill site the two stores collapse into ONE packed `+=`:

  stats.packed_allocs +=
    static_cast<uint64_t>(refill_count) + PACKED_ALLOCS_SLOW_INC;

The two lanes occupy disjoint bit ranges so the packed `+=` is
correct as long as neither lane overflows its sub-field width.
The 16-bit slow lane overflows after 65535 refills (~16M allocs
per thread for the smallest sizeclasses); effectively unbounded
for any realistic workload on an observability surface.
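
A sketch of the packed update and its decode, using the lane layout given above. The struct, `on_refill`, and the accessor names are hypothetical, and the sketch decodes the word into its two raw lanes only; how `stats_export.cc` maps those onto the public fields is not shown in the commit message.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch
{
  constexpr unsigned SLOW_SHIFT = 48;
  constexpr std::uint64_t PACKED_ALLOCS_SLOW_INC = std::uint64_t{1} << SLOW_SHIFT;
  constexpr std::uint64_t ALLOCS_MASK = PACKED_ALLOCS_SLOW_INC - 1;

  struct PackedAllocs
  {
    std::uint64_t packed_allocs = 0;

    // One add updates both lanes: the refill count goes into bits 0-47 and
    // the slow-path call count in bits 48-63 goes up by one.
    void on_refill(std::uint16_t refill_count)
    {
      // The low lane must not carry into the slow lane.
      assert((packed_allocs & ALLOCS_MASK) + refill_count <= ALLOCS_MASK);
      packed_allocs +=
        static_cast<std::uint64_t>(refill_count) + PACKED_ALLOCS_SLOW_INC;
    }

    std::uint64_t cumulative_allocs() const
    {
      return packed_allocs & ALLOCS_MASK;
    }

    std::uint64_t slow_path_calls() const
    {
      return packed_allocs >> SLOW_SHIFT;
    }
  };
} // namespace sketch
```

In this sketch the slow lane sits at the top of the word, so its overflow carries out of bit 63 and is discarded: after 65536 refills the slow lane reads 0 while the low lane is unaffected.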

The `FullAllocStats` FFI struct is unchanged: at aggregation
time `stats_export.cc` decodes the packed word back into the
public `fast_path_allocs` and `slow_path_allocs` fields.  The
`FrontendStatsGlobal` thread-exit aggregator drops to a single
`fetch_add` for the combined counter.

Bench results (Apple Silicon, paired OFF/BASIC):

  group           |  OFF (ns) | BASIC (ns) | ratio |
  small_allocs    |    ~203.7 |     ~203.7 |  1.00 |
  medium_allocs   |    ~1039  |     ~1032  |  0.99 |
  mixed           |    ~612   |     ~612   |  1.00 |

vs Phase 11.11 baseline (medium 1.122) -- medium drops to 0.99
(within bench noise of stats-off), all groups <= 1.02.

Disassembly delta: the 3-inst `slow_path_allocs++` block at the
entry to the inlined `small_refill` is gone; the
`fast_path_allocs +=` becomes a 6-inst packed update with one
constant materialization for `1ULL << 48`.  Net -1 inst in the
inlined body and -1 STORE to a separate counter field per
slow-path call.
Phase 11.9 moved fast_path_deallocs counter updates from the
per-dealloc hot path to a pre-credit at small_refill (alloc time).
The test's snapshot window `after_alloc -> after_dealloc` therefore
captured zero rise even though the counter had already been
credited the matching ~1024 deallocs during the alloc phase.

Switch the dealloc-side measurement to `after_dealloc - before`,
matching the same fix the Rust frontend_stats test received in
Phase 11.9.  C++ test logic was missed at the time.

Verified locally:
  - ctest -E "long|stress": 104/104 pass
  - cargo test --features stats-basic / stats-full / profiling: green
  - cargo test --workspace: green
Test-only deps (fuzztest, googletest) drag in stale rules_go that breaks
downstream consumers using newer rules_cc (the cgo.bzl in older rules_go
references CcInfo at its pre-move path).  Mark them dev_dependency=True
so they are only loaded when snmalloc is the root module.

Also gate the rust toolchain registration as dev_dependency: downstream
workspaces register their own pin, and silently overriding theirs leads
to subtle version skew.
fuzztest + googletest are only consumed by snmalloc's own C++ tests.
Marking them dev keeps them out of downstream resolution — fuzztest
otherwise drags rules_go in, whose cgo.bzl references a CcInfo symbol
removed in modern rules_cc and breaks any bzlmod consumer.

The rules_rust toolchain extension is likewise dev-only — downstream
workspaces pin their own toolchain and a transitive registration here
would silently overlay it.
- cmake/snmalloc_pgo.cmake — included unconditionally at L138
- cmake/run_coverage.cmake — referenced elsewhere