Heap profiling + tcmalloc-style telemetry parity (Phases 2–11) by jayakasadev · Pull Request #857 · microsoft/snmalloc

jayakasadev · 2026-06-12T19:54:22Z

Summary

Mega-PR landing the full heap-profiling + tcmalloc-style telemetry stack from the jayakasadev/snmalloc development fork onto microsoft/snmalloc:main. 65 squash commits, 113 files, +27,141 / -48 lines.

Caveat up front: this is intentionally large. If the maintainers prefer review by phase, the PR can be split in a follow-up; the per-phase commits are listed below so they can be cherry-picked individually. Upstream PR #852 (rust-heap-profiling-infra) is the partial Phase 2 predecessor of this work and would be superseded if this PR merges.

Phases shipped

Phase 2: C++ sampling infrastructure (PRs #2/#3/#4 on fork)
- Per-thread Poisson sampler (Sampler class) with bytes_until_sample_ countdown
- Lock-free SampledList + pre-allocated node pool
- Re-entrancy guard for backtrace()-style stack walkers
- Pluggable stack walker abstraction (FP-walk default, libunwind/backtrace/CaptureStackBackTrace opt-in)
- LazyArrayClientMetaDataProvider primitive (zero slab-meta bytes when profile inactive)
- aarch64 PAC handling on Apple Silicon
Phase 3: Allocation hooks + C exports (PRs #5–9)
- ProfilingConfig with lazy provider
- Single-chokepoint instrumentation: snmalloc::alloc(size_t) + Allocator::dealloc H1–H4 sites
- Covers realloc / calloc / aligned_alloc / posix_memalign / large alloc / GWP-ASan secondary / slow-path recursion
- SNMALLOC_PROFILE CMake gate + CI matrix entries
Phase 4: Rust snapshot API (PRs #10–16)
- profiling Cargo feature + snmalloc-sys FFI declarations
- HeapProfile + BtSample + snapshot()/set_sampling_rate()
- write_flamegraph() folded-stack output
- Dump-time symbolicator (backtrace crate)
- Runtime config (env vars + SnMalloc::configure_profiling())
- Speedscope + Inferno round-trip tests
Phase 5: Streaming allocation mode (PRs #17, #20)
- AllocationSampleList C++ + ReportMalloc broadcast
- sn_rust_profile_start/stop C exports + Rust ProfilingSession
Phase 6: pprof output (PRs #18, #21)
- HeapProfile::write_pprof() + pprof proto encoding
- go tool pprof integration test
Phase 7: Performance hardening (PRs #19, #22–24)
- Cache-line placement of bytes_until_sample_
- Criterion bench suite (snmalloc-rs/benches/profile_bench.rs)
- Snapshot-under-churn TSan + ASan stress test
- CI matrix expansion (Linux + macOS, gcc + clang, SNMALLOC_PROFILE=ON/OFF)
- Profile fast-path overhead measured at ~0% within bench noise (docs/heap-profiling-benchmarks.md)
Phase 8: Documentation (PRs #25, #26)
- README profiling section + sampling-rate guidance + viewer tooling
- Rust doc examples for snapshot()/write_flamegraph()/write_pprof()
- Release notes deferred until this PR's review concludes
Phase 9: Allocator-side telemetry parity with tcmalloc (PRs #42, #46, #48–53)
- FullAllocStats typed struct + C ABI + Rust binding
- Per-thread frontend cache stats (fast/slow path + remote + msg-queue counters)
- Per-size-class histogram (live + cumulative alloc/dealloc, FULL tier only)
- Backend fragmentation (mapped/committed/decommitted_to_os)
- Sample lifetime histogram (log2 buckets, profile-gated)
- Text dump API (snmalloc::dump_stats / SnMalloc::dump_stats, tcmalloc-style MALLOC: lines)
- Runtime tunables (sample rate, decay rate, max local cache)
- USE_SNMALLOC_STATS → SNMALLOC_STATS rename (cleanup of dead aggregate_stats refs)
Phase 10: PMU-backed CPU-microarch profiling (PRs #41, #43, #44, #47)
- Hot-spot table API + lookup_alloc_site(addr) reverse lookup
- Build-time SNMALLOC_LIKELY/UNLIKELY inventory dumper (scripts/dump_branch_hints.py)
- PMU workflow docs (docs/profiling-pmu.md)
- snmalloc-tools Rust crate (CLI joiner over perf record / perf c2c / perf script)
Phase 11: Overhead reduction + polish (PRs #54–66)
- Tiered stats: SNMALLOC_STATS_BASIC (≤ 2% overhead target) + SNMALLOC_STATS_FULL (≤ 20% target)
- Batched counter updates at small_refill (Phase 11.8 / 11.9 / 11.12)
- Cache-line padded backend atomics (Phase 11.10)
- Symbolicate-aware HotSpotKey::CallSite filter
- Vendor dump_branch_hints.py into snmalloc-sys/upstream/
- Largebuddy free-chunk histogram into FullAllocStats.reserved[0..16]
- Final bench (Apple M4 Pro): BASIC ≤ 1.02 on small_allocs / medium_allocs / mixed; FULL ≤ 1.20 on all

Final overhead summary

5-run mean ratios from snmalloc-rs/benches/stats_bench.rs and snmalloc-rs/benches/profile_bench.rs on Apple M4 Pro, release + fat-LTO:

| Mode | small_allocs | medium_allocs | mixed |
| --- | --- | --- | --- |
| `SNMALLOC_PROFILE=ON` (idle) | 1.0036 | 0.9998 | 0.9925 |
| `SNMALLOC_PROFILE=ON` (active, 512 KiB sample) | 0.9983 | 0.9990 | 1.0026 |
| `SNMALLOC_STATS_BASIC=ON` | ~1.00 | 0.99 | ~1.00 |
| `SNMALLOC_STATS_FULL=ON` | 1.164 | 1.094 | 1.091 |

All within target. Full numbers + methodology in docs/heap-profiling-benchmarks.md.

ABI

FullAllocStats C struct uses a SNMALLOC_FULL_STATS_VERSION field (currently 2) + reserved[64] for forward-compat. Wave-2 fields stay zero when their build flag is off; existing fields remain populated. The legacy SNMALLOC_STATS=ON flag is preserved as an alias for SNMALLOC_STATS_BASIC.

Test coverage

Full local sweep on Apple M4 Pro (CI minutes exhausted on fork; re-running here is gated by maintainer):

C++ ctest: 104/104 PASS (no long/stress jobs)
cargo test (no features): PASS
cargo test --features stats-basic: PASS
cargo test --features stats-full: PASS
cargo test --features profiling: PASS
cargo test --features profiling,symbolicate: PASS
cargo test --workspace (incl. snmalloc-tools): PASS

Review chunking suggestion

If the maintainer would prefer phase-by-phase landing, the squash commits listed in the commit history map 1:1 to fork PRs. Phase 2 and Phase 3 are the entry-points (everything else depends on the C++ sampling + hook infrastructure they introduce). After those land upstream, the remaining phases can land as independent PRs cherry-picked from this branch.

Commit list

(65 squashed commits — see the PR's commit tab for the chronological log; each commit corresponds to one fork-side PR.)

Introduces a per-slab client-meta provider that costs exactly one pointer of inline metadata (sizeof(void*)) regardless of the slab's object count. The backing T[] array is lazily materialised on the first get() call and published via a double-checked compare-and-swap against an inline stl::Atomic<T*>; concurrent first-touches resolve without a lock and the losing thread decommits its temporary mapping with PAL::notify_not_using. The lazy install path goes directly to DefaultPal (reserve + notify_using<YesZero>) so it cannot recurse into user malloc, and the per-slab overhead when never queried is one nullptr, which suits sampled heap-profiling metadata that only a small fraction of slabs ever touch.
The primitive is purely additive: it is not yet wired into any Config and no SNMALLOC_PROFILE gating is introduced (Phase 3 concerns). Existing NoClientMetaDataProvider / ArrayClientMetaDataProvider, their call sites in FrontendSlabMetadata::get_meta_for_object, and the global Config selection are unchanged. Wiring this provider up will require threading the per-slab object count from the pagemap MetaEntry through get_meta_for_object to the new get(StorageType*, size_t, size_t) overload.
ClickUp: 86ahrfwmq
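The lazy-install protocol described in this commit can be sketched in a few lines. This is a minimal illustration, not the C++ provider: `LazySlots` and its methods are hypothetical names, and a heap-allocated boxed slice stands in for the PAL reserve/commit path (the real code deliberately avoids the heap so it cannot recurse into malloc).

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical stand-in for the per-slab provider: one inline pointer,
/// with the backing array created only on first access.
pub struct LazySlots {
    array: AtomicPtr<u64>,
    len: usize,
}

impl LazySlots {
    pub const fn new(len: usize) -> Self {
        Self { array: AtomicPtr::new(ptr::null_mut()), len }
    }

    /// True once some thread has published the backing array.
    pub fn is_materialised(&self) -> bool {
        !self.array.load(Ordering::Acquire).is_null()
    }

    /// Returns the zero-initialised backing array, installing it if needed.
    pub fn get(&self) -> &[u64] {
        let mut p = self.array.load(Ordering::Acquire);
        if p.is_null() {
            // Build a candidate array, then try to publish it with one CAS.
            let fresh = Box::into_raw(vec![0u64; self.len].into_boxed_slice()) as *mut u64;
            match self.array.compare_exchange(
                ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => p = fresh,
                Err(winner) => {
                    // Lost the race: discard our candidate and adopt the winner's.
                    unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(fresh, self.len))) };
                    p = winner;
                }
            }
        }
        unsafe { std::slice::from_raw_parts(p, self.len) }
    }
}

impl Drop for LazySlots {
    fn drop(&mut self) {
        let p = *self.array.get_mut();
        if !p.is_null() {
            unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(p, self.len))) };
        }
    }
}
```

A slab that is never queried keeps the null pointer for its whole lifetime, which is the "one nullptr" cost the commit message refers to.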

Introduces the StackWalker abstraction described in .claude/research/heap-profiling/stack-walker.md as a new PAL header (pal_stack_walker.h, included from pal/pal.h). This is the first concrete piece of Phase 2.1 of the heap-profiling milestone (ClickUp 86ahzwhq5).
Walker capabilities:
- FramePointerWalker: pure dependent-load loop with per-frame validation (alignment, strict-monotonic FP, stack-range, sentinel null-FP). Reads fp[0] (saved FP) and fp[1] (saved LR) from canonical aarch64/x86_64 frame headers. On aarch64, unconditionally strips Pointer-Authentication Code bits from the saved LR via ptrauth_strip on Apple and xpaclri (HINT #7) elsewhere -- both decode to a NOP on cores without FEAT_PAuth, so cost is zero on non-PAC hardware.
- POD thread_local stack-bounds cache populated lazily via pthread_get_stackaddr_np on macOS and pthread_getattr_np on Linux. Zero-initialised; no constructor, no __cxa_thread_atexit, no malloc on first access -- the only construction pattern provably reentrancy-safe from inside an allocator's sample path.
- NullStackWalker fallback for unsupported targets (Windows, FreeBSD, OpenEnclave, CHERI/Morello, non-x86_64/aarch64). Returns 0 frames.
- Async-signal-safe: no malloc, no locks, no syscalls, no TLS construction. Graceful degradation on broken FP chains.
- Selection at compile time via preprocessor macros. No CMake option in this commit (deferred -- see "what's NOT done" below).
- A free function snmalloc::profile::stack_walk() wraps the default walker for callers that don't need to pick one explicitly.
Supported arches: x86_64 + aarch64 on Linux + macOS.
Microbenchmark (src/test/perf/stack_walker_bench/):
- Recursive call-chain builder with NOINLINE + tail-call-prevention asm-barriers. Sweeps depths 2/4/8/16/32, takes min of 5 repeats per depth, reports total ns / ns-per-iter / ns-per-frame and a two-point slope estimate.
- Auto-discovered by the existing perf harness; added to TESTLIB_ONLY_TESTS so it shares an object library across fast/check flavours.
- Asserts ns/frame < 50 (5x headroom over the ~10 ns/frame design target). Skipped under --smoke and Debug builds.
- Measured on Apple Silicon M-series: ~0.5-1.0 ns/frame steady state (deepest depth 35 captured frames, total ~21 ms / 1M iterations = 20.6 ns/iter, slope 0.53 ns/frame). Well under the design target.
What is NOT done in this commit:
- The walker is NOT wired into any allocator path. No SNMALLOC_PROFILE gating exists yet; that lives in Phase 3.
- The matching CMake plumbing -- a SNMALLOC_PROFILE_STACK_WALKER option (fp / null / auto) and -fno-omit-frame-pointer injection for snmalloc TUs -- is left for a follow-up. The header today is controlled by SNMALLOC_PROFILE_STACK_WALKER_FP / SNMALLOC_PROFILE_STACK_WALKER_NULL preprocessor overrides plus an arch/OS auto-detection default.
- Stack-capture-at-sample-hit (ClickUp 86ahzwhq5's sibling 86ahzwhmh) is NOT included; it requires the Sampler from Phase 2.2.
Files:
- src/snmalloc/pal/pal_stack_walker.h (new, header-only)
- src/snmalloc/pal/pal.h (one #include line)
- src/test/perf/stack_walker_bench/stack_walker_bench.cc (new)
- CMakeLists.txt (one-word addition to TESTLIB_ONLY_TESTS)
ClickUp: 86ahzwhq5
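The per-frame validation loop can be illustrated on a simulated stack. The sketch below is an assumption-laden model rather than the header's code: `walk_frames` is a hypothetical name, the stack is a plain slice of words whose first element lives at address `base`, alignment is checked at 8 bytes for simplicity, and PAC stripping is not modelled.

```rust
/// Walks a chain of (saved_fp, return_address) frame headers stored in `stack`,
/// a simulated stack whose first word lives at address `base`.
/// Returns the number of return addresses written to `out`.
pub fn walk_frames(stack: &[u64], base: u64, mut fp: u64, out: &mut [u64]) -> usize {
    let end = base + 8 * stack.len() as u64;
    let mut depth = 0;
    while depth < out.len() {
        // Sentinel: a null frame pointer ends the chain.
        if fp == 0 {
            break;
        }
        // Alignment and stack-range checks; the header occupies two words.
        let in_range = fp >= base && fp.checked_add(16).map_or(false, |e| e <= end);
        if fp % 8 != 0 || !in_range {
            break;
        }
        let idx = ((fp - base) / 8) as usize;
        let (saved_fp, ret) = (stack[idx], stack[idx + 1]);
        out[depth] = ret;
        depth += 1;
        // Strict monotonicity: callers live at higher addresses, so a saved FP
        // that does not increase indicates a corrupt or cyclic chain.
        if saved_fp != 0 && saved_fp <= fp {
            break;
        }
        fp = saved_fp;
    }
    depth
}
```

A broken chain simply truncates the capture instead of faulting, which is the graceful degradation the commit message describes.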

#4) Pure infrastructure for the heap-profiling milestone. Adds the per-thread Poisson sampler, the SampledAlloc record + pre-allocated lock-free node pool, the global lock-free intrusive list of currently-sampled allocations, and the per-thread re-entrancy guard. Wires the FramePointerWalker from Phase 2.1 into the sampler so a sample fire captures a stack at the allocation site.
Purely additive: nothing is plumbed into snmalloc::alloc() / dealloc() in this commit, no SNMALLOC_PROFILE gating yet (that is Phase 3 work), and existing allocator behaviour is unchanged. All new code lives in src/snmalloc/profile/, kept separate from src/snmalloc/pal/ because the profiler is policy rather than platform abstraction.
Components:
- Sampler (sampler.h)
  Per-thread Poisson sampler. Fast path is one int64_t subtract + one signed-compare branch (~3-4 cycles). Slow path draws Exp(rate) via libm log on a doubles-in-(0,1] conversion of the xoshiro256** output; computes weight as `rate - bytes_until_sample + requested_size` (tcmalloc convention, bytes-of-request); acquires a node from the global NodePool; captures a stack via FramePointerWalker (skip=1); publishes on the global SampledList. First-sample bootstrap draws the initial countdown from Exp(rate) so the very first sample is unbiased -- the single most commonly-mishandled detail in DIY samplers.
- SampledAlloc (sampled_alloc.h)
  Cache-line aligned record holding alloc address, requested + allocated sizes, weight, the sampling interval that was in force at capture time (so a later set_sampling_rate doesn't mis-weight already-captured samples), tid, monotonic alloc_seq, captured stack frames, and an atomic NodeState. Stack depth knob defaults to 32 frames (SNMALLOC_PROFILE_STACK_FRAMES).
- NodePool (node_pool.h)
  Fixed-capacity lock-free Treiber stack of SampledAlloc nodes with a 32-bit ABA tag packed into the high half of a 64-bit head word. Backing storage allocated directly via mmap / VirtualAlloc -- the profiler must never re-enter snmalloc's own allocator. acquire() returns nullptr and bumps a drop counter on exhaustion; callers silently skip the sample.
- SampledList (sampled_list.h)
  Lock-free intrusive singly-linked list. Tombstone bit packed into the low bit of `SampledAlloc::next` so liveness and link come from a single acquire-load. remove() is a CAS on the tombstone bit (linearisation point) followed by a best-effort linear unlink; lost unlink races leave the node as a tombstoned skip until the next walk reaps it. Cross-thread remove works because no thread ownership is implied -- whichever thread does the dealloc does the remove. No reclamation needed: node memory is owned by NodePool, not the list.
- ReentrancyGuard (reentrancy_guard.h)
  POD `thread_local uint8_t` (lives in .tbss, zero-initialised by the loader, no first-touch malloc, no __cxa_thread_atexit registration). RAII guard sets the flag on the sampler slow path so any transitive allocator call (e.g. glibc backtrace() lazy thread-cache init, or NodePool's first-call mmap) short-circuits via the fast-path `sampler_reentered()` check. Same pattern as pal_stack_walker.h's stack-bounds cache.
Test (src/test/func/profile_sampler/profile_sampler.cc):
* NodePool basic: exhaustion, drop counter, alloc_seq monotonicity, full release+reacquire round-trip.
* Reentrancy guard: TLS flag toggle + record_alloc short-circuit under an active guard.
* SampledList single-threaded push/remove/snapshot + double-remove is a no-op + drain.
* SampledList concurrent push (4 threads x 512 allocs) -- all 2048 nodes observed.
* SampledList concurrent push + cross-thread remove (4 threads pushing, 4 different threads removing the other thread's nodes) -- list ends up empty.
* Sampler first-sample bootstrap (100k fresh Samplers, each does one record_alloc(64) at T=4096) -- observed hit count 5-sigma window catches both the "all-zero" bug (deterministic bootstrap) and the "auto-sample-first" bug.
* Sampler distribution (4M record_alloc(64) at T=512KiB) -- observed sample count and summed weight both within statistical tolerance of the analytic expectation.
* Rate change (3M allocs at T=64KiB then 3M at T=256KiB) -- weight sums correct for both phases, hits inversely proportional to rate.
* End-to-end: Sampler::record_alloc fires, captured node is reachable via SamplerGlobals::list().snapshot() with non-zero stack_depth.
Tickets:
  86ahrfw19 (Sampler)
  86ahrfw3f (SampledAlloc + NodePool)
  86ahrfw44 (SampledList)
  86ahrfw58 (ReentrancyGuard)
  86ahrfw78 (unit tests)
  86ahzwhmh (stack capture wiring)
  86ahzwhtq (weight contract)
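The countdown mechanism at the heart of the Sampler can be sketched as follows. This is an illustration under stated assumptions, not the sampler.h implementation: the struct and method names are hypothetical, xorshift64* is swapped in for xoshiro256** to keep the block short, weight computation and stack capture are omitted, and a rate of 0 (sampling disabled) is not modelled.

```rust
/// Minimal per-thread byte-countdown sampler (hypothetical names).
pub struct Sampler {
    rate: f64,               // mean number of bytes between samples
    bytes_until_sample: i64, // countdown decremented on every allocation
    rng_state: u64,
}

impl Sampler {
    pub fn new(rate: u64, seed: u64) -> Self {
        let mut s = Sampler { rate: rate as f64, bytes_until_sample: 0, rng_state: seed | 1 };
        // Bootstrap: the first countdown is itself drawn from Exp(rate),
        // so the first allocation is neither always nor never sampled.
        s.bytes_until_sample = s.draw_interval();
        s
    }

    /// xorshift64* generator, standing in for xoshiro256**.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.rng_state = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Draws an exponentially distributed interval with mean `rate`.
    fn draw_interval(&mut self) -> i64 {
        // Top 53 bits mapped to a double in (0, 1], so ln() never sees zero.
        let u = ((self.next_u64() >> 11) + 1) as f64 / (1u64 << 53) as f64;
        (-u.ln() * self.rate) as i64 + 1
    }

    /// Fast path: one subtract and one signed compare.
    /// Returns true when this allocation should be sampled.
    pub fn record_alloc(&mut self, size: usize) -> bool {
        self.bytes_until_sample -= size as i64;
        if self.bytes_until_sample > 0 {
            return false;
        }
        // Slow path: re-arm the countdown and report a sample hit.
        self.bytes_until_sample = self.draw_interval();
        true
    }
}
```

With a rate of 4096 and 64-byte requests, roughly one allocation in 64 takes the slow path; every other call costs only the subtract and the branch.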

- Add `option(SNMALLOC_PROFILE ...)` (default OFF) in CMakeLists.txt alongside SNMALLOC_COVERAGE.
- Add `add_as_define(SNMALLOC_PROFILE)` next to SNMALLOC_TRACING so the flag is plumbed through as a pure compile-time define on the snmalloc INTERFACE target. No source code reads it yet; alloc/dealloc hooks land in Phase 3.3.
- Add three CI matrix entries that mirror the existing "Traced Build" shape (build-only, reusable-cmake-build.yml, Release):
  * ubuntu-24.04 / gcc / -DSNMALLOC_PROFILE=ON
  * ubuntu-24.04 / clang / -DSNMALLOC_PROFILE=ON
  * macos-15 / clang / -DSNMALLOC_PROFILE=ON
Verified locally on macOS arm64: configure + full build + all 86 ctest targets pass with -DSNMALLOC_PROFILE=ON, and the default (OFF) build is byte-identical with respect to the new flag (define absent).

- New snmalloc::profile::record_dealloc<Config>(void*) free function in src/snmalloc/profile/record.h. Compiles to a no-op for configs whose ClientMeta is not LazyArrayClientMetaDataProvider<SampledAlloc-slot>, so the default snmalloc::Config sees zero cost.
- record_dealloc body splits into find_profile_slot (Config-specific pagemap walk) and clear_profile_slot (Config-agnostic atomic-CAS + SampledList::remove + NodePool::release), with the latter callable directly from tests.
- H1 hook installed at the dealloc waist in Allocator::dealloc(void*) (mem/corealloc.h:1025), gated by SNMALLOC_PROFILE. Fires before any existing dealloc logic so profile-side cleanup observes the live pagemap, and is itself safe under recursive entry via the per-thread ReentrancyGuard.
- record.h is intentionally lightweight; including commonconfig.h there would create a cycle (commonconfig -> mem/mem -> corealloc -> record). Instead corealloc.h forward-declares the template, and backend_helpers/backend_helpers.h pulls the full definition in once LazyArrayClientMetaDataProvider is visible.
- record_alloc stays a stub: full alloc-side wiring lands in Phase 3.3.
- New test src/test/func/profile_record/profile_record.cc covers the null-slot no-op, populated-slot drain, multi-threaded double-free CAS race, default-config compile-time no-op, ReentrancyGuard short-circuit and end-to-end libc::malloc/libc::free crash-freedom.
- Default (OFF) build remains byte-identical to pre-Phase-3.1: the H1 call site is behind #ifdef SNMALLOC_PROFILE, and SNMALLOC_PROFILE=ON with the default NoClientMetaDataProvider Config inlines the if-constexpr branch into nothing (verified: same binary size for the default-config test executable in OFF vs ON builds).
- All existing tests pass under both -DSNMALLOC_PROFILE=OFF (88/88) and -DSNMALLOC_PROFILE=ON (88/88), -fast and -check variants.

- Add SNMALLOC_PROFILE-gated record_dealloc<Config>(msg) hook in Allocator::handle_dealloc_remote, just before the splice via dealloc_local_objects_fast on the destination thread. Catches the remote-ingest fast path -- the milestone-flagged critical free path for cross-thread frees.
- Reuses the Phase 3.1 record_dealloc / clear_profile_slot machinery unchanged; the atomic CAS in clear_profile_slot keeps H1 + H2 idempotent w.r.t. the same pointer.
- Header surface unchanged; the SNMALLOC_PROFILE off build is byte-identical to pre-Phase-3.2.
- New func test profile_remote_dealloc covers: single-threaded baseline, H1/H2 sequential clear idempotence, a 4 producer + 4 consumer cross-thread alloc/free stress test, and the default-config compile-time no-op contract.
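The idempotence argument rests on a single compare-and-swap: whichever hook clears the slot first gets the node, and every later caller sees an empty slot. A minimal sketch of that idea, using a hypothetical `ProfileSlot` type where an integer stands in for the node pointer:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical per-object profile slot: 0 means "not sampled",
/// any other value stands in for a pointer to the sample node.
pub struct ProfileSlot(AtomicUsize);

impl ProfileSlot {
    pub fn new(node: usize) -> Self {
        ProfileSlot(AtomicUsize::new(node))
    }

    /// Clears the slot. Only the caller whose CAS succeeds receives the node
    /// and is therefore responsible for unlinking and releasing it.
    pub fn clear(&self) -> Option<usize> {
        let mut cur = self.0.load(Ordering::Acquire);
        while cur != 0 {
            match self.0.compare_exchange_weak(cur, 0, Ordering::AcqRel, Ordering::Acquire) {
                Ok(node) => return Some(node),
                Err(seen) => cur = seen, // raced or spurious failure: re-check
            }
        }
        None
    }
}
```

Because a second `clear` on the same slot returns `None`, running several dealloc hooks over the same pointer cannot release a node twice.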

- Hook the user-facing snmalloc::alloc(size_t), alloc<size>(), alloc(smallsizeclass_t), and alloc_aligned wrappers in global/globalalloc.h with a profile::record_alloc<Config>(...) call gated on #ifdef SNMALLOC_PROFILE. One hook per wrapper covers all public alloc entry points -- malloc/calloc/realloc, operator new, jemalloc/Rust shims, BSD valloc/pvalloc, NetBSD reallocarr -- since they all funnel through these chokepoints.
- Wire the record_alloc body in profile/record.h: tick the per-thread Sampler (which already publishes the SampledAlloc on the global list), then install the node into the per-object profile slot via a new find_or_install_profile_slot<Config>(p) helper that forces the lazy backing array into existence on first sight. Compile-time no-op when the config does not carry the lazy ProfileSlot provider.
- Add src/test/func/profile_e2e/profile_e2e.cc: an end-to-end test that defines its own profile-enabled Config via SNMALLOC_PROVIDE_OWN_CONFIG and exercises the full alloc + free pipeline. Covers single-threaded rate accuracy, multi-threaded drain-to-empty, mixed entry-point coverage (malloc / calloc / aligned_alloc), and the rate=0 sampling-disabled fast path.
Default-Config build is byte-identical to Phase 3.2: every new code path is gated on either #ifdef SNMALLOC_PROFILE or config_has_profile_slot_v, so OFF builds and default-Config ON builds see no behaviour change.

- Install H3 heap-profile hook in Allocator::dealloc_remote on the SecondaryAllocator branch (catches GWP-ASan / non-snmalloc pointers that bypass the snmalloc-owned pagemap).
- Install H4 heap-profile hook in Allocator::dealloc_remote_slow's lazy-init recursion lambda, immediately before the recursive a->dealloc(p). Pairs with H1 to keep the recursion-guard tight.
- Both hooks live entirely under #ifdef SNMALLOC_PROFILE; default Config OFF build is byte-identical to Phase 3.3.
- Both hooks reuse profile::record_dealloc<Config>; idempotence is guaranteed by the CAS in clear_profile_slot and the per-thread ReentrancyGuard. No new state machines, no new allocations on the free path.
- New test: src/test/func/profile_h3_h4/profile_h3_h4.cc. Triple- and quadruple-clear idempotence, nullptr robustness, fresh-thread remote-free stress, default-Config compile-time no-op.
- New test: src/test/func/profile_integration/profile_integration.cc. 16 threads x 100k allocs x varied size ladder, ~50/50 same-thread vs cross-thread free, plus a one-producer-many-consumers stress. Asserts sample count within 6 sigma of Poisson expectation, post-free leak <= documented tolerance (<= 1% + 4), and that the global SampledList drains to zero. Sampling rate (128 KiB) sized so expected samples stay well below the NodePool capacity ceiling.
- Wires ticket 86ahrfx9g (multi-threaded alloc + cross-thread dealloc integration stress).
- Observed teardown-straggler ratio improves from ~1/1250 in the Phase 3.3 8-thread e2e test to ~1/4000 in the new 16-thread integration test, a ~3x reduction.

- Expose `sn_rust_profile_*` C ABI surface in src/snmalloc/override/rust.cc: supported, set_sampling_rate, get_sampling_rate, snapshot_begin, snapshot_count, snapshot_get, snapshot_end. New header src/snmalloc/override/rust_profile.h defines SnRustProfileRawSample (alloc_ptr, requested_size, allocated_size, weight, stack_depth, stack) with SNMALLOC_PROFILE_STACK_FRAMES matching the Phase 2 sampled_alloc.h constant.
- When SNMALLOC_PROFILE=OFF every export except `supported` is a stub returning zero / nullptr / false. Symbols are always linkable so the Rust crate's FFI does not need #[cfg] gating in extern blocks.
- When SNMALLOC_PROFILE=ON the bodies delegate to existing Phase 2 / 3 machinery (Sampler::{set,get}_sampling_rate, SampledList::snapshot, SampledList::debug_count). No new C++ infrastructure introduced.
- Add `profiling` cargo feature to snmalloc-sys and the higher-level snmalloc-rs crate. The feature passes SNMALLOC_PROFILE=ON to cmake (or SNMALLOC_PROFILE=1 to the cc backend) and exposes SnRustProfileRawSample plus the sn_rust_profile_* extern declarations in snmalloc-sys/src/lib.rs.
- Cover the FFI surface with a small Rust smoke-test module (#[cfg(feature = "profiling")]) that exercises supported(), the sampling-rate roundtrip, and the snapshot lifecycle.
- No Rust-side safe wrapper yet -- that is Phase 4.1.
Verified:
- ctest --test-dir build (SNMALLOC_PROFILE=OFF): 96/96 passed.
- ctest --test-dir build-profile (SNMALLOC_PROFILE=ON): 96/96 passed.
- cargo test --all (no profiling feature): 12 passed across all crates.
- cargo test --all --features profiling: 15 passed across all crates (4 baseline snmalloc-sys + 3 new profile tests + everything else).

….1) (#11)
- New snmalloc-rs/src/profile.rs: idiomatic safe wrapper over the sn_rust_profile_* FFI surface from Phase 4.0.
- HeapProfile: owned, cloneable snapshot of live sampled allocations with len/is_empty/samples accessors plus u128 total_allocated_bytes and total_requested_bytes aggregators (saturating math, divide-by-zero-safe).
- BtSample: per-allocation record with alloc_ptr, requested_size, allocated_size, weight, and Vec<*const u8> stack frames. Send + Sync via unsafe impls (raw pointers used opaquely, never deref'd).
- SnMalloc::snapshot / set_sampling_rate / sampling_rate / profiling_supported: thin methods on the existing global allocator type. snapshot() uses an internal RawSnapshotGuard whose Drop releases the FFI handle even on panic mid-collection.
- snmalloc-sys/src/lib.rs: drop the #[cfg(feature = "profiling")] gate on the SnRustProfileRawSample struct and the sn_rust_profile_* extern block. The C symbols are unconditional stubs when SNMALLOC_PROFILE is off, so the Rust bindings should be too -- this lets the safe wrapper present a uniform API in both feature-on and feature-off builds (empty profile, sampling_rate fixed at 0, profiling_supported() returns false).
- snmalloc-rs/src/lib.rs: expose the new profile module + re-export HeapProfile / BtSample.
- snmalloc-rs/tests/profile_snapshot.rs: integration tests covering feature-off quiescence (snapshot empty, rate fixed at 0, supported() == false), the sampling-rate round-trip when supported, and a #[ignore]'d live-sampling end-to-end test.
- The live-sampling test is ignored because the rust.cc shim is built with the default snmalloc::Config (NoClientMetaDataProvider), which makes config_has_profile_slot_v false and the alloc hook a compile-time no-op. Wiring the Rust shim to use LazyArrayClientMetaDataProvider<ProfileSlot> is Phase 4.2 -- the Phase 4.1 ticket explicitly forbids modifying rust.cc / rust_profile.h. See the ignore reason on live_sampling_run for the full path.
- All 12 snmalloc-rs unit tests, 4 (+ 1 ignored) integration tests, 4 snmalloc-sys rust_tests, and the lib doc test pass with both feature off and feature on. All 74 C++ ctest cases continue to pass in both SNMALLOC_PROFILE=ON and OFF build dirs.
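The guard pattern mentioned for snapshot() can be sketched without any FFI. In the block below, `SnapshotGuard`, `collect`, and the `open` counter are hypothetical stand-ins: the counter models "handles begun but not yet ended" so the release-on-panic behaviour is observable.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for an FFI snapshot handle. `open` counts handles
/// that have been begun but not yet ended.
pub struct SnapshotGuard<'a> {
    open: &'a AtomicU32,
}

impl<'a> SnapshotGuard<'a> {
    pub fn begin(open: &'a AtomicU32) -> Self {
        open.fetch_add(1, Ordering::SeqCst); // stands in for snapshot_begin
        SnapshotGuard { open }
    }
}

impl Drop for SnapshotGuard<'_> {
    fn drop(&mut self) {
        self.open.fetch_sub(1, Ordering::SeqCst); // stands in for snapshot_end
    }
}

/// Copies samples out while the guard is alive; `fail` simulates a panic
/// in the middle of collection.
pub fn collect(open: &AtomicU32, fail: bool) -> Vec<u32> {
    let _guard = SnapshotGuard::begin(open);
    if fail {
        panic!("simulated failure during collection");
    }
    vec![1, 2, 3]
}
```

Since `Drop` runs during unwinding, the handle count returns to zero on both the normal and the panicking path.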

…test (Phase 4.2) (#12)
- src/snmalloc/override/rust.cc: when SNMALLOC_PROFILE is defined, predeclare snmalloc::Config as StandardConfigClientMeta<LazyArrayClientMetaDataProvider<std::atomic<profile::SampledAlloc*>>> and define SNMALLOC_PROVIDE_OWN_CONFIG before the snmalloc.h / malloc.cc includes. This flips config_has_profile_slot_v<Config> to true so the alloc/dealloc hooks in profile/record.h emit real samples on the rust shim's allocation paths. When SNMALLOC_PROFILE is undefined the file is byte-identical to its pre-Phase-4.2 form.
- snmalloc-rs/tests/profile_snapshot.rs: drop the Phase-4.2 #[ignore] on live_sampling_run; the test now exercises the full pipeline, asserts the live snapshot count lies within a 6-sigma Poisson envelope of the expected sample count, and verifies the snapshot drains after every allocation is freed. Header comment updated to match the new wiring.
- Verified: C++ 96/96 ctest pass with SNMALLOC_PROFILE=OFF; 96/96 pass with SNMALLOC_PROFILE=ON. Rust 12+1+5+4 tests pass with the profiling feature off; with the feature on the same suite plus three snmalloc-sys profile tests (totalling 12+1+5+7) pass and live_sampling_run observes ~1574 samples (expected ~1562, +/-6 sigma window [~1325, ~1800]) and drains to 0 post-free.
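The quoted window follows from the Poisson property that the variance equals the mean, so the standard deviation is the square root of the expected count. A small helper makes the arithmetic explicit; `poisson_envelope` is a hypothetical name, and the expected count is taken as given from the message above.

```rust
/// Returns the [low, high] window of `k` standard deviations around a
/// Poisson mean `lambda`, using sigma = sqrt(lambda).
pub fn poisson_envelope(lambda: f64, k: f64) -> (f64, f64) {
    let sigma = lambda.sqrt();
    ((lambda - k * sigma).max(0.0), lambda + k * sigma)
}

/// True when an observed sample count falls inside the envelope.
pub fn within_envelope(observed: u64, lambda: f64, k: f64) -> bool {
    let (low, high) = poisson_envelope(lambda, k);
    let x = observed as f64;
    x >= low && x <= high
}
```

For an expected count of about 1562, sigma is about 39.5, so six sigma is about 237 and the window is roughly 1325 to 1800; the observed ~1574 sits comfortably inside it.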

…e 4.3) (#13)
- snmalloc-rs/src/profile.rs: new Weight enum (Requested / Allocated; Default = Allocated, matching the default UI view documented in profile-weight.md) and HeapProfile::write_flamegraph / write_flamegraph_with methods. Output is Brendan Gregg's collapsed / folded-stack format: one line per unique stack as "<frame_root>;<frame_mid>;<frame_leaf> <weight>", root-first, each frame rendered as a zero-padded 16-hex code pointer (0x000000...). Identical stacks collapse into a single line with summed weights via a BTreeMap keyed on the pre-rendered hex form, which gives deterministic lex-ordered output for golden tests and version-control diffs. No new dependencies -- uses std::io::Write only (gated by extern crate std on this no_std crate).
- snmalloc-rs/src/lib.rs: re-export the new Weight enum alongside HeapProfile / BtSample.
- snmalloc-rs/tests/profile_accuracy.rs: new integration suite.
  * accuracy_single_threaded -- 100_000 x 64B allocations at rate 4096 must yield a sample count inside a 6-sigma Poisson envelope of lambda = 1562.5, and sum(weight) must match 6.4 MB (6,400,000 bytes) to within 5%.
  * accuracy_multi_threaded -- 8 threads x 10_000 x 64B at the same rate; expected ~1250 samples +/- 6 sigma. Documents the known O(1/N) per-thread teardown straggler from Phase 3.4 inline.
  * flamegraph_correctness_over_live_snapshot -- captures a snapshot with >= 100 samples, calls write_flamegraph into a Vec<u8>, parses every line as "<hex-stack> <weight>", asserts each frame is "0x" + 16 hex digits, asserts no stack appears twice (the collapse step worked), and asserts the sum of folded weights equals HeapProfile::total_allocated_bytes under the default projection. A second pass with Weight::Requested verifies the explicit projection matches total_requested_bytes.
  * flamegraph_empty_snapshot_writes_nothing -- the no-op-safe contract for the profiling-feature-off build.
  All four tests acquire a process-wide accuracy_lock() so they do not race against each other for the global sampler state when cargo runs them in parallel, and each subtracts a baseline snapshot taken with sampling momentarily disabled so any leftover samples from sibling tests in the same binary do not perturb the Poisson assertions. Tests are no-op on the profiling-feature-off build.
- Speedscope JSON export deferred to Phase 4.5+: speedscope already imports the folded format directly, and a faithful JSON profile schema is better layered on top of the symbolicator that lands in 4.5. Documented in the write_flamegraph rustdoc.
Verified:
- ctest --test-dir build (SNMALLOC_PROFILE=OFF): 96/96 passed.
- ctest --test-dir build-profile (SNMALLOC_PROFILE=ON): 96/96 passed.
- cargo test --all (no profiling feature): all crates green, 4 profile_accuracy tests no-op pass, profile.rs unit tests including 6 new flamegraph + Weight tests pass.
- cargo test --all --features profiling: all crates green, all 4 profile_accuracy tests pass with live sampling.
- cargo doc --features profiling --no-deps: clean build, all new rustdoc renders.
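The collapse step described above is easy to show in isolation. The sketch below is not the crate's write_flamegraph: `write_folded` is a hypothetical function, the input shape (a frame list plus a weight per sample) is assumed, and frames are assumed to be stored leaf-first so they are reversed when rendered.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};

/// Writes one "<root>;...;<leaf> <weight>" line per unique stack.
/// `samples` pairs a leaf-first frame list with a weight (assumed input shape).
pub fn write_folded<W: Write>(samples: &[(Vec<usize>, u64)], out: &mut W) -> io::Result<()> {
    let mut folded: BTreeMap<String, u64> = BTreeMap::new();
    for (frames, weight) in samples {
        if frames.is_empty() {
            continue; // a stack-less sample has no valid folded representation
        }
        // Render root-first, each frame as "0x" followed by 16 hex digits.
        let key = frames
            .iter()
            .rev()
            .map(|f| format!("{:#018x}", f))
            .collect::<Vec<_>>()
            .join(";");
        // Identical stacks share a key, so their weights are summed.
        let total = folded.entry(key).or_insert(0);
        *total = total.saturating_add(*weight);
    }
    // BTreeMap iteration is ordered by key, which makes the output deterministic.
    for (stack, weight) in &folded {
        writeln!(out, "{} {}", stack, weight)?;
    }
    Ok(())
}
```

Writing into a `Vec<u8>` works as in the tests described above, since `Vec<u8>` implements `Write`.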

- Add optional `symbolicate` Cargo feature that pulls in the `backtrace` crate as a dependency only when enabled.
- Add `ResolvedFrame { address, name, file, line }` for the per-frame metadata returned by the symbolicator.
- Add `HeapProfile::symbolize()` returning `HashMap<*const u8, ResolvedFrame>` keyed by raw frame addresses. Each unique frame is resolved once via `backtrace::resolve`.
- Add `HeapProfile::write_flamegraph_symbolized()` that renders the same folded-stack format as `write_flamegraph` but substitutes resolved function names for hex code pointers, falling back to the hex rendering when a frame has no resolved name. `;` and space in resolved names are sanitised to `_` so the folded format stays unambiguous.
- Sum of weights from `write_flamegraph_symbolized` equals `total_allocated_bytes`, matching `write_flamegraph` under the documented default projection.
- Unit tests: smoke-test symbol resolution via a `#[inline(never)]` probe that captures its own backtrace, plus empty-profile, unresolved-frame, and hex-fallback contracts.
- Integration test (`tests/profile_symbolize.rs`): collect a live snapshot at the same rate/workload as `profile_accuracy`, verify >=50% of unique frames resolve to a non-None name, and verify `write_flamegraph_symbolized` parses cleanly, has no duplicate stacks, and preserves total weight.
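The sanitisation and fallback rule for a single frame can be sketched as a pure function. `render_frame` is a hypothetical helper, and treating an empty name like a missing one is an assumption made for this example rather than documented behaviour.

```rust
/// Renders one frame for the folded format: the resolved name with ';' and ' '
/// replaced by '_', or the hex code pointer when no usable name is available.
pub fn render_frame(address: usize, name: Option<&str>) -> String {
    match name {
        // Assumption: an empty name is treated the same as an unresolved frame.
        Some(n) if !n.is_empty() => n
            .chars()
            .map(|c| if c == ';' || c == ' ' { '_' } else { c })
            .collect(),
        _ => format!("{:#018x}", address),
    }
}
```

Replacing both characters matters because `;` separates frames and a space separates the stack from its weight, so either one inside a symbol name (generic arguments often contain spaces) would corrupt the line.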

- Add snmalloc-rs/src/config.rs introducing ProfileConfig (a typed, Default-impled struct of sampling_rate + enable_from_env) along with SnMalloc::configure_profiling and SnMalloc::init_profiling_from_env so callers don't have to wire set_sampling_rate by hand after installing the global allocator.
- Honour SNMALLOC_PROFILE_RATE (parseable integer wins, including 0) and SNMALLOC_PROFILE_ENABLE (truthy aliases 1/true/yes, case-insensitive, whitespace trimmed) when init_profiling_from_env is called; the resolver is read-only, panic-free, and a no-op when neither var is set. Default rate when ENABLE=1 with no RATE is 524288 bytes (512 KiB).
- No #[ctor] / static init -- explicit call from main is documented as cheaper and easier to reason about than allocator-vs-ctor ordering games.
- Re-export ProfileConfig + ENV_PROFILE_RATE + ENV_PROFILE_ENABLE from the crate root.
- Unit tests in src/config.rs cover Default, with_sampling_rate, configure_profiling round-trip + idempotency + zero-disables, and parse_bool_env recognition.
- New integration test tests/profile_runtime_config.rs serialises env-var manipulation with a local OnceLock<Mutex<()>> and a Drop guard that restores both env vars and the global sampling rate, so it doesn't race against profile_accuracy.rs sibling tests.
- All tests pass under both cargo test and cargo test --features profiling; cargo doc --features profiling --no-deps is warning-free.
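The precedence rules for the two variables can be captured in a pure function that takes the raw values, which keeps it testable without touching the process environment. `resolve_rate` and `parse_bool` are hypothetical names, and letting an unparseable RATE fall through to the ENABLE check is an assumption; the commit message only specifies the parseable case.

```rust
/// Default sampling interval when profiling is enabled without an explicit rate.
pub const DEFAULT_RATE: u64 = 524_288; // 512 KiB

/// Recognises the truthy aliases 1/true/yes, case-insensitively, ignoring
/// surrounding whitespace.
pub fn parse_bool(value: &str) -> bool {
    matches!(value.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes")
}

/// Resolves the sampling rate from the two variable values.
/// `None` means "neither variable asks for a change".
pub fn resolve_rate(rate: Option<&str>, enable: Option<&str>) -> Option<u64> {
    // A parseable rate wins outright, including 0 (sampling disabled).
    if let Some(r) = rate.and_then(|v| v.trim().parse::<u64>().ok()) {
        return Some(r);
    }
    // Assumption: an unparseable rate is ignored rather than treated as an error.
    match enable {
        Some(v) if parse_bool(v) => Some(DEFAULT_RATE),
        _ => None,
    }
}
```

A caller would typically pass `std::env::var("SNMALLOC_PROFILE_RATE").ok().as_deref()` and the equivalent for `SNMALLOC_PROFILE_ENABLE`, then apply the result only when it is `Some`.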

…ase 4.6) (#16)

- snmalloc-rs/Cargo.toml: add `inferno = "0.11"` as a dev-dependency (test-only; never appears in the published crate's transitive deps). Version pin documented inline -- 0.11 keeps MSRV aligned with the rest of the workspace, while later 0.12.x bumps `rust-version` to 1.71 and pulls in additional crossbeam transitive deps we don't otherwise need.
- snmalloc-rs/tests/profile_viewer_roundtrip.rs: new integration suite asserting that the folded-stack output emitted by Phase 4.3's `HeapProfile::write_flamegraph` is consumable by two real viewers in the Rust profiling ecosystem. Test-only -- no public API on `HeapProfile` / `SnMalloc` is added, and `src/profile.rs` is not touched.
  * inferno_roundtrip -- captures a >=50-sample snapshot, writes its folded form into a `Vec<u8>`, hands it to `inferno::flamegraph::from_reader` with `Options::default()`, and asserts the rendered SVG contains a `<svg` root and at least one `<g` stack-frame group node. Confirms the round-trip from folded bytes to SVG works without any post-processing.
  * speedscope_folded_import -- re-implements the regex `^([^\s]+) (\d+)$` that speedscope's "Brendan Gregg's collapsed stack format" importer uses (per its wiki) and asserts >=95% of folded lines match. speedscope itself runs in a browser/wasm context we can't drive in CI, so the conformance check is the next best thing.
  * round_trip_weight_invariance -- regression guard for the Phase 4.3 BTreeMap collapse step: sum of folded weights over a real-workload snapshot must equal `HeapProfile::total_allocated_bytes` exactly.
  * empty_snapshot_viewer_safety -- runs in both feature configurations (no `#[cfg(feature = "profiling")]` gate). Confirms `write_flamegraph` on an empty profile writes zero bytes and that inferno cleanly returns `Err` rather than panicking when handed the resulting empty stream. Covers the OFF-build path where every snapshot is empty by construction.
- Workload calibration: 5_000 x 64-byte allocations at sampling rate 512 -> ~625 expected samples (well above the 50-sample floor Phase 4.6 requires). Smaller than the 100k workload in profile_accuracy.rs to keep CPU contention low when `cargo test --all --features profiling` runs the two test binaries in parallel. Workload-driving helpers live in a `#[cfg(feature = "profiling")]` module to avoid dead-code warnings on the OFF build.

Verified:
- cargo test --all (profiling OFF): all binaries green, including the new profile_viewer_roundtrip binary running just empty_snapshot_viewer_safety.
- cargo test --all --features profiling: stable across 5 back-to-back runs; all 4 new tests pass, all pre-existing tests pass.
- cargo test --features profiling --test profile_viewer_roundtrip: 4 passed, 0 failed.
- No new compiler warnings in either feature configuration.
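For readers who want to reproduce the speedscope conformance check without a regex dependency, the pattern `^([^\s]+) (\d+)$` can be matched by hand. This is a sketch under the assumption that the test only needs a yes/no answer per line; the function names are hypothetical and the real test may use a regex crate instead. The last helper reproduces the workload-calibration arithmetic (total bytes divided by the mean sampling interval).

```rust
/// Hand-written equivalent of `^([^\s]+) (\d+)$`: a whitespace-free
/// stack, one space, then an ASCII-digit weight.
fn matches_collapsed_line(line: &str) -> bool {
    match line.rsplit_once(' ') {
        Some((stack, weight)) => {
            !stack.is_empty()
                && !stack.chars().any(char::is_whitespace)
                && !weight.is_empty()
                && weight.bytes().all(|b| b.is_ascii_digit())
        }
        None => false,
    }
}

/// Fraction of non-empty lines that conform; an empty input counts as 1.0.
fn conformance_ratio(folded: &str) -> f64 {
    let lines: Vec<&str> = folded.lines().filter(|l| !l.is_empty()).collect();
    if lines.is_empty() {
        return 1.0;
    }
    let ok = lines.iter().filter(|l| matches_collapsed_line(l)).count();
    ok as f64 / lines.len() as f64
}

/// Expected sample count for `n` allocations of `size` bytes at a mean
/// sampling interval of `rate` bytes.
fn expected_samples(n: u64, size: u64, rate: u64) -> u64 {
    n * size / rate
}
```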

- New AllocationSampleList primitive: fixed-K (K=4) atomic slot array of noexcept callbacks invoked once per sampled allocation. Lock-free register/unregister via per-slot CAS; broadcast iterates with relaxed loads. Documented chosen storage and the no-allocation handler contract. - record_alloc now broadcasts the just-installed SampledAlloc to every registered handler, alloc-only (matches tcmalloc semantics). Broadcast is wrapped in its own ReentrancyGuard so a handler that allocates short-circuits the sampler via the existing reentry check. - C exports sn_rust_profile_streaming_{start,stop} gated by SNMALLOC_PROFILE; a single FFI user callback at a time is bridged through a noexcept shim that converts SampledAlloc to SnRustProfileRawSample. Stubs preserve link-compatibility in the SNMALLOC_PROFILE=OFF build. - rust_profile.h declares the new entry points and the streaming contract. - New profile_streaming ctest covers per-sample fan-out, parity with the SampledList live count, unregister-stops-broadcast, multi-subscriber fan-out, slot-exhaustion rejection, and the OFF-build smoke arm.
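The fixed-K slot array is implemented in C++ in this PR; the following Rust sketch only illustrates the same idea (per-slot CAS for register/unregister, relaxed loads on broadcast). All names here, including `HandlerSlots` and the two example handlers, are hypothetical, and function pointers are stored as `usize` purely to keep the sketch short.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

type Handler = fn(usize);
const K: usize = 4; // fixed slot count, as in the commit message

/// Fixed-size array of handler slots; 0 means "empty".
struct HandlerSlots {
    slots: [AtomicUsize; K],
}

impl HandlerSlots {
    fn new() -> Self {
        Self { slots: std::array::from_fn(|_| AtomicUsize::new(0)) }
    }

    /// Claim the first empty slot with a CAS; false when all K are taken.
    fn register(&self, h: Handler) -> bool {
        let raw = h as usize;
        self.slots.iter().any(|s| {
            s.compare_exchange(0, raw, Ordering::AcqRel, Ordering::Relaxed).is_ok()
        })
    }

    /// Clear the first slot currently holding `h`.
    fn unregister(&self, h: Handler) -> bool {
        let raw = h as usize;
        self.slots.iter().any(|s| {
            s.compare_exchange(raw, 0, Ordering::AcqRel, Ordering::Relaxed).is_ok()
        })
    }

    /// Invoke every registered handler once; returns how many ran.
    fn broadcast(&self, sample_size: usize) -> usize {
        let mut invoked = 0;
        for s in &self.slots {
            let raw = s.load(Ordering::Relaxed);
            if raw != 0 {
                // SAFETY: non-zero values are only ever written from a
                // valid `Handler` in `register`.
                let h: Handler = unsafe { std::mem::transmute::<usize, Handler>(raw) };
                h(sample_size);
                invoked += 1;
            }
        }
        invoked
    }
}

// Example handlers: they do not allocate, mirroring the handler contract.
static TOTAL_BYTES: AtomicUsize = AtomicUsize::new(0);
static SAMPLES: AtomicUsize = AtomicUsize::new(0);

fn count_bytes(size: usize) {
    TOTAL_BYTES.fetch_add(size, Ordering::Relaxed);
}

fn count_samples(_size: usize) {
    SAMPLES.fetch_add(1, Ordering::Relaxed);
}
```

A fixed array avoids any allocation on the register path, which is what lets a structure like this live inside an allocator; the cost is that a fifth subscriber is rejected rather than queued.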

- New pub(crate) module snmalloc-rs/src/pprof.rs hand-rolls the protobuf3 wire format (varint + length-delimited) for the subset of Google's pprof Profile schema needed for snmalloc heap snapshots; no prost/flate2 dependencies added. - HeapProfile::write_pprof emits two sample_type axes (alloc_objects/count, alloc_space/bytes) plus per-stack location/function chains; output is uncompressed (callers can wrap in GzEncoder if they want .pb.gz). - Unsymbolicated frames render function name as 0x..hex.. with empty filename/line, mirroring write_flamegraph; symbolicated frames use names from HeapProfile::symbolize when available. - Tests: 6 unit tests in src/pprof.rs (varint, empty profile, alloc_space-axis invariance under both Weight projections, function/location dedup, string-table slot-0 contract) + 3 integration tests in tests/profile_pprof.rs gated on --features profiling (smoke, empty snapshot, total_weight == total_allocated_bytes).
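For context on what "hand-rolls the protobuf3 wire format" involves, here is a minimal sketch of the two primitives named above: base-128 varints and length-delimited fields (wire type 2). The function names are hypothetical and not taken from `src/pprof.rs`; the encoding rules themselves follow the public Protocol Buffers wire format.

```rust
/// Append `value` as a protobuf base-128 varint: 7 payload bits per byte,
/// least-significant group first, high bit set on every byte but the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Append a varint-typed field (wire type 0).
fn encode_varint_field(field: u32, value: u64, out: &mut Vec<u8>) {
    encode_varint((field as u64) << 3, out);
    encode_varint(value, out);
}

/// Append a length-delimited field (wire type 2): key, byte length, payload.
/// Nested messages and strings both use this shape.
fn encode_len_delimited(field: u32, payload: &[u8], out: &mut Vec<u8>) {
    encode_varint(((field as u64) << 3) | 2, out);
    encode_varint(payload.len() as u64, out);
    out.extend_from_slice(payload);
}
```

With only these building blocks a nested message is encoded by serialising the child into a temporary buffer first and then emitting it as a length-delimited field of the parent.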

…rhead (#19)

- Phase 7.1: hoist bytes_until_sample into a dedicated alignas(64/128) SamplerHotState struct (128 bytes on Apple Silicon, 64 elsewhere) so the per-thread fast-path counter sits on its own cache line and cannot false-share with the colder Sampler tail (PRNG state, last_sample_, initialized_) or with concurrent dealloc slot-clear traffic. Counter is the first member of the cache-aligned region (offset 0). Adds a SNMALLOC_LIKELY annotation on the hot subtract+compare.
- Phase 7.3: new func test profile_overhead asserting
  a) sizeof(Config::PagemapEntry) is unchanged vs. an explicit StandardConfigClientMeta<NoClientMetaDataProvider> -- proves the lazy provider type is compiled in but contributes zero bytes when profiling is off.
  b) bytes_until_sample lives at offset 0 of the cache-aligned hot state (offsetof check).
  c) Runtime gate: 1M alloc/free pairs of size 32 under Sampler::set_sampling_rate(0) (off) and Sampler::set_sampling_rate(2^40) (on, never fires) -- assert ns/alloc ratio < 1.05, i.e. no branch-misprediction storm in the dealloc null-slot fast-path.

- Add snmalloc-sys extern "C" decls for sn_rust_profile_streaming_start / sn_rust_profile_streaming_stop, gated on the `profiling` feature. - Introduce `snmalloc-rs::streaming` exposing `ProfilingSession` (RAII handle) plus a borrowed `StreamSample<'_>` view of the raw FFI sample. Single-session-at-a-time semantics enforced through a process-global `Mutex<Option<Handler>>`; second `start()` returns `StreamingError::AlreadyActive`. - Trampoline is a fixed `extern "C"` function that locks the slot, dispatches into the boxed `Fn` and catches panics so unwinds never cross the FFI boundary. Handler bounds are `Send + Sync + 'static`. - Drop unregisters from the C side, then clears the slot so a fresh `ProfilingSession::start` can succeed. - Re-export `ProfilingSession`, `StreamSample`, `StreamingError` from the crate root under `#[cfg(feature = "profiling")]`. - Add `tests/profile_streaming.rs` covering: smoke handler-invocation, double-start AlreadyActive recovery, drop-unregisters guarantee, and thread-safety under a concurrent allocator workload.

- New snmalloc-rs/tests/profile_pprof_roundtrip.rs (profiling-gated) - `pprof_roundtrip_via_go_tool`: runs a small workload, writes the pprof bytes to a unique tempfile (no `tempfile` dep), and invokes `go tool pprof -raw <file>`. Asserts exit 0 and that stdout contains a structural marker (`Samples:`, `sample_type`, `PeriodType`, or one of our axis names). - `empty_snapshot_pprof_roundtrip`: same path but on a default `HeapProfile`; the metadata-only Profile must still parse. - `skip_if_no_go` helper: probes `go version` and skips with an `eprintln!` when Go is not on PATH. Keeps cargo test green on developer machines / CI images without a Go toolchain. - No new dev-deps; stdlib only. Tempfile path uses `temp_dir() + pid + SystemTime nanos`. - Workload + process-wide mutex pattern mirrors profile_pprof.rs and profile_viewer_roundtrip.rs.

- benches/profile_bench.rs: three groups (small_allocs 32B, medium_allocs 4K, mixed 16..16384) x three variants (profile-off, profile-on-inactive at usize::MAX rate, profile-on-active at 512 KiB default rate). Hand-rolled main emits a stderr summary pointing at the ratio_idle metric used by CI to gate idle overhead at <= 5%. - Cargo.toml: criterion 0.5 (no default features) as a dev-dep, [[bench]] entry with harness = false. - benches/README.md: short doc on running, what ratio_idle means, why absolute numbers are host-specific.

- Add `profiling` job to rust.yml: cargo build/test --features profiling on ubuntu-latest, macos-14, macos-15 (release + debug, stable toolchain). - Confirms main.yml already covers SNMALLOC_PROFILE=ON for ubuntu-24.04 gcc/clang and macos-15 clang (added in Phase 3.0 + earlier macOS edit); no main.yml edits required. - Restricted to Linux + macOS per task scope; Windows profile coverage can be added later if needed.

- 8 worker threads tight-loop alloc/free at sizes [16,64,256,1024,16384] - 9th sampler thread snapshots SampledList every ~10ms for 5s - exercises H1-H4 dealloc hooks + lock-free SampledList under churn - TSan/ASan-clean by construction; sanitizer cmd lines documented inline - SNMALLOC_PROFILE=OFF path collapses to a "skipped" stub

…#25) - README.md: new H2 'Heap Profiling' section covering SNMALLOC_PROFILE CMake flag, default 524288-byte Poisson sampling rate, C ABI exports, pointer to the Rust crate, supported output formats (folded flamegraph + pprof), and the <1% overhead claim citing the Phase 7 bench suite. - snmalloc-rs/README.md: extended with a 'Heap Profiling' section documenting the profiling and symbolicate Cargo features, snapshot + flamegraph quick start, streaming ProfilingSession, env-var-driven init_profiling_from_env, pprof output via write_pprof, symbolicated flamegraphs, and the graceful feature-off fallbacks. - All Rust code samples spot-checked against the actual public surface in snmalloc-rs/src/{lib,profile,config,streaming}.rs.

- Crate-level //! Heap Profiling section with end-to-end snapshot + flamegraph example - HeapProfile struct / samples() / total_allocated_bytes() examples - write_flamegraph and write_pprof File / Vec<u8> examples (no_run) - Weight enum example showing Allocated vs Requested - ProfilingSession::start example with shared atomic counter + RAII drop - StreamSample accessor example covering alloc_ptr / requested_size / allocated_size / weight / stack - SnMalloc::configure_profiling and init_profiling_from_env examples - All examples compile under both --features profiling and the default build; cargo test --doc passes 10/10 (default) and 12/12 (profiling feature on)

- Replace the hard 5% bound on sum(weight) with the derived 6-sigma envelope of the Poisson unbiased-sum estimator (Var ~ N*SIZE*RATE). At the chosen constants (N=100_000, SIZE=64, RATE=4096) the old 5% bound was only ~1.97 sigma, giving a ~5% per-run flake rate under sibling cargo-test CPU contention. The new window is [5_428_293, 7_371_707] bytes around the 6_400_000 expected. - Verified by running the test 50x in a tight loop: 0 failures. - Ticket: 86aj0h83a.
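The sigma figures above can be checked with a few lines of arithmetic. The sketch below uses the simplified variance `N*SIZE*RATE` exactly as quoted; the function names are hypothetical. Under that approximation the 6-sigma half-width comes out a few hundred bytes narrower than the window quoted in the commit message, so the test presumably uses a slightly more exact variance term.

```rust
/// Standard deviation of the unbiased-sum estimator under the
/// simplified variance Var ~ N * SIZE * RATE.
fn sigma(n: f64, size: f64, rate: f64) -> f64 {
    (n * size * rate).sqrt()
}

/// How many standard deviations a relative bound (e.g. 0.05) spans.
fn bound_in_sigmas(n: f64, size: f64, rate: f64, rel_bound: f64) -> f64 {
    rel_bound * n * size / sigma(n, size, rate)
}

/// Six-sigma acceptance window around the expected total N * SIZE.
fn six_sigma_window(n: f64, size: f64, rate: f64) -> (f64, f64) {
    let mean = n * size;
    let half = 6.0 * sigma(n, size, rate);
    (mean - half, mean + half)
}
```

With N=100_000, SIZE=64 and RATE=4096, `bound_in_sigmas(.., 0.05)` evaluates to roughly 1.976, consistent with the ~1.97 sigma figure and the ~5% two-sided flake rate mentioned above.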

- Adds two ubuntu-24.04 clang Debug matrix legs to the existing ubuntu job in .github/workflows/main.yml so the heap-profiling code paths exercised by perf-profile_stress and the func-profile_* suite are run under ThreadSanitizer and AddressSanitizer.
- Both legs configure -DSNMALLOC_PROFILE=ON and the project's existing SNMALLOC_SANITIZER cmake option (=thread / =address) instead of raw CMAKE_CXX_FLAGS=-fsanitize=...; this is the idiomatic mechanism already used by the existing "TSan + UBSan" matrix entries (CMakeLists.txt:73-75, 580-606, 668-672) and correctly wires -fsanitize through to test-target compile and link lines plus the SNMALLOC_THREAD_SANITIZER_ENABLED define the codebase guards on.
- The TSan leg installs libc++-dev and uses -stdlib=libc++ to match the existing TSan + UBSan legs (libstdc++ on Ubuntu is not TSan-instrumented). The ASan leg uses the default libstdc++ runtime, which is ASan-compatible.
- Both legs pass `-R profile_` via test-extra-args so ctest runs only the profile suite (perf-profile_stress-{fast,check} + func-profile_*). This bounds sanitizer overhead within the CI time budget while still exercising the new snapshot-under-churn workload from PR #24.
- Local validation: configured + built + ran perf-profile_stress-fast on darwin-arm64 with -DSNMALLOC_SANITIZER=address; the fast variant ran ~5s under ASan with no diagnostics. TSan was not validated locally because the macOS toolchain available here does not ship a TSan-instrumented libc++; relying on the GitHub ubuntu-24.04 runner for that leg as called out in the ticket.

- New HeapProfile::write_pprof_gz<W: Write>(&mut self, w, weight) wraps the uncompressed write_pprof in flate2::write::GzEncoder so callers can produce the .pb.gz encoding accepted natively by Pyroscope, Polar Signals Cloud, Parca, Speedscope, and Datadog continuous profiler, as well as `go tool pprof`. - flate2 added as an optional dep gated by the existing `profiling` Cargo feature; deliberately not a separate feature, since gzipped pprof is the dominant on-the-wire encoding and splitting it off would multiply the build matrix without a meaningful payoff. - Three new integration tests in tests/profile_pprof_gz.rs covering the gzip-magic prefix, byte-for-byte round-trip equivalence with write_pprof through flate2::read::GzDecoder, and empty-snapshot totality. ClickUp ticket: 86aj0h8af

…ly supported (#29) * Publish heap-profiling benchmark results (86aj0h88j) - Run snmalloc-rs/benches/profile_bench.rs end-to-end with --features profiling on Apple M4 Pro / macOS 26.3.1; capture mean / CI / median / stddev from target/criterion/*/new/estimates.json. - New docs/heap-profiling-benchmarks.md table-formats the raw numbers for the small_allocs / medium_allocs / mixed groups across the three variants (profile-off, profile-on-inactive, profile-on-active). - Compute ratio_idle and ratio_active per group; averages are ~1.024 in both configurations, max ratio is 1.0493 on medium_allocs/profile-on-inactive. All groups stay inside the bench harness's documented <=1.05 acceptance band. - Document the gap vs the existing "<1% overhead" README claim: small allocs support it (in noise), but medium and mixed land at ~3-5%. Recommend softening the README phrasing in a follow-up PR. - No groups hit the 20-minute time budget; full sweep ~85s wall-clock. * Link perf-regression ticket; keep README <1% claim as target - Replace 'soften README claim' recommendation with link to ClickUp ticket 86aj0hfmc that drives medium/mixed under 1% - Keep reproduction caveats (Linux pinning, larger sample_size) - Per user direction: target stays; gap is a perf-regression follow-up, not a docs change

…6aj0hfmc) (#31) - src/snmalloc/profile/sampler.h: hoist the per-thread `sampler_reentered()` check from `Sampler::record_alloc` into `record_alloc_slow`. The hot countdown is now a single TLS decrement plus a signed compare; the reentrancy check only runs on the ~1-in-512-KiB fraction of allocations that already cost a slow-path transition. Sample weighting unchanged -- the `rate - hot_.bytes_until_sample + requested_size` formula already absorbs the overshoot when the counter ticks negative under re-entry. - src/snmalloc/profile/record.h: reorder `record_dealloc<Config>` so the cheap slab-metadata probe and atomic-slot peek run before the `ReentrancyGuard` is constructed. The common-case (object on a slab with no installed lazy backing, or slab installed but specific object never sampled) now skips the TLS store-store-load round-trip from the guard. - docs/heap-profiling-benchmarks.md: re-publish bench numbers after the fix. Idle ratios dropped from a max of 1.0493 to 1.0128 on this host, with two of three groups under 1.01. Documented the cross-run bimodal variance (20-80% on individual variants between back-to-back runs) that prevents this harness on this host from credibly resolving the remaining <3% gap on mixed/active. ClickUp: 86aj0hfmc

…agnosis (86aj0kdym) (#40)

Three follow-up perf tweaks on top of bundle 1+3+2 (86aj0jfwh):

D. Drop `Sampler::initialized_` boolean and the dedicated `if (!initialized_)` branch in `record_alloc_slow`. Bootstrap state is inferred from `interval_at_capture_ == 0` (which is set to the active sampling rate on first successful slow-path completion; the rate==0 short-circuit earlier means the value is always strictly positive after bootstrap, so it doubles as the "already bootstrapped" signal). Saves one member load + branch every slow-path entry after the first sample on the thread. `Sampler::debug_initialized()` continues to work via the same sentinel. The existing 100k stack-allocated `Sampler` unit-test (`test_sampler_bootstrap`) still hits the bootstrap branch on every instance.

E. 5-run noise diagnostic for `medium_allocs/profile-on-active`. The 1.0794 ratio reported in a single PR-#33 run collapses to 0.9990 +/- 0.0086 over 5 fresh `cargo bench --features profiling` runs on the same host (range [0.9853, 1.0090]; every run <= 1.01). The PR-#33 datapoint sits >9 stddevs from this mean; it is consistent with the bimodal macOS-laptop harness noise this doc has called out since Phase 7.2 rather than a real fast-path regression. Doc updated with the full 5-run table; no perfstat/dtrace cache-miss chase was warranted because the noise check showed no consistent signal.

F. Branch hints on `record_dealloc_peek<Config>`. The `p == nullptr` early-exit was mis-hinted `SNMALLOC_LIKELY` -- corrected to `SNMALLOC_UNLIKELY` since the overwhelmingly common case is a non-null `free(p)`. The two `slot == nullptr` / `slot->load() == nullptr` early-exits (the actual ~99.999% fall-through paths for non-sampled deallocs) already carried `SNMALLOC_LIKELY`; their hints are kept and the comments updated to call out the fall-through rate explicitly.

Verification:
* `ctest -R '^func-profile_'` -- 18 / 18 pass (including `test_sampler_bootstrap` which spawns 100k fresh Samplers).
* `cargo test --features profiling` -- 5 / 4 / 4 / 13 (lib + tests + doc) pass across 3 back-to-back runs.
* `nm` on the release `profile_bench` binary confirms `record_dealloc<Config>`, `record_dealloc_peek<Config>`, `find_profile_slot`, `tl_record_alloc`, and `clear_profile_slot` remain fully inlined; only `record_alloc_slow` and `record_alloc_from_namespace_tls` survive as out-of-line symbols (unchanged from bundle 1+3+2).
* `otool -tvV` on `_ZN8snmalloc7deallocEPv` shows the peek as a 3-instruction `add / ldapr / cbnz` sequence at the call site -- the "probe, load, jne" the bundle targets.

Touches `src/snmalloc/profile/sampler.h`, `src/snmalloc/profile/record.h`, and `docs/heap-profiling-benchmarks.md` only. Sampler public API (`record_alloc`, `record_alloc_from_namespace_tls`) is unchanged.

Trailing backslashes on // comment lines line-continued the comment into the next source line, which gcc -Werror=comment flags. They were intended as shell continuations in the example commands but have no meaning inside a C++ comment. Drop the backslashes; the example reads the same. Unblocks ubuntu-24.04 Release / Debug builds on fork main.

Add docs/profiling-pmu.md covering the four CPU-microarch gaps that snmalloc itself does not sample: allocation hot-spots, cache misses (Linux + macOS), false sharing, and branch-hint miss rates. Each section provides a runnable perf/Instruments capture sequence plus the join against snmalloc metadata (lookup_alloc_site from Phase 10.1, branch_hints.json from Phase 10.2, automation via snmalloc-tools in Phase 10.4). Closes with explicit non-goals so embedders know what snmalloc will and will not do at runtime. Link the new doc from the README Heap Profiling section.

…d Stats references (#42) The USE_SNMALLOC_STATS CMake define was propagated by snmalloc-sys/build.rs and BUILD.bazel, but the only code that observed it -- two #ifdef blocks in src/test/perf/contention/contention.cc -- referenced a Stats class and current_alloc_pool()->aggregate_stats() that no longer exist anywhere in the source tree. The flag was bit-rot. This commit: - Deletes the two dead #ifdef USE_SNMALLOC_STATS blocks in contention.cc (the surviving usage::print_memory() call is untouched). - Renames the CMake-facing symbol USE_SNMALLOC_STATS -> SNMALLOC_STATS in snmalloc-rs/snmalloc-sys/build.rs, BUILD.bazel, and docs/BUILDING.md. - Leaves the public-facing snmalloc-rs "stats" Cargo feature unchanged; only the internal C-side symbol is renamed. The renamed symbol is currently harmless (no C++ code consumes it). It is re-claimed here so subsequent phases can wire real stats APIs onto it without colliding with the old dead-code definition. git grep USE_SNMALLOC_STATS now returns empty.

Adds scripts/dump_branch_hints.py, a stdlib-only Python 3 script that scans src/snmalloc/ for every SNMALLOC_LIKELY(...) / SNMALLOC_UNLIKELY(...) call site and emits a JSON sidecar of {file, line, kind} entries. The macro-definition lines in ds_core/defines.h are filtered out so consumers don't have to. Output is deterministically sorted for diff-friendly review. Wires it into CMake as a stand-alone target branch_hints_inventory that writes the sidecar to ${CMAKE_BINARY_DIR}/snmalloc_branch_hints.json and installs it under share/snmalloc/. The target is NOT a dep of the main library so a missing Python interpreter never blocks ordinary builds — FindPython3 is QUIET and the target is conditionally registered. snmalloc-rs/snmalloc-sys/build.rs gains a best-effort step that locates the sidecar (falling back to invoking dump_branch_hints.py directly when the script is present in source_root/scripts/) and copies it into OUT_DIR/branch_hints.json, exposing SNMALLOC_BRANCH_HINTS_JSON via cargo:rustc-env for downstream Rust consumers. All failures are silent so Rust builds keep working without python3 installed. Consumed by Phase 10.4 (snmalloc-tools) to flag inverted hints from perf branch-miss samples.

Two deliverables for the Phase 10 PMU-attribution work: A. HeapProfile::top_sites(n, key) -> Vec<HotSite> Pure post-processing over the existing snapshot samples; ranks call sites by inclusive Weight::Allocated bytes. Three grouping modes (CallSite, LeafFrame, FullStack); CallSite currently degrades to LeafFrame in the unsymbolicated build pending a future symbol-based allocator-frame filter. B. SnMalloc::lookup_alloc_site(addr) -> Option<Frames> Address -> alloc-site reverse lookup for live sampled allocations. Accepts interior pointers. Backed by a new header-only helper snmalloc::profile::lookup_alloc_site() that builds a sorted-by-base index from a SampledList snapshot at call time and binary-searches for containment. Off the alloc hot path; never mutates the lock-free SampledList. C ABI surface: sn_rust_profile_lookup_alloc_site(addr, out_frames, max_frames, out_base_addr, out_allocated_size) Lives in rust.cc alongside the rest of the rust FFI shim (the Phase 10.1 spec called for a separate addr_lookup.cc; folding the symbol into rust.cc avoids duplicating the SNMALLOC_PROFILE build wiring and matches the existing pattern for every other sn_rust_profile_* export).
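The reverse lookup described in B (sorted-by-base index plus binary search for containment) can be sketched as follows. The real helper is header-only C++; the Rust types and names here (`LiveSample`, `build_index`, `lookup`, `site_id`) are hypothetical stand-ins used only to show why interior pointers resolve correctly.

```rust
/// One live sampled allocation: [base, base + size) plus an opaque
/// identifier standing in for the captured stack (hypothetical field).
#[derive(Clone, Copy, Debug, PartialEq)]
struct LiveSample {
    base: usize,
    size: usize,
    site_id: u32,
}

/// Build the lookup index from a snapshot: sort by base address.
fn build_index(mut samples: Vec<LiveSample>) -> Vec<LiveSample> {
    samples.sort_by_key(|s| s.base);
    samples
}

/// Find the sample whose range contains `addr`, if any.
fn lookup(index: &[LiveSample], addr: usize) -> Option<LiveSample> {
    // First position whose base is greater than `addr`; the candidate
    // is the entry just before it (the last one with base <= addr).
    let pos = index.partition_point(|s| s.base <= addr);
    let candidate = index.get(pos.checked_sub(1)?)?;
    (addr - candidate.base < candidate.size).then_some(*candidate)
}
```

Because the index is rebuilt from a snapshot at call time, the lookup costs O(n log n) per call but never touches the live list, which matches the "off the alloc hot path" trade-off stated above.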

…ld) (#46) Lands the public surface for the broader Phase 9 telemetry work. All wave-2 Phase 9 tickets (9.2 fast/slow path counters, 9.3 per-class histograms, 9.4 mapping accounting, 9.5 lifetime histogram) will populate fields on this struct without changing the wire layout. Adds: - src/snmalloc/global/stats_export.h declaring `struct snmalloc_full_stats` (POD layout, fixed-width integers, forward-compat `reserved[]` pool) and the `snmalloc_get_full_stats` C ABI getter prototype. The SNMALLOC_FULL_STATS_VERSION macro lets newer producers add fields at trailing slots without invalidating older consumers. - src/snmalloc/override/stats_export.cc implementing the getter: `memset(out, 0)` then populate `version`, `bytes_in_use`, and `peak_bytes_in_use` by delegating to `Alloc::Config::Backend::get_current_usage/get_peak_usage`. Every other field stays zero at the scaffold stage. - snmalloc-rs/snmalloc-sys/src/lib.rs FFI mirror (`#[repr(C)] struct snmalloc_full_stats`, matching `SNMALLOC_FULL_STATS_VERSION` / `SIZECLASS_SLOTS` / `LIFETIME_BUCKETS` / `RESERVED_SLOTS` constants, `extern "C" fn snmalloc_get_full_stats`). - snmalloc-rs/src/lib.rs idiomatic Rust mirror `FullAllocStats` with `Copy`/`Debug`/`PartialEq` + manual `Default`, and `SnMalloc::full_stats()` method. Gated behind the existing `stats` Cargo feature so consumers without it get a compile-time error rather than a runtime-zero stub. - snmalloc-rs/tests/full_stats.rs integration test asserting the version matches `SNMALLOC_FULL_STATS_VERSION`, that `bytes_in_use > 0` after a 1 MiB live allocation, that `peak_bytes_in_use >= bytes_in_use`, that the peak is monotone across a dealloc, and that every wave-2 field reads as zero. Wired into the build: - CMakeLists.txt adds stats_export.cc to both the libsnmallocshim ALLOC list (so the symbol ships in libsnmallocshim.so/.dylib) and the Rust static-library RUST list (so it ships in libsnmallocshim-rust.a). 
- snmalloc-rs/snmalloc-sys/build.rs includes the new TU on the `build_cc` path alongside rust.cc. Verified: - `cmake -B build -DSNMALLOC_STATS=ON -DSNMALLOC_RUST_SUPPORT=ON && cmake --build build -j4` builds without errors. - `nm build/libsnmallocshim.dylib | grep snmalloc_get_full_stats` shows the exported symbol; same for libsnmallocshim-rust.a. - `cargo build` (without `stats`) succeeds and `full_stats()` is not visible. - `cargo test --features stats --test full_stats` passes all 3 scaffold tests.

…47) New workspace member crate that joins external PMU output (Linux perf) with snmalloc's in-tree allocation-site lookup (Phase 10.1) and branch-hint inventory (Phase 10.2). Subcommands: - profile-top: top-N allocation sites from the in-process snapshot - pmu-join cache-misses: data-addr -> alloc-site via lookup_alloc_site - pmu-join c2c: HITM cache lines -> alloc-site - branch-misses: cross-reference perf script with branch_hints.json All subcommands support --json for structured output. The lookup_alloc_site live-process limitation is documented in the crate README and the CLI long_about; integration tests exercise the in-process join path against allocations made by the test binary itself.

Populate `bytes_mapped`, `bytes_committed`, and `bytes_decommitted_to_os` in the FullAllocStats snapshot (`snmalloc_get_full_stats`). * New `backend_helpers/fragstats.h` exposes the `BackendFragCounters` aggregator and `get_backend_frag_stats()` reader. Two process-global `stl::Atomic<size_t>` counters track live committed bytes and cumulative bytes decommitted via the PAL. * `commitrange.h` is instrumented at the `notify_using` / `notify_not_using` boundary: a successful commit bumps `bytes_committed`; every decommit subtracts it (clamped at zero) and adds to the monotone `bytes_decommitted_to_os` total. * `bytes_mapped` reuses the existing StatsRange accounting that already backs `bytes_in_use`, since snmalloc only ever has live mappings for memory it also has a backend reservation for. * `override/stats_export.cc` populates the three new fields inside a clearly-marked `// Phase 9.4` block, leaving the other wave-2 ticket slots free. * New Rust integration test `full_stats_backend_frag_invariants` exercises the wire-up: drives traffic through the CommitRange, asserts `bytes_committed > 0`, `bytes_committed <= bytes_mapped`, and that `bytes_decommitted_to_os` is monotone non-decreasing across a free. The previous "fields are zero" assertion is dropped for the 9.4 slots.
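The commit/decommit accounting above (a live counter clamped at zero plus a monotone cumulative total) can be sketched in Rust as below. The actual implementation is C++ using `stl::Atomic<size_t>`; the struct and method names here are hypothetical, and the choice of relaxed ordering is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Two process-wide counters, mirroring the description above.
struct FragCounters {
    bytes_committed: AtomicUsize,
    bytes_decommitted_to_os: AtomicUsize,
}

impl FragCounters {
    const fn new() -> Self {
        Self {
            bytes_committed: AtomicUsize::new(0),
            bytes_decommitted_to_os: AtomicUsize::new(0),
        }
    }

    /// A successful commit bumps the live committed total.
    fn on_commit(&self, bytes: usize) {
        self.bytes_committed.fetch_add(bytes, Ordering::Relaxed);
    }

    /// A decommit subtracts from the live total, clamped at zero, and
    /// adds to the cumulative (never decreasing) decommit total.
    fn on_decommit(&self, bytes: usize) {
        let _ = self.bytes_committed.fetch_update(
            Ordering::Relaxed,
            Ordering::Relaxed,
            |cur| Some(cur.saturating_sub(bytes)),
        );
        self.bytes_decommitted_to_os.fetch_add(bytes, Ordering::Relaxed);
    }
}
```

The clamp matters because a decommit can cover a range that was never counted as committed (for example memory handed over before the counters existed); without it the unsigned counter would wrap.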

#49)

Add a `snmalloc::RuntimeConfig` singleton in `src/snmalloc/global/runtime_config.h` that exposes three process-wide knobs that were previously compile-time constants:

* sample_interval_bytes (mean Poisson interval; default 512 KiB)
* decay_rate_ms (chunk decay window; default 50 ms)
* max_local_cache_bytes (per-thread cache cap; default 1 MiB)

All three live in function-local `std::atomic` storage so they are safe to read and write from any thread at any point in the process lifetime, including before the first allocation (no global-init order dependency).

C ABI shims in `src/snmalloc/override/runtime_config.cc` expose the canonical setter/getter pairs (`snmalloc_set_*` / `snmalloc_get_*`) and the sample-interval setter additionally mirrors into `Sampler::set_sampling_rate` when `SNMALLOC_PROFILE` is defined so the existing profiler slow-path picks the value up without churning the sampler hot path.

Rust bindings in `snmalloc-rs::SnMalloc` expose six methods (`set_sample_interval` / `sample_interval` etc.) unconditionally, independent of the `stats` and `profiling` Cargo features. An integration test `tests/runtime_tunables.rs` covers roundtrip, cross-thread visibility, independence, and the non-zero default contract; verified passing under default, `stats`, and `profiling` feature configurations.

Backend read-side hooks for decay_rate_ms and max_local_cache_bytes are deferred to a follow-up ticket: the existing decay path is entangled with the `Range` template stack and the per-thread cache cap has a similar shape, so a careful point-fix carries regression risk worth isolating in its own change. The setter / getter / FFI surface is already in place so consumers can be wired without churning the C ABI.

Records log2-spaced sampled-allocation lifetimes in nanoseconds: - New `snmalloc::profile::LifetimeHistogram` singleton with 32 atomic buckets, hit via `record_lifetime_ns()` on the dealloc path of a sampled allocation. - `SampledAlloc::alloc_ts_ns` stamped from `steady_clock::now()` in `profile::record_alloc` right after the sampler slow path returns (sampler.h left untouched -- owned by ticket 9.7). - `clear_profile_slot` computes the lifetime under the linearising CAS that retires each sample and bumps the matching log2 bucket. - `snmalloc_get_full_stats` (Phase 9.1 scaffold) populates `lifetime_buckets_ns[32]` when `SNMALLOC_PROFILE` is defined; without it the field stays zero via the existing `memset`. - New C ABI `sn_rust_profile_lifetime_histogram(out, len)` -> count exposes the buckets to Rust; degrades to a zero-writing stub when profiling is off. - `HeapProfile::lifetime_histogram() -> [u64; 32]` is the safe Rust wrapper. Integration test (`profile_lifetime_histogram.rs`) asserts the API smoke + an alloc/sleep(50ms)/dealloc round bumps a bucket with `log2(ns) >= 25`.
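To make the bucket arithmetic concrete, here is a sketch of a log2 bucket mapping with 32 buckets. It assumes the bucket index is `floor(log2(ns))` clamped to the last bucket and that a zero lifetime lands in bucket 0; the commit message does not spell out the exact mapping, and `lifetime_bucket` is a hypothetical name.

```rust
const LIFETIME_BUCKETS: usize = 32;

/// Map a lifetime in nanoseconds to a log2-spaced bucket index.
/// Assumption: index = floor(log2(ns)), clamped to the last bucket,
/// with ns == 0 mapped to bucket 0.
fn lifetime_bucket(ns: u64) -> usize {
    if ns == 0 {
        return 0;
    }
    let log2 = 63 - ns.leading_zeros() as usize;
    log2.min(LIFETIME_BUCKETS - 1)
}
```

Under this mapping a 50 ms lifetime (50_000_000 ns) falls between 2^25 and 2^26 ns, so it lands in bucket 25, which is why the integration test can assert `log2(ns) >= 25` after a 50 ms sleep.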

…51) Wires up the wave-2 9.2 fields of `snmalloc_full_stats`: fast_path_allocs / slow_path_allocs / fast_path_deallocs / remote_deallocs / message_queue_drains / cross_thread_messages_received. Counters live in a new `FrontendStats` block embedded in every `Allocator`, gated by `SNMALLOC_STATS` (new CMake option, off by default). All increments are non-atomic writes against the per-thread allocator's `stats`, so the hot path stays allocator-local; cross-thread reads from `snmalloc_get_full_stats` sum the live `AllocPool::iterate()` walk plus a process-global drain pot that `ThreadAlloc::teardown` populates on thread exit so terminated threads' counters survive into snapshots. Tests added: - src/test/func/fast_path_counters: C++ test that bursts 1k same-sizeclass allocs/frees on one thread, then spawns a worker that performs 128 cross-thread frees (each 512 bytes so K * 512 = 64 KiB saturates the worker's remote-dealloc cache and forces an in-thread `post()`). Verifies all six 9.2 counters move by their expected amounts. - snmalloc-rs/tests/frontend_stats.rs: Rust integration test mirroring the C++ coverage, gated by the existing `stats` Cargo feature. - snmalloc-rs/tests/full_stats.rs: existing scaffold test updated to no longer assert-zero the 9.2 fields (now populated); 9.3/9.4/9.5 fields still asserted-zero for the remaining wave-2 tickets. ClickUp: 86aj0tr1e

Populate the four `FullAllocStats` per-class arrays (`total_live_bytes_by_class`, `total_live_count_by_class`, `cumulative_alloc_by_class`, `cumulative_dealloc_by_class`) by embedding a per-thread `SizeClassStats` block alongside the Phase 9.2 `FrontendStats` block on `Allocator<Config>`. * Counters are plain `uint64_t` arrays of length `NUM_SMALL_SIZECLASSES`, mutated only on the owning thread, so alloc / dealloc fast paths stay atomic-free. * Bump sites: fast-path `small_alloc`, slow-path stash refill, and `small_refill_slow` after backend refill all bump `cumulative_alloc[sc]` + `live_count[sc]` + `live_bytes[sc]`. Local-fast-path dealloc decrements live and bumps `cumulative_dealloc`; remote dealloc bumps `cumulative_dealloc` on the freeing thread and defers the live decrement to the owning thread's `handle_dealloc_remote` message-queue drain (delta computed from `bytes_returned`). * A process-global `SizeClassStatsGlobal` aggregator with relaxed atomics catches counters drained at thread teardown (extending the existing `drain_stats_to_global` path so pool reuse stays clean). * `snmalloc_get_full_stats` extends the Phase 9.2 pool walk with a `SizeClassStats` accumulator and copies the result into the FFI struct. Static assert pins `NUM_SMALL_SIZECLASSES <= SNMALLOC_FULL_STATS_SIZECLASS_SLOTS`. * Compiles away entirely with `SNMALLOC_STATS=OFF`. Tests * New `snmalloc-rs/tests/sizeclass_histogram.rs` (gated `stats` feature): pins a single sizeclass, asserts cumulative + live rise by N, frees, asserts live drops and cumulative_dealloc rises monotonically. Second test asserts the `cumulative_alloc >= cumulative_dealloc` invariant across every slot. * `snmalloc-rs/tests/full_stats.rs`: removes the 9.3 zero assertions (fields are now wired). * Verified: all 106 C++ ctest cases pass with stats on, all snmalloc-rs tests pass with `--features stats`, and the stats-off build remains clean. ClickUp: 86aj0tr4p

#53) Adds a tcmalloc-style human-readable text dump over the Phase 9.1 FullAllocStats snapshot. Pure formatter -- no new telemetry.

Exposes:
* snmalloc::dump_stats(FILE*) / dump_stats_to_string(std::string&) C++ overloads.
* snmalloc_dump_stats_to_buffer(buf, len) FFI-safe buffer routine with snprintf truncation semantics.
* SnMalloc::dump_stats(&mut impl io::Write) safe Rust wrapper that uses the standard size-query + alloc + fill two-phase pattern.

Output is a header of MALLOC: lines (bytes in use, peak, committed/decommitted, fast/slow alloc/dealloc counters, cross-thread message metrics). Optional sections appear when the underlying data is non-zero: a per-size-class table (populated by 9.3) and a log2-spaced lifetime histogram (populated by 9.5).

Integration test (snmalloc-rs/tests/dump_stats.rs) covers structural regex match against the canonical 'Bytes in use by application' line, writer-error propagation, and back-to-back-call independence.
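For readers unfamiliar with the size-query + alloc + fill pattern mentioned above, here is a minimal sketch built on `std::snprintf`. The function names, the single `bytes_in_use` argument, and the exact column layout of the `MALLOC:` line are assumptions made for illustration; the real `snmalloc_dump_stats_to_buffer` signature and output format are not shown in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical formatter with snprintf truncation semantics: writes at most
// len - 1 characters plus a terminating NUL, and returns the number of
// characters the complete text needs (excluding the NUL).
inline size_t dump_to_buffer_sketch(char* buf, size_t len, unsigned long long bytes_in_use)
{
  int n = std::snprintf(
    buf, len, "MALLOC: %12llu Bytes in use by application\n", bytes_in_use);
  assert(n >= 0); // snprintf only returns a negative value on an encoding error
  return static_cast<size_t>(n);
}

// Two-phase caller: (1) query the size with a null buffer, (2) allocate,
// (3) call again to fill.
inline std::string dump_to_string_sketch(unsigned long long bytes_in_use)
{
  size_t needed = dump_to_buffer_sketch(nullptr, 0, bytes_in_use);
  std::string out(needed + 1, '\0'); // +1 for the NUL snprintf writes
  dump_to_buffer_sketch(out.data(), out.size(), bytes_in_use);
  out.resize(needed); // drop the NUL
  return out;
}
```

A caller with a too-small buffer still gets a NUL-terminated prefix and learns from the return value how much space a retry needs.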

With the `symbolicate` Cargo feature enabled, `top_sites` now walks each sample's stack from the leaf outward and buckets on the first frame whose resolved symbol does **not** match an allocator namespace prefix (`snmalloc::`, `snmalloc_rs::`, `snmalloc_sys::`, the mangled `_ZN8snmalloc`, or the `__rust_alloc` / `__rg_alloc` GlobalAlloc thunks). If every frame is allocator-internal the leaf frame is used so no sample is dropped.

Without `symbolicate`, `CallSite` degrades to `LeafFrame` and emits a one-shot `eprintln!` (guarded by `std::sync::Once`) advertising the feature. The fallback is total: synthetic samples still produce a non-empty result.

Tests:
- callsite_groups_by_user_caller (symbolicate): two distinctly named, `#[inline(never)]` probe functions capture real backtraces via `backtrace::trace`; `top_sites(.., CallSite)` produces two buckets and conserves total bytes/sample count.
- callsite_falls_back_when_no_user_frame (symbolicate): a sample whose entire stack is unresolvable still produces a non-empty bucket whose leaf is the unresolvable address (not the empty-stack null sentinel).
- callsite_fallback_when_unsymbolicated (default features): pins the fallback contract -- CallSite behaves as LeafFrame and doesn't panic.

ClickUp: 86aj0x1qb
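The frame-selection rule above is implemented in Rust in this PR; the following C++ sketch only illustrates the logic. `pick_call_site` and `is_allocator_frame` are hypothetical names, and treating the two GlobalAlloc thunk names as prefixes is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Allocator prefixes as listed in the commit message above.
constexpr std::array<std::string_view, 6> kAllocatorPrefixes = {
  "snmalloc::", "snmalloc_rs::", "snmalloc_sys::",
  "_ZN8snmalloc", "__rust_alloc", "__rg_alloc"};

inline bool is_allocator_frame(std::string_view symbol)
{
  for (std::string_view prefix : kAllocatorPrefixes)
    if (symbol.substr(0, prefix.size()) == prefix)
      return true;
  return false;
}

// frames[0] is the leaf. Returns the index of the frame to bucket on: the
// first non-allocator frame walking outward, or the leaf (index 0) when every
// frame is allocator-internal, so no sample is dropped.
inline size_t pick_call_site(const std::vector<std::string>& frames)
{
  assert(!frames.empty()); // empty stacks are handled separately by the caller
  for (size_t i = 0; i < frames.size(); ++i)
    if (!is_allocator_frame(frames[i]))
      return i;
  return 0;
}
```

For a stack of `snmalloc::alloc`, `__rust_alloc`, `my_app::build_index`, `main`, the sample would be attributed to `my_app::build_index` (a made-up symbol name) rather than to the allocator leaf.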

) The Phase 10.2 sidecar generator (scripts/dump_branch_hints.py at the snmalloc repo root) ships only with the surrounding repo, not with the published snmalloc-sys crate. snmalloc-sys's Cargo `include` whitelists `upstream/CMakeLists.txt`, `upstream/src/**`, and `upstream/fuzzing/**` -- everything else under the repo root, including `scripts/`, is stripped by `cargo package`. Result: consumers installing via `cargo add snmalloc-rs --features stats` never see the script, so the build.rs best-effort fallback that runs it to generate `OUT_DIR/branch_hints.json` is a no-op for them, and snmalloc-tools (Phase 10.4) loses its sidecar.

Fix: vendor the script under `snmalloc-rs/snmalloc-sys/upstream/scripts/` and extend the Cargo include whitelist to cover `upstream/scripts/**`. The new copy carries a header pointing back at the canonical source so re-vendoring stays explicit ("update upstream and re-vendor"). The repo-root `scripts/dump_branch_hints.py` is left in place as the canonical version; this commit only adds a second copy under the vendored tree.

build.rs gains two small upgrades:
1. The python3 fallback now invokes the script with both `--repo-root` and `--source-dir` explicitly, derived by canonicalising `<upstream>/src/snmalloc`. The script's default behaviour is to compute paths relative to `--repo-root`, but in the snmalloc dev tree `upstream/src` is a symlink that resolves *out* of `upstream/`, so the old single-argument invocation crashed with `Path.relative_to` raising `ValueError`. The new invocation handles both the symlinked dev layout and the flat published-crate layout without touching the script semantics.
2. `cargo:rerun-if-changed=<script>` is now emitted before invoking python3 so re-vendoring picks up automatically on incremental builds.

Verification:
* `cargo package --list -p snmalloc-sys` shows `upstream/scripts/dump_branch_hints.py` in the tarball file list.
* Consumer smoke test (`cargo new` + `cargo add --path /Users/jayakasa/dev/snmalloc/snmalloc-rs --features stats` + `cargo build -vv`) shows `cargo:rustc-env=SNMALLOC_BRANCH_HINTS_JSON=<OUT_DIR>/branch_hints.json` and the file contains 101 hint sites (50/51 LIKELY/UNLIKELY) over 7152 bytes.
* `cargo test -p snmalloc-rs --features stats` still passes (including the existing branch-hints fixture coverage in snmalloc-tools integration tests).

…ved[0..16] (#57) Surface a log2-bucketed view of currently-free chunks held inside the LargeBuddyRange pools via the FullAllocStats FFI surface. The histogram lives in `reserved[0..15]`, bumping SNMALLOC_FULL_STATS_VERSION to 2 as an additive (offset-preserving) extension of the wire format.

Backend wiring:
- `Buddy` gains a histogram-callback template parameter (default `BuddyNoHistogram`, a no-op) so existing users like `SmallBuddyRange` pay zero overhead. Insertions/removals of free blocks into the per-bucket cache and red-black tree invoke `on_add` / `on_remove`.
- `LargeBuddyRange` plugs in the new `LargeBuddyFreeChunkHistogram`, a process-global atomic array (16 buckets, `MIN_CHUNK_BITS` based) aggregating populations across every live `LargeBuddyRange` Buddy.
- `BackendFragStats` carries the histogram alongside the existing Phase 9.4 commit/decommit counters; `get_backend_frag_stats()` snapshots all three.
- `LargeBuddyRange::Type::get_free_chunk_count_by_log_size` is the range-API accessor; the FullAllocStats getter in stats_export.cc copies the 16 buckets into `reserved[0..15]`.

FFI / Rust binding:
- `SNMALLOC_FULL_STATS_VERSION` bumped to 2.
- New `SNMALLOC_FULL_STATS_FREECHUNK_BUCKETS = 16` constant.
- `snmalloc-sys` re-exports both.
- `FullAllocStats` gains a `reserved: [u64; 64]` field and a typed `free_chunk_histogram() -> [u64; 16]` accessor.

Test:
- `full_stats_freechunk_histogram_populates` (gated on the `stats` Cargo feature): drive 10 x 1 MiB alloc+free through the allocator, assert at least one histogram bucket is non-zero and that the typed accessor agrees with the raw `reserved[]` slots.
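The callback-as-template-parameter arrangement above could look roughly like the sketch below. All type names here are hypothetical stand-ins (the PR's own types are `BuddyNoHistogram` and `LargeBuddyFreeChunkHistogram`), and two details are assumptions: that the callbacks receive the log2 of the block size, and that out-of-range sizes are clamped into the first or last bucket.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// No-op policy: calls inline to nothing, so users that do not want a
// histogram pay no cost.
struct NoHistogramSketch
{
  static void on_add(size_t) {}
  static void on_remove(size_t) {}
};

// Process-global log2 histogram of free blocks (assumed layout: bucket i
// counts blocks of size 2^(MIN_BITS + i), clamped at both ends).
template<size_t MIN_BITS, size_t BUCKETS>
struct Log2HistogramSketch
{
  static inline std::array<std::atomic<uint64_t>, BUCKETS> counts{};

  static size_t bucket(size_t size_bits)
  {
    size_t b = size_bits > MIN_BITS ? size_bits - MIN_BITS : 0;
    return b < BUCKETS ? b : BUCKETS - 1;
  }

  static void on_add(size_t size_bits)
  {
    counts[bucket(size_bits)].fetch_add(1, std::memory_order_relaxed);
  }

  static void on_remove(size_t size_bits)
  {
    uint64_t prev = counts[bucket(size_bits)].fetch_sub(1, std::memory_order_relaxed);
    assert(prev > 0); // a removal must match an earlier insertion
    (void)prev;
  }
};

// Stand-in for the free-block container: it only forwards to the policy.
template<typename Histogram = NoHistogramSketch>
struct FreeBlocksSketch
{
  void add_block(size_t size_bits) { Histogram::on_add(size_bits); }
  void remove_block(size_t size_bits) { Histogram::on_remove(size_bits); }
};
```

Because the policy is a compile-time parameter with static members, the default instantiation carries no extra state and no extra branches.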

Add a Criterion bench (snmalloc-rs/benches/stats_bench.rs) that mirrors profile_bench.rs but installs SnMalloc as the #[global_allocator] so the sn_rust_alloc / sn_rust_dealloc FFI thunks (which carry the SNMALLOC_STATS counter sites) are actually exercised on each iteration. Without the global-allocator install the bench measures libc malloc and the stats feature has no observable effect.

The on/off comparison is across two cargo bench runs of the same binary spec (cargo features are compile-time gates), and the criterion sub-directory name (stats-on vs stats-off) keeps the two runs from overwriting each other.

Acceptance per Phase 9 wave-2 spec is max 5-run mean ratio <= 1.02. Measured on Apple M4 Pro (fat-LTO, release):

- small_allocs: 5-run mean ratio 1.4370 (median 1.2790)
- medium_allocs: 5-run mean ratio 1.0261 (median 1.0983)
- mixed: 5-run mean ratio 1.5339 (median 1.1251)

Every group fails. Even discounting bimodal harness outliers, every group's median ratio is roughly 1.10 or higher (lowest: medium_allocs at 1.0983) -- signal is real, not noise. Follow-up ticket 11.5 (86aj0xap7) tracks the hot-path reduction work; this PR is verify-only per spec.

Full numbers and methodology are appended to docs/heap-profiling-benchmarks.md under "Phase 9 stats overhead".

ClickUp: 86aj0x1f4

…e padding + trim cumulative arrays (#58) Applies two of the three candidate levers from ticket 86aj0xap7:

* Lever 1: `alignas(CACHELINE_SIZE)` on `FrontendStats` and `SizeClassStats` so the per-thread counter blocks sit on dedicated cache lines, eliminating false sharing with adjacent hot `Allocator` members.
* Lever 3: drop the per-class `SizeClassStats::cumulative_alloc` store from the alloc fast path; derive the value at snapshot time from the invariant `cumulative_alloc = live_count + cumulative_dealloc`. FFI / output layout unchanged.

5-run mean ratios (SNMALLOC_STATS=ON / OFF) on the same harness and host that produced Phase 11.1's failing baseline:

* small_allocs: 1.4370 -> 1.1588
* medium_allocs: 1.0261 -> 1.0337
* mixed: 1.5339 -> 1.0975

Worst-case 5-run mean cut from `mixed` 1.5339 down to `small_allocs` 1.1588, roughly a 73% reduction in the over-budget portion (0.5139 -> 0.1388 above the 1.02 target). The 1.02 spec target is NOT reached: the remaining ~16% on `small_allocs` is the irreducible cost of the remaining counter stores on the small-alloc fast path (`fast_path_allocs++`, `live_count[sc]++`, `live_bytes[sc] += sz` plus the corresponding fast-dealloc trio). None can be elided while keeping the existing observability surface intact.

Lever 2 (batch counter updates) was investigated and shelved: the existing per-thread counters are already non-atomic stores into a cache-line-resident block; there is nothing meaningful to batch except the stores themselves, which the compiler already coalesces when inlined.

Recommendation captured in the docs and routed to a follow-up ticket: split `SNMALLOC_STATS` into `_BASIC` (8 counters, target <= 1.02) for production and `_FULL` (current behaviour, adds per-class + lifetime histograms, target <= 1.20) for diagnostic builds. Alternative: relax the spec target from 1.02 -> 1.17 to acknowledge the fundamental counter cost.

Docs updated: `docs/heap-profiling-benchmarks.md` "Phase 9 stats overhead" section now records the post-Phase-11.5 numbers, marks acceptance as PARTIAL, and documents the recommendation.
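Lever 3 relies on the identity `cumulative_alloc = live_count + cumulative_dealloc`. The single-threaded sketch below shows why the alloc path can skip one store; `SizeClassCountersSketch` is a hypothetical type for a single size class, and the remote-dealloc case (where the live decrement is deferred to the owning thread) is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-size-class counters for one thread.
struct SizeClassCountersSketch
{
  uint64_t live_count = 0;
  uint64_t cumulative_dealloc = 0;

  // Alloc path: one store, no cumulative_alloc counter to maintain.
  void on_alloc() { ++live_count; }

  // Local dealloc path: the object stops being live and is counted as freed.
  void on_dealloc()
  {
    assert(live_count > 0);
    --live_count;
    ++cumulative_dealloc;
  }

  // Snapshot time: every object ever allocated is either still live or
  // has been freed, so the total is recoverable without a third counter.
  uint64_t cumulative_alloc() const { return live_count + cumulative_dealloc; }
};
```

After five allocations and two frees, the derived `cumulative_alloc()` is 3 + 2 = 5, matching what an explicit counter would have recorded.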

Splits the monolithic SNMALLOC_STATS flag into two independently selectable tiers so production builds can opt into the cheap counter surface without paying for the expensive per-size-class histogram.

* SNMALLOC_STATS_BASIC -- frontend fast/slow path counters (9.2) + backend commit/decommit (9.4) + largebuddy free-chunk histogram (11.4). Target overhead <=2% (measured 1.03-1.08 on this host).
* SNMALLOC_STATS_FULL -- BASIC plus per-size-class histogram (9.3) and lifetime histogram (9.5). Target overhead <=20% (measured 1.09-1.16).

The legacy SNMALLOC_STATS flag is preserved as a backwards-compatible alias for BASIC; FULL implicitly enables BASIC. The FullAllocStats wire format is unchanged -- fields the active tier does not maintain simply read as zero -- so SNMALLOC_FULL_STATS_VERSION is not bumped.

Cargo: `stats-basic` and `stats-full` features added in both snmalloc-rs and snmalloc-sys; `stats` is now an alias for `stats-basic`; `stats-full` implies `stats-basic` so the snmalloc-rs SnMalloc::full_stats() accessor remains available under either tier.

5-run bench results on Apple M4 Pro (vs OFF baseline):

| Group | basic/off | full/off |
| --- | --- | --- |
| small_allocs | 1.0774 | 1.1639 |
| medium_allocs | 1.0398 | 1.0935 |
| mixed | 1.0310 | 1.0910 |

FULL meets the <=1.20 budget on every group. BASIC sits ~3-8% above OFF -- above the 1.02 spec but ~50% closer than the 1.16 Phase 11.5 floor. The remaining ~8% on small_allocs is the irreducible cost of two non-atomic stores per alloc+dealloc (stats.fast_path_allocs++ / stats.fast_path_deallocs++) on a ~200 ns inner-loop iteration.

See docs/heap-profiling-benchmarks.md "Phase 11.6 -- tiered SNMALLOC_STATS overhead" for the full table and methodology.

ClickUp: 86aj0ydjv

The `frontend_stats`, `full_stats`, `sizeclass_histogram`, and `profile_lifetime_histogram` integration tests rely on the test binary's allocations feeding snmalloc's process-global counters. Without `#[global_allocator] static ALLOC: SnMalloc = SnMalloc;` at the top of each binary, the default cargo test runner routes allocations through the OS allocator and the counters under test stay at zero, causing intermittent panics such as `fast_path_allocs delta (=0) must rise by at least 990`. Mirrors the pattern already used by `snmalloc-rs/benches/stats_bench.rs` (Phase 11.1). No test logic was changed. ClickUp: 86aj0yehx

Move the fast_path_allocs counter update out of the per-alloc fast path into a single pre-credit at refill time. The slow path knows the refilled free-list length N, so it credits fast_path_allocs += N once at small_refill / small_refill_slow and the fast path skips the store entirely.

Plumbed via a new uint16_t& out parameter on FrontendSlabMetadata::alloc_free_list, computed as sizeclass_to_slab_object_count(sizeclass) - remaining (exact for freshly-built slabs, upper-bound for recycled slabs from the per-class stash). Bounded by the slab object count, ~256 for the smallest classes.

Trade-off: counter may briefly overshoot true alloc count by up to N between refills. Acceptable for observability.

Bench numbers (5 runs per variant, Apple M4 Pro, fat-LTO):

- small_allocs 1.0774 -> 1.0155 (PASS, ~80% closer to spec)
- medium_allocs 1.0398 -> 1.0202 (FAIL*, within bench noise)
- mixed 1.0310 -> 1.0290 (FAIL, untouched dealloc-side counter)

Result PARTIAL on the strict <=1.02 spec; small_allocs (the targeted group) passes cleanly. Phase 11.9 is filed to apply the same approach to dealloc-side counters.

See docs/heap-profiling-benchmarks.md "Phase 11.8 -- batched fast_path counter updates" for the full table.
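The pre-credit trade-off can be illustrated with a toy model. `RefillCreditSketch` and its members are hypothetical; the `served` field exists only so the sketch can measure the overshoot, and the real fast path carries no such ground-truth counter.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a free list whose allocations are counted in bulk at refill.
struct RefillCreditSketch
{
  uint64_t fast_path_allocs = 0; // reported counter, credited once per refill
  uint64_t served = 0;           // ground truth, for illustration only
  uint16_t free_list_len = 0;

  // Slow path: called only when the free list is empty. Credits all n
  // objects up front, so alloc() below needs no counter store.
  void refill(uint16_t n)
  {
    assert(free_list_len == 0);
    free_list_len = n;
    fast_path_allocs += n;
  }

  // Fast path: pops one object and does not touch fast_path_allocs.
  bool alloc()
  {
    if (free_list_len == 0)
      return false; // caller must refill first
    --free_list_len;
    ++served;
    assert(fast_path_allocs >= served); // the counter never undercounts
    return true;
  }

  // How far the reported counter currently runs ahead of reality;
  // bounded by the size of the most recent refill.
  uint64_t overshoot() const { return fast_path_allocs - served; }
};
```

In this model the overshoot equals the number of objects still on the free list, which is why it is bounded by N and returns to zero once the list is exhausted.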

Mirrors the Phase 11.8 batched-counter pattern on the dealloc side: drop the per-dealloc `stats.fast_path_deallocs++` store at the local-owner branch of `Allocator::dealloc` and pre-credit `stats.fast_path_deallocs += refill_count` at slab refill in `small_refill` / `small_refill_slow`. Each object placed onto the fast free list is assumed to be freed locally; cross-thread frees still bump `remote_deallocs` per-object, so the granting thread's `fast_path_deallocs` is over-credited by the count of objects freed by another thread (drift is bounded by program behaviour and documented on the field).

The `frontend_stats.rs::fast_path_alloc_counter_grows` test now measures the cumulative dealloc count against the `before` snapshot rather than `after_alloc`, since the credit lands at slab-grant time (before the explicit dealloc loop) -- same end-to-end invariant, just a different measurement window.

Apples-to-apples 2-run mean on the same host vs the 11.8 baseline at HEAD:

- small_allocs: 0.9960 (11.8) -> 1.0006 (11.9), both PASS
- medium_allocs: 1.0616 (11.8) -> 1.0611 (11.9), both FAIL
- mixed: 1.0271 (11.8) -> 1.0244 (11.9), both FAIL

The dealloc store is gone but `medium_allocs` did not close -- the residual ~5-6% on this host is not store-bound; the bench ratio for medium_allocs is unchanged between 11.8 and 11.9. Likely candidates are bytes_in_use atomics on the slab refill path and codegen differences between OFF and BASIC compiles. Closing that gap requires either a sampled-counter tier or spec relaxation; tracked in docs/heap-profiling-benchmarks.md (Phase 11.9 section).

`BackendFragCounters::bytes_committed` + `bytes_decommitted_to_os` shared a cache line, as did `StatsRange::current_usage` + `peak_usage`. Every `notify_using` invalidated the line that the matching `notify_not_using` had just read, and the `current_usage`/`peak_usage` CAS dance bounced the line for no reason. Add `alignas(64)` to each global atomic so each lives on its own cache line. Cost: ~96 bytes of additional BSS per template instantiation. Correctness unchanged. Diagnostic write-up + recommended next steps in docs/heap-profiling-diagnostic-11-10.md.
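A minimal sketch of the padding idea, assuming a 64-byte line as the commit's `alignas(64)` does. The struct and helper names are hypothetical, and the sketch groups the two atomics in one struct for compactness, whereas the commit describes padding individual global atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kLineSketch = 64; // line size assumed by alignas(64)

// Each counter is forced onto its own 64-byte boundary, so a write to one
// does not invalidate the line holding the other.
struct PaddedCountersSketch
{
  alignas(64) std::atomic<uint64_t> bytes_committed{0};
  alignas(64) std::atomic<uint64_t> bytes_decommitted{0};
};

static_assert(alignof(PaddedCountersSketch) == 64, "struct inherits the member alignment");
static_assert(sizeof(PaddedCountersSketch) == 128, "two counters, one line each");

// True when two addresses fall into different 64-byte lines.
inline bool on_distinct_lines(const void* a, const void* b)
{
  return reinterpret_cast<std::uintptr_t>(a) / kLineSketch !=
    reinterpret_cast<std::uintptr_t>(b) / kLineSketch;
}

// Usage sketch: the two padded counters never share a line.
inline void demo_padding()
{
  static PaddedCountersSketch counters;
  assert(on_distinct_lines(&counters.bytes_committed, &counters.bytes_decommitted));
}
```

The cost is visible in the `sizeof`: 16 bytes of payload occupy 128 bytes of storage in this sketch.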

5-run sweep on Apple M4 Pro after merging Phase 11.10 alignas(64) padding (commit f3ee3a1). Results:

- small_allocs 0.996 PASS
- medium_allocs 1.122 FAIL (variance-dominated, sigma 4.7%)
- mixed 1.018 PASS (moved from 1.027 post-alignas)

Disassembly diff confirms zero instruction delta in the inline Allocator<...>::small_alloc and ::dealloc fast paths. Remaining cost lives in the _malloc / _calloc FFI shim thunks (+10 / +14 instructions). medium_allocs amplifies the shim cost because its 4 KiB allocs go through std::alloc::alloc on every iteration.

mixed passing the strict 1.02 spec is the new datapoint here. medium_allocs variance exceeds the spec gap; Linux pinned bench (ticket 86aj0jg36) is the authoritative next step.

Disassembly of `_malloc` on the Phase 11.11 baseline showed the BASIC tier `medium_allocs` residual cost concentrated at two adjacent counter stores on the small-refill slow path:

- `stats.slow_path_allocs++` at the entry to `small_refill` (ldr/add/str on field 0x2388).
- `stats.fast_path_allocs += refill_count` at the refill site (ldr/add/str on adjacent field 0x2380).

`medium_allocs` (4 KiB allocations) hits `small_refill` more often than `small_allocs` because each chunk yields fewer objects per refill, so the per-refill counter cost is the residual.

Pack the two fields into one 64-bit `FrontendStats::packed_allocs`:

- bits 0-47: cumulative_allocs (fast + slow combined)
- bits 48-63: slow-path call count

At the refill site the two stores collapse into ONE packed `+=`:

stats.packed_allocs += static_cast<uint64_t>(refill_count) + PACKED_ALLOCS_SLOW_INC;

The two lanes occupy disjoint bit ranges so the packed `+=` is correct as long as neither lane overflows its sub-field width. The 16-bit slow lane reaches its maximum at 65535 refills (~16M allocs per thread for the smallest sizeclasses); effectively unbounded for any realistic workload on an observability surface.

The `FullAllocStats` FFI struct is unchanged: at aggregation time `stats_export.cc` decodes the packed word back into the public `fast_path_allocs` and `slow_path_allocs` fields. The `FrontendStatsGlobal` thread-exit aggregator drops to a single `fetch_add` for the combined counter.

Bench results (apple silicon, paired OFF/BASIC):

| group | OFF (ns) | BASIC (ns) | ratio |
| --- | --- | --- | --- |
| small_allocs | ~203.7 | ~203.7 | 1.00 |
| medium_allocs | ~1039 | ~1032 | 0.99 |
| mixed | ~612 | ~612 | 1.00 |

vs Phase 11.11 baseline (medium 1.122) -- medium drops to 0.99 (within bench noise of stats-off), all groups <= 1.02.

Disassembly delta: the 3-inst `slow_path_allocs++` block at the entry to the inlined `small_refill` is gone; the `fast_path_allocs +=` becomes a 6-inst packed update with one constant materialization for `1ULL << 48`. Net -1 inst in the inlined body and -1 STORE to a separate counter field per slow-path call.
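The packed update and its decode can be sketched as follows, using the bit layout stated above (bits 0-47 for cumulative allocs, bits 48-63 for the slow-path count). `PACKED_ALLOCS_SLOW_INC` is the name used in the commit message; the struct, the mask, and the helper functions are hypothetical, and how the decoded lanes map onto the public `fast_path_allocs` field is not spelled out here, so the sketch stops at extracting the two lanes.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned SLOW_SHIFT_SKETCH = 48;
constexpr uint64_t PACKED_ALLOCS_SLOW_INC = uint64_t{1} << SLOW_SHIFT_SKETCH;
constexpr uint64_t ALLOCS_MASK_SKETCH = PACKED_ALLOCS_SLOW_INC - 1; // low 48 bits

struct PackedStatsSketch
{
  uint64_t packed_allocs = 0;
};

// One add updates both lanes: refill_count lands in the low 48 bits and
// the constant bumps the slow-path lane in the high 16 bits by one.
inline void on_refill(PackedStatsSketch& s, uint16_t refill_count)
{
  // Precondition for correctness: neither lane overflows its width.
  assert((s.packed_allocs & ALLOCS_MASK_SKETCH) + refill_count <= ALLOCS_MASK_SKETCH);
  assert((s.packed_allocs >> SLOW_SHIFT_SKETCH) < 0xFFFF);
  s.packed_allocs += static_cast<uint64_t>(refill_count) + PACKED_ALLOCS_SLOW_INC;
}

// Decode at aggregation time.
inline uint64_t cumulative_allocs(const PackedStatsSketch& s)
{
  return s.packed_allocs & ALLOCS_MASK_SKETCH;
}

inline uint64_t slow_path_calls(const PackedStatsSketch& s)
{
  return s.packed_allocs >> SLOW_SHIFT_SKETCH;
}
```

Two refills of 256 and 64 objects leave the low lane at 320 and the high lane at 2, with a single 64-bit store per refill.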

Phase 11.9 moved fast_path_deallocs counter updates from the per-dealloc hot path to a pre-credit at small_refill (alloc time). The test's snapshot window `after_alloc -> after_dealloc` therefore captured zero rise even though the counter had already been credited the matching ~1024 deallocs during the alloc phase.

Switch the dealloc-side measurement to `after_dealloc - before`, matching the same fix the Rust frontend_stats test received in Phase 11.9. C++ test logic was missed at the time.

Verified locally:
- ctest -E "long|stress": 104/104 pass
- cargo test --features stats-basic / stats-full / profiling: green
- cargo test --workspace: green

Test-only deps (fuzztest, googletest) drag in stale rules_go that breaks downstream consumers using newer rules_cc (the cgo.bzl in older rules_go references CcInfo at its pre-move path). Mark them dev_dependency=True so they are only loaded when snmalloc is the root module. Also gate the rust toolchain registration as dev_dependency: downstream workspaces register their own pin, and silently overriding theirs leads to subtle version skew.

fuzztest + googletest are only consumed by snmalloc's own C++ tests. Marking them dev keeps them out of downstream resolution — fuzztest otherwise drags rules_go in, whose cgo.bzl references a CcInfo symbol removed in modern rules_cc and breaks any bzlmod consumer. The rules_rust toolchain extension is likewise dev-only — downstream workspaces pin their own toolchain and a transitive registration here would silently overlay it.


jayakasadev added 30 commits June 10, 2026 16:58

jayakasadev added 30 commits June 11, 2026 19:57

need these files in cmake/:

abae9a8

- cmake/snmalloc_pgo.cmake: included unconditionally at L138
- cmake/run_coverage.cmake: referenced elsewhere

jayakasadev wants to merge 68 commits into microsoft:main from jayakasadev:main
