
Switch the fingerprint algo to xxh3_128#1630

Open
Dev-iL wants to merge 2 commits into apache:main from SummitSG-LLC:2606/xxhash_algo


@Dev-iL Dev-iL commented Jun 8, 2026

Collaborator

Split 3 of 3 of #1619 (stacked on #1629)

  1. centralize hashing through a single chokepoint + close type collisions (Centralize hashing through _hash_bytes and tag value types #1628).
  2. vectorize the pandas/polars DataFrame paths (Vectorize pandas and polars DataFrame hashing #1629).
  3. this PR — swap the algorithm to xxhash.xxh3_128.

What this does

Change in _hash_bytes: hashlib.md5(data).digest() → xxhash.xxh3_128(data).digest(). xxh3_128 is a non-cryptographic hash designed for speed. It produces a 16-byte (128-bit) digest, the same width as md5, so the base64url-encoded fingerprints stay 24 characters and the format of all downstream interfaces (cache keys, data_version strings) is unchanged; only the fingerprint values themselves differ.
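The width claim can be checked with a short sketch. It assumes the fingerprint is the padded `base64.urlsafe_b64encode` of the raw digest, as in the benchmark's `_md5_compact` helper; the `compact` name is hypothetical, and the xxhash half only runs if the package is installed.

```python
import base64
import hashlib


def compact(digest: bytes) -> str:
    """Base64url-encode a raw digest (padding kept), mirroring _md5_compact."""
    return base64.urlsafe_b64encode(digest).decode()


# md5 yields 16 bytes, which encode to 22 characters plus "==" padding.
md5_fp = compact(hashlib.md5(b"example").digest())
print(len(md5_fp), md5_fp.endswith("=="))

try:
    import xxhash  # third-party; optional for this sketch
except ImportError:
    xxhash = None

if xxhash is not None:
    # xxh3_128 also yields 16 bytes, so the encoded width is identical.
    xxh_fp = compact(xxhash.xxh3_128(b"example").digest())
    print(len(xxh_fp))
```

Any 16-byte digest encodes to the same 24-character width, which is why consumers that only store or compare the string do not need to change.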

Adds xxhash>=0.8.0 as a runtime dependency in pyproject.toml. The BSD 2-Clause license text for python-xxhash is appended to LICENSE.

Benchmark

Benchmark code

# benchmark_hash_algorithm.py
"""Benchmark: xxh3_128 vs md5 on the vectorized DataFrame fingerprint path.

Holds the (vectorized) implementation constant and varies only the hashing
algorithm applied by ``_hash_bytes``, to quantify what switching md5 -> xxh3_128
buys:

  * "hash-step": md5 vs xxh3_128 over the row-hash buffer alone. This is the
    raw algorithm advantage and the ceiling on any end-to-end gain.
  * "end-to-end": the full vectorized fingerprint, comparing the current
    xxh3_128 functions against an md5 reconstruction of the same construction.
    Buffer construction (``hash_pandas_object`` / ``hash_rows``) is identical
    under either algorithm, so this is the realistic improvement on the hot
    path once that fixed cost is included.

Informational only (no pass/fail gate). Run directly:

    python benchmarks/benchmark_hash_algorithm.py
"""

import base64
import hashlib
import os
import platform
import re
import time

import numpy as np
import pandas as pd
import xxhash

from hamilton.caching import fingerprinting as fp

# Swept so the realized speedup can be read as a function of buffer size.
SIZES = (500, 5_000, 50_000, 500_000, 5_000_000)


def _cpu_model() -> str:
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or "unknown"


def _ram_info() -> str:
    import subprocess

    total = "unknown"
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal"):
                    kb = int(re.search(r"\d+", line).group())
                    total = f"{kb / 1024 / 1024:.0f} GiB"
                    break
    except OSError:
        pass
    detail = ""
    try:
        out = subprocess.check_output(
            ["dmidecode", "-t", "memory"], text=True, stderr=subprocess.DEVNULL,
        )
        typ = re.search(r"^\s*Type:\s*(DDR\S*)", out, re.MULTILINE)
        spd = re.search(r"^\s*Configured Memory Speed:\s*(\d+\s*\S+)", out, re.MULTILINE)
        if not spd:
            spd = re.search(r"^\s*Speed:\s*(\d+\s*\S+)", out, re.MULTILINE)
        parts = [m.group(1) for m in (typ, spd) if m]
        if parts:
            detail = f" ({' '.join(parts)})"
    except (OSError, subprocess.CalledProcessError):
        pass
    return f"{total}{detail}"


def _print_env() -> None:
    import sys

    # polars is optional (main() skips it when missing), so do not fail here.
    try:
        import polars as pl

        polars_version = pl.__version__
    except ImportError:
        polars_version = "not installed"

    print("== Environment ==")
    print(f"  CPU      : {_cpu_model()} ({os.cpu_count()} logical cores)")
    print(f"  RAM      : {_ram_info()}")
    print(f"  Platform : {platform.system()} {platform.release()} ({platform.machine()})")
    print(f"  Python   : {sys.version.split()[0]}")
    print(f"  numpy    : {np.__version__}")
    print(f"  pandas   : {pd.__version__}")
    print(f"  polars   : {polars_version}")
    print(f"  xxhash   : {xxhash.VERSION}")
    print()


def _md5_compact(data: bytes) -> str:
    return base64.urlsafe_b64encode(hashlib.md5(data).digest()).decode()


def _numpy_md5(obj) -> str:
    """The numpy path, but hashing the buffer with md5.

    Unlike the DataFrame paths there is no per-row baseline to remove: numpy has
    always hashed ``shape:dtype`` + raw ``tobytes()`` in one shot. Even so the
    end-to-end speedup is bounded well below the raw algorithm speedup, because
    materializing the buffer (``tobytes()`` + concatenation) is a memcpy that
    costs about as much as md5 hashing it.
    """
    metadata = f"{obj.shape}:{obj.dtype}".encode()
    return _md5_compact(b"bytes:" + metadata + obj.tobytes())


def _vectorized_pandas_md5(obj) -> str:
    """The vectorized pandas path, but hashing the buffer with md5."""
    from pandas.util import hash_pandas_object

    row_hashes = hash_pandas_object(obj).values.tobytes()
    if hasattr(obj, "columns"):
        schema = f"{list(obj.columns)}:{[str(dtype) for dtype in obj.dtypes]}"
    else:
        schema = f"{getattr(obj, 'name', None)}:{obj.dtype}"
    return _md5_compact(schema.encode() + row_hashes)


def _vectorized_polars_md5(obj) -> str:
    """The vectorized polars path, but hashing the buffers with md5."""
    schema_str = ",".join(f"{name}:{dtype}" for name, dtype in obj.schema.items())
    schema_hash = _md5_compact(b"bytes:" + schema_str.encode())
    row_hash = _md5_compact(b"bytes:" + obj.hash_rows().to_numpy().tobytes())
    return _md5_compact(schema_hash.encode() + row_hash.encode())


def _time(fn, obj, repeats: int = 3) -> float:
    fn(obj)  # warmup
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(obj)
        best = min(best, time.perf_counter() - start)
    return best


def _report_hash_step(label: str, n_rows: int, buffer: bytes) -> None:
    mib = len(buffer) / 1024 / 1024
    md5_t = _time(lambda b: hashlib.md5(b).digest(), buffer)
    xxh_t = _time(lambda b: xxhash.xxh3_128(b).digest(), buffer)
    print(
        f"[{label} n={n_rows:>9,}] hash-step  md5 {md5_t * 1e3:8.3f} ms ({mib / md5_t:>7,.0f} MiB/s)"
        f"  xxh3 {xxh_t * 1e3:7.3f} ms ({mib / xxh_t:>7,.0f} MiB/s)  speedup {md5_t / xxh_t:5.1f}x"
    )


def _report_end_to_end(label: str, n_rows: int, md5_fn, xxh_fn, obj) -> None:
    md5_t = _time(md5_fn, obj)
    xxh_t = _time(xxh_fn, obj)
    print(
        f"[{label} n={n_rows:>9,}] end-to-end md5 {md5_t * 1e3:8.1f} ms"
        f"  xxh3 {xxh_t * 1e3:8.1f} ms  speedup {md5_t / xxh_t:5.2f}x"
    )


def _columns(n_rows: int) -> dict:
    return {
        "a": range(n_rows),
        "b": [float(i) for i in range(n_rows)],
        "c": [f"row-{i}" for i in range(n_rows)],
    }


def main() -> None:
    _print_env()

    from pandas.util import hash_pandas_object

    try:
        import polars as pl
    except ImportError:
        pl = None
        print("[polars] not installed; skipping polars")

    for n_rows in SIZES:
        columns = _columns(n_rows)

        arr = np.arange(n_rows, dtype=np.int64)
        _report_hash_step("numpy ", n_rows, arr.tobytes())
        _report_end_to_end("numpy ", n_rows, _numpy_md5, fp.hash_value, arr)

        df = pd.DataFrame(columns)
        _report_hash_step("pandas", n_rows, hash_pandas_object(df).values.tobytes())
        _report_end_to_end("pandas", n_rows, _vectorized_pandas_md5, fp.hash_pandas_obj, df)

        if pl is not None:
            pdf = pl.DataFrame(columns)
            _report_hash_step("polars", n_rows, pdf.hash_rows().to_numpy().tobytes())
            _report_end_to_end(
                "polars", n_rows, _vectorized_polars_md5, fp.hash_polars_dataframe, pdf
            )


if __name__ == "__main__":
    main()

Plotting code

# plot_benchmarks.py
"""Generate Plotly charts for PR2 and PR3 benchmark results.

Reads hardcoded benchmark data (not recomputed) and writes two PNG files
into the same directory as this script.

    python benchmarks/plot_benchmarks.py
"""

from pathlib import Path

import plotly.graph_objects as go

OUT_DIR = Path(__file__).parent

SIZE_LABELS = ["500", "5 K", "50 K", "500 K", "5 M"]

COLORS = {
    "pandas": "#1f77b4",
    "polars": "#ff7f0e",
    "numpy": "#2ca02c",
}

# ── PR2 data (baseline → vectorized) ────────────────────────────────────
PR2 = {
    "pandas": [4.2, 11.0, 12.5, 9.8, 6.0],
    "polars": [13.1, 78.1, 382.0, 431.0, 209.0],
}

# ── PR3 data (md5 → xxh3_128, end-to-end) ───────────────────────────────
PR3 = {
    "numpy": [1.36, 3.21, 6.45, 1.44, 1.55],
    "pandas": [1.03, 0.88, 0.98, 1.00, 1.00],
    "polars": [1.25, 1.84, 2.26, 2.26, 1.67],
}


def _build_chart(data: dict, title: str, log_y: bool = False) -> go.Figure:
    fig = go.Figure()

    for backend, speedups in data.items():
        fig.add_trace(
            go.Scatter(
                name=backend,
                x=SIZE_LABELS,
                y=speedups,
                mode="lines+markers+text",
                text=[f"{s:.1f}×" for s in speedups],
                textposition="top center",
                textfont=dict(size=12),
                line=dict(color=COLORS[backend], width=3),
                marker=dict(size=10),
            )
        )

    fig.update_layout(
        title=dict(text=title, x=0.5, y=0.95, yanchor="top"),
        legend=dict(
            yanchor="top",
            y=0.98,
            xanchor="left",
            x=0.02,
            bgcolor="rgba(255,255,255,0.8)",
        ),
        template="plotly_white",
        height=500,
        width=800,
        margin=dict(t=60, b=60),
    )
    fig.update_xaxes(title_text="Input size (rows)")
    fig.update_yaxes(title_text="Speedup (×)", type="log" if log_y else "linear")

    return fig


def main() -> None:
    pr2_fig = _build_chart(
        PR2,
        title="PR2: Vectorized hashing — speedup over per-row baseline",
        log_y=True,
    )
    pr2_path = OUT_DIR / "PR2_chart.png"
    pr2_fig.write_image(str(pr2_path), scale=2)
    print(f"Wrote {pr2_path}")

    pr3_fig = _build_chart(
        PR3,
        title="PR3: xxh3_128 vs md5 — end-to-end speedup",
    )
    pr3_path = OUT_DIR / "PR3_chart.png"
    pr3_fig.write_image(str(pr3_path), scale=2)
    print(f"Wrote {pr3_path}")


if __name__ == "__main__":
    main()

benchmark_hash_algorithm.py isolates the algorithm swap from the vectorization (#1629), holding the implementation constant and varying only the hash. The raw algorithm is ~8–27× faster, but the end-to-end gain depends on how much of total time the hash step occupies:

| rows | backend | md5 (ms) | xxh3 (ms) | speedup |
|---|---|---|---|---|
| 500 | numpy | 0.03 | 0.02 | 1.4× |
| 500 | pandas | 1.0 | 1.0 | 1.0× |
| 500 | polars | 0.1 | 0.1 | 1.3× |
| 5,000 | numpy | 0.1 | 0.03 | 3.2× |
| 5,000 | pandas | 2.9 | 3.3 | 0.9× |
| 5,000 | polars | 0.2 | 0.1 | 1.8× |
| 50,000 | numpy | 0.6 | 0.1 | 6.5× |
| 50,000 | pandas | 32.3 | 33.0 | 1.0× |
| 50,000 | polars | 0.9 | 0.4 | 2.3× |
| 500,000 | numpy | 9.4 | 6.5 | 1.4× |
| 500,000 | pandas | 382 | 382 | 1.0× |
| 500,000 | polars | 9.5 | 4.2 | 2.3× |
| 5,000,000 | numpy | 96.3 | 62.0 | 1.6× |
| 5,000,000 | pandas | 6,434 | 6,405 | 1.0× |
| 5,000,000 | polars | 121 | 72.6 | 1.7× |

CPU: Intel i7-3770 @ 3.40 GHz · 32 GiB DDR3-1600 · pandas 3.0.3 · polars 1.41.2 · xxhash 3.7.0 · Python 3.14.2

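[Figure: PR3_chart. xxh3_128 vs md5 end-to-end speedup (×) by input size (rows) for numpy, pandas and polars.]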

pandas DataFrames: ~1× (negligible) at all sizes. hash_pandas_object takes 1–6,400 ms; the hash step takes 0.006–52 ms with md5. Hashing is ≤1.5% of end-to-end time. The algorithm swap is invisible here.

polars DataFrames: 1.3–2.3×. hash_rows is fast enough that the hash step is a meaningful fraction of end-to-end time. The gain grows from 1.3× at 500 rows to 2.3× at 50k–500k rows as the hash step's share increases, then falls back to 1.7× at 5M rows.

numpy arrays: 1.4–6.5×. Peaks at mid sizes (~50k) where tobytes() is negligible and hashing dominates; at very small sizes per-call overhead limits the gain, at large sizes the memcpy dominates.

Scalars, sequences, mappings (not benchmarked in isolation): estimated ~1.4–2×, dominated by per-call overhead and base64 encoding. These are the most frequent fingerprint calls in a typical DAG (node inputs, configs, primitives).
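The relationship between the raw and end-to-end numbers is Amdahl's law: only the hash step speeds up, so its share of total time caps the gain. A minimal sketch, using the ≤1.5% pandas share and the upper raw speedup (~27×) quoted above; the function name and the 50% share are hypothetical illustrations, not measurements.

```python
def end_to_end_speedup(hash_fraction: float, raw_speedup: float) -> float:
    """Amdahl's law: only hash_fraction of the total time gets raw_speedup faster."""
    return 1.0 / ((1.0 - hash_fraction) + hash_fraction / raw_speedup)


# pandas: hashing is at most ~1.5% of end-to-end time, so even a 27x faster
# hash moves the total by only about 1.5%.
pandas_bound = end_to_end_speedup(0.015, 27.0)

# Hypothetical backend where hashing is half of the total time: the same
# 27x raw speedup yields a bit under 2x end-to-end.
half_share = end_to_end_speedup(0.5, 27.0)

print(f"pandas bound: {pandas_bound:.3f}x, 50% share: {half_share:.2f}x")
```

This is consistent with the measured ~1.0× for pandas and with gains around 2× on paths where buffer construction is cheap.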

Why this is still worth landing

  1. Every _hash_bytes call gets faster — the benefit is broadest on the many small-buffer paths (scalars, sequences, configs) even though it's invisible on pandas DataFrames.
  2. Semantic fit. md5 is a (broken) cryptographic hash misused for speed; xxh3_128 is a non-cryptographic hash designed for exactly this use case.
  3. If the dependency is not acceptable, #1628 (Centralize hashing through _hash_bytes and tag value types) and #1629 (Vectorize pandas and polars DataFrame hashing) stand alone on md5 with no loss of correctness or vectorization.

Testing

Pinned-digest literals in test_fingerprinting.py are updated to the new xxh3_128 values. Relational tests (must-differ, must-match, cross-type collision) and the full caching suite pass unchanged.
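To illustrate why only the pinned literals needed updating, here is a toy sketch. `toy_fingerprint` is hypothetical and is not Hamilton's `hash_value`; sha256 truncated to 16 bytes merely stands in for "a different algorithm" so the example runs on the standard library alone.

```python
import base64
import hashlib


def toy_fingerprint(value, algo: str = "md5") -> str:
    """Hypothetical type-tagged fingerprint, hashed with the named hashlib algorithm."""
    tagged = f"{type(value).__name__}:{value!r}".encode()
    digest = hashlib.new(algo, tagged).digest()[:16]
    return base64.urlsafe_b64encode(digest).decode()


# Relational checks are algorithm-agnostic: they hold before and after a swap.
for algo in ("md5", "sha256"):
    assert toy_fingerprint(1, algo) == toy_fingerprint(1, algo)    # must-match
    assert toy_fingerprint(1, algo) != toy_fingerprint(2, algo)    # must-differ
    assert toy_fingerprint(1, algo) != toy_fingerprint("1", algo)  # cross-type

# A pinned literal is tied to one algorithm and must be recomputed after a swap.
assert toy_fingerprint(1, "md5") != toy_fingerprint(1, "sha256")
```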

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Dev-iL and others added 2 commits June 8, 2026 15:06
Replace the per-row Python loops in the DataFrame fingerprinting paths
with single-buffer hashing:

- pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one
  shot instead of round-tripping through `.to_dict()` and an ordered
  `hash_mapping`; fold column names + dtypes (schema) into the hash so
  frames with identical values but different schemas no longer collide;
  keep the path order-sensitive.
- polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of
  `.to_list()` through a per-element `hash_sequence` loop.

Both paths route through the existing `_hash_bytes` chokepoint, so the
algorithm is unchanged here. The DataFrame digest is deliberately not
pinned to a literal (it depends on library-version-specific dtype reprs);
coverage is via relational schema-collision, dtype-collision and
order-sensitivity tests for both backends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Swap the single `_hash_bytes` chokepoint from md5 to the
non-cryptographic `xxhash.xxh3_128`. xxh3_128 produces a 16-byte digest
(24 base64url chars, identical width to the md5 it replaces), so digest
width and collision resistance are preserved while throughput on
buffer-bound paths rises substantially.

Declare `xxhash>=0.8.0` as a core runtime dependency (xxh3_128 was added
in 0.8.0); fingerprinting is imported eagerly via the caching adapter, so
it must be a hard dependency rather than an optional extra. Add the
xxhash BSD-2-Clause attribution to LICENSE.

Recompute the portable literal-digest pins (primitives, sequences,
mappings, sets, numpy) against xxh3_128. This is a fingerprint-changing
release: prior cached fingerprints no longer match and will be recomputed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Dev-iL Dev-iL force-pushed the 2606/xxhash_algo branch from 53f42bf to b1d0e68 Compare June 8, 2026 12:07
@jernejfrank

Contributor

> benchmark_hash_algorithm.py isolates the algorithm swap from the vectorization (#1629), holding the implementation constant and varying only the hash. The raw algorithm is ~8–27× faster, but the end-to-end gain depends on how much of total time the hash step occupies:
>
> | rows | backend | md5 (ms) | xxh3 (ms) | speedup |
> |---|---|---|---|---|
> | 500 | numpy | 0.03 | 0.02 | 1.4× |
> | 500 | pandas | 1.0 | 1.0 | 1.0× |
> | 500 | polars | 0.1 | 0.1 | 1.3× |
> | 5,000 | numpy | 0.1 | 0.03 | 3.2× |
> | 5,000 | pandas | 2.9 | 3.3 | 0.9× |
> | 5,000 | polars | 0.2 | 0.1 | 1.8× |
> | 50,000 | numpy | 0.6 | 0.1 | 6.5× |
> | 50,000 | pandas | 32.3 | 33.0 | 1.0× |
> | 50,000 | polars | 0.9 | 0.4 | 2.3× |
> | 500,000 | numpy | 9.4 | 6.5 | 1.4× |
> | 500,000 | pandas | 382 | 382 | 1.0× |
> | 500,000 | polars | 9.5 | 4.2 | 2.3× |
> | 5,000,000 | numpy | 96.3 | 62.0 | 1.6× |
> | 5,000,000 | pandas | 6,434 | 6,405 | 1.0× |
> | 5,000,000 | polars | 121 | 72.6 | 1.7× |
>
> CPU: Intel i7-3770 @ 3.40 GHz · 32 GiB DDR3-1600 · pandas 3.0.3 · polars 1.41.2 · xxhash 3.7.0 · Python 3.14.2

Maybe just me, but this seems to give no benefit for pandas, some for polars, and where it would be most useful is numpy (which I am tempted to assume is a marginal use case for Hamilton).

Compared to #1619, it seems like #1628 actually does the lion's share and not the switch to xxhash. I agree with @skrawcz and @elijahbenizzy here that this makes most sense to add as hamilton[cache] to keep the dependency surface small. Just my two pennies.


Dev-iL commented Jun 8, 2026

Collaborator Author

> Maybe just me, but this seems to give no benefit for pandas, some for polars, and where it would be most useful is numpy (which I am tempted to assume is a marginal use case for Hamilton).

Yup. I would look at this differently though:

  • Indeed, the effect of the hashing algo on pandas is insignificant. OTOH, people looking for performance don't stay on pandas.
  • The ~2x speedup on polars is amazing for such a trivial-looking change.
  • I too don't know how common numpy is, but if it's supported by hamilton, I guess there was interest in it at some point. Perhaps this is what people used for performance before polars became popular?

> Compared to #1619, it seems like #1628 actually does the lion's share and not the switch to xxhash. I agree with @skrawcz and @elijahbenizzy here that this makes most sense to add as hamilton[cache] to keep the dependency surface small. Just my two pennies.

  • One of the reasons for splitting the PRs is so we can attribute the performance gains, honestly, to the right code change.

  • It is still not clear to me what all of you mean when you say hamilton[cache]. Do we (a) lock the entire caching functionality behind an extra, or (b) ship "slow" caching by default, with the extra meaning "fast" caching? Claude's analysis of this possibility:

    Two distinct approaches are considered: (A) make xxhash optional with a hashlib fallback, and (B) move caching behind hamilton[caching]. These are not the same and (A) is the dangerous one. An optional fallback means the persisted data_version depends on whether xxhash happens to be installed — so the same value fingerprints differently across machines, and a cache populated where xxhash exists is dead weight where it doesn't. For a feature whose entire pitch (per the docs) is centralizing/sharing cache across pipelines and environments, an environment-dependent fingerprint is a portability footgun. The determinism argument is not a preference, it's a correctness regression.

    Then the question becomes - do we create the new extra now, or do we split it out during a larger refactor when making pandas and numpy optional?
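A small sketch of the portability problem described in option (A). The `fingerprint` function and its `xxhash_installed` flag are hypothetical, simulating two environments; if xxhash is not importable, blake2b is used purely as a stand-in so the sketch still runs.

```python
import base64
import hashlib

try:
    import xxhash  # third-party

    def _fast_digest(data: bytes) -> bytes:
        return xxhash.xxh3_128(data).digest()

except ImportError:
    # Stand-in so the sketch runs without xxhash; NOT what the PR uses.
    def _fast_digest(data: bytes) -> bytes:
        return hashlib.blake2b(data, digest_size=16).digest()


def fingerprint(data: bytes, xxhash_installed: bool) -> str:
    """Hypothetical optional-dependency fingerprint: the algorithm follows the environment."""
    digest = _fast_digest(data) if xxhash_installed else hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode()


value = b"int:42"

# Machine A has the fast hash and populates a shared cache.
shared_cache = {fingerprint(value, xxhash_installed=True): "result from machine A"}

# Machine B falls back to md5 and computes a different key for the same value.
key_on_b = fingerprint(value, xxhash_installed=False)
print(key_on_b in shared_cache)  # False: the shared entry is never found
```

Both keys have the same width and look equally valid, so nothing signals the mismatch; the shared entry is simply never hit.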
