Skip to content

# [Performance] nn.pad reflect/replicate ~1.5x slower than needed since v0.21 (nested if_then_else boundary index) #19848

Description

@junghyunpark2001

Thanks for participating in the TVM community! We use https://discuss.tvm.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals discussion, roadmaps, and bug tracking. You are always welcomed to post on the forum first 😸

Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall at the bottom of the pile. Feel free to reopen a new one if you feel there is an additional problem that needs attention when an old one gets closed.

Expected behavior

nn.pad with pad_mode="reflect" / "replicate" on CUDA should run about as
fast as the equivalent constant / circular paths — the source index is just a
mirror / edge-clamp, which is cheap integer math.

Actual behavior

The dedicated reflect_pad / replicate_pad TOPI compute was introduced in
v0.21.0 (before that nn.pad ignored pad_mode and always emitted a constant
pad), and it has been slower than necessary ever since. On CUDA, reflect and
replicate run ~1.4–1.5x slower than they need to on a [1,64,56,56] pad-by-1,
while constant and circular are flat.

The cost comes from how the boundary index is computed in
python/tvm/topi/nn/pad.py::reflect_pad (and replicate_pad) — a nested
if_then_else per output element:

reflected_idx = if_then_else(
    orig_idx < 0,
    -orig_idx,
    if_then_else(orig_idx >= size, (2 * size - 2) - orig_idx, orig_idx),
)

Rewriting the same index as a branchless integer formula makes it fast again,
with bit-identical output (pad.py already does from tvm import te, so
te.abs / te.min / te.max need no new import):

# reflect-101: reflected = (size-1) - |(size-1) - |orig_idx||
m = size - 1
reflected_idx = m - te.abs(m - te.abs(orig_idx))

# replicate (edge) = clamp(orig_idx, 0, size-1)
clamped_idx = te.max(0, te.min(size - 1, orig_idx))

Measured on a reflect / replicate model (sm_89, CUDA 11.8, ncu
sm__cycles_elapsed.avg):

pad_mode nested if_then_else branchless speedup output
reflect ~22,000 cyc ~14,700 cyc 1.50x bit-identical (max diff 0.0)
replicate ~21,400 cyc ~15,000 cyc 1.43x bit-identical

The repro script below also shows it in plain wall-clock time_evaluator on the
same GPU (v0.24.0): reflect 7.37 → 4.73 us, replicate 7.14 → 4.31 us, while
circular / constant stay flat at ~4.5 us in both — i.e. the nested
if_then_else is exactly what makes reflect/replicate ~1.5x slower than the other
two modes. Same-seed output diff is 0.0 for all modes.

It's exact integer arithmetic, so the result is identical for any shape / config
(not a precision trade-off). constant and circular already use a single
expression and don't have this problem.

I have a working patch for python/tvm/topi/nn/pad.py that makes this swap and
verified it is bit-identical on the corpus models. I'll open a PR with the
fix
— filing this issue first to track it.

Environment

  • TVM: slowdown first appears in v0.21.0 (where the dedicated reflect_pad /
    replicate_pad compute was added) and is unchanged through v0.24.0.
  • Target: CUDA, GPU sm_89, CUDA 11.8.
  • OS: Linux x86_64.

Steps to reproduce

Minimal script. It builds a single nn.pad relax function per mode, lowers it
for CUDA, prints the lowered TIR (so you can see the nested if_then_else), and
times it. It measures reflect and replicate
together in one run
(with circular / constant as flat baselines), and runs
standalone on one TVM version — it reports tvm.__version__ so the output is
self-labeling.

# repro_pad.py
import os
import numpy as np
import tvm
from tvm import relax

PAD_MODES = os.environ.get("PAD_MODES", "reflect,replicate,circular,constant").split(",")
SHAPE = (1, 64, 56, 56)
PAD_W = [0, 0, 0, 0, 1, 1, 1, 1]  # pad H and W by 1 on each side
SHOW_TIR = os.environ.get("SHOW_TIR", "1") == "1"

print(f"tvm.__version__ = {tvm.__version__}")
print(f"tvm.__file__    = {tvm.__file__}")

target = tvm.target.Target("cuda")
dev = tvm.cuda()
# tvm.runtime.tensor (newer) or tvm.nd.array (classic) — both take (np, dev).
_to_nd = getattr(tvm.runtime, "tensor", None) or tvm.nd.array
rng = np.random.default_rng(0)  # fixed seed: same input across versions/patches
data = _to_nd(rng.standard_normal(SHAPE).astype("float32"), dev)


def lower_and_build(mod):
    # Prefer the official relax CUDA pipeline; fall back to the minimal
    # LegalizeOps + DefaultGPUSchedule path on plain builds.
    try:
        from tvm.relax.backend.cuda.pipeline import get_default_pipeline

        mod = get_default_pipeline(target)(mod)
    except Exception:
        mod = relax.transform.LegalizeOps()(mod)
        with target:
            mod = tvm.tir.transform.DefaultGPUSchedule()(mod)
    compile_fn = getattr(tvm, "compile", None)
    return compile_fn(mod, target=target) if compile_fn else relax.build(mod, target=target)


def bench(pad_mode):
    bb = relax.BlockBuilder()
    x = relax.Var("x", relax.TensorStructInfo(SHAPE, "float32"))
    with bb.function("main", [x]):
        with bb.dataflow():
            out = bb.emit(relax.op.nn.pad(x, PAD_W, pad_mode=pad_mode))
            gv = bb.emit_output(out)
        bb.emit_func_output(gv)
    mod = bb.get()

    if SHOW_TIR:
        # after LegalizeOps the pad PrimFunc shows the boundary-index expr
        print(f"\n===== lowered TIR: pad_mode={pad_mode} =====")
        print(relax.transform.LegalizeOps()(mod).script())

    ex = lower_and_build(mod)
    vm = relax.VirtualMachine(ex, dev)
    out = vm["main"](data)  # warmup
    res = vm.time_evaluator("main", dev, number=200, repeat=20)(data)
    return res.mean * 1e6, out.numpy()  # us, output for correctness check


results = {m: bench(m) for m in PAD_MODES}
print("\n===== timing (us) =====")
for m, (us, _) in results.items():
    print(f"{m:<10} {us:8.2f} us")

# Save outputs so a stock-vs-patched run can verify bit-identical results.
out_path = os.environ.get("SAVE_OUT")
if out_path:
    np.savez(out_path, **{m: o for m, (_, o) in results.items()})
    print(f"\nsaved outputs -> {out_path}")

Run it (both regressing modes + baselines in one shot):

python repro_pad.py
# or only the two regressing modes:
PAD_MODES=reflect,replicate python repro_pad.py

Example output on v0.24.0 (RTX 4060 Ti, sm_89) — stock nested if_then_else vs
the branchless patch, same fixed-seed input:

                stock    patched
reflect      7.37 us    4.73 us      (circular/constant ~4.5 us in both)
replicate    7.14 us    4.31 us      max|stock-patch| = 0.0 (bit-identical)

Reproducing across versions. The script does not pick a TVM version itself —
it uses whichever TVM is importable. To compare versions, run the same script
once per version, selecting the TVM build via an environment variable (a
per-version PYTHONPATH, or a per-version venv), e.g.:

for V in v0.21.0 v0.22.0 v0.23.0 v0.24.0; do
  PYTHONPATH=/path/to/tvm-$V/python python repro_pad.py
done

Each run prints its own tvm.__version__ / tvm.__file__, so the outputs are
unambiguous. For clock-invariant numbers (instead of wall-clock
time_evaluator), wrap the same launch in ncu --metrics sm__cycles_elapsed.avg.

Triage

  • needs-triage
  • topi
  • performance

repro_pad.py

Metadata

Metadata

Assignees

No one assigned

    Labels

    needs-triagePRs or issues that need to be investigated by maintainers to find the right assignees to address ittype: bug

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions