Thanks for participating in the TVM community! We use https://discuss.tvm.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals discussion, roadmaps, and bug tracking. You are always welcomed to post on the forum first 😸
Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall at the bottom of the pile. Feel free to reopen a new one if you feel there is an additional problem that needs attention when an old one gets closed.
Expected behavior
nn.pad with pad_mode="reflect" / "replicate" on CUDA should run about as
fast as the equivalent constant / circular paths — the source index is just a
mirror / edge-clamp, which is cheap integer math.
Actual behavior
The dedicated reflect_pad / replicate_pad TOPI compute was introduced in
v0.21.0 (before that nn.pad ignored pad_mode and always emitted a constant
pad), and it has been slower than necessary ever since. On CUDA, reflect and
replicate run ~1.4–1.5x slower than they need to on a [1,64,56,56] pad-by-1,
while constant and circular are flat.
The cost comes from how the boundary index is computed in
python/tvm/topi/nn/pad.py::reflect_pad (and replicate_pad) — a nested
if_then_else per output element:
reflected_idx = if_then_else(
orig_idx < 0,
-orig_idx,
if_then_else(orig_idx >= size, (2 * size - 2) - orig_idx, orig_idx),
)
Rewriting the same index as a branchless integer formula makes it fast again,
with bit-identical output (pad.py already does from tvm import te, so
te.abs / te.min / te.max need no new import):
# reflect-101: reflected = (size-1) - |(size-1) - |orig_idx||
m = size - 1
reflected_idx = m - te.abs(m - te.abs(orig_idx))
# replicate (edge) = clamp(orig_idx, 0, size-1)
clamped_idx = te.max(0, te.min(size - 1, orig_idx))
Measured on a reflect / replicate model (sm_89, CUDA 11.8, ncu
sm__cycles_elapsed.avg):
| pad_mode |
nested if_then_else |
branchless |
speedup |
output |
| reflect |
~22,000 cyc |
~14,700 cyc |
1.50x |
bit-identical (max diff 0.0) |
| replicate |
~21,400 cyc |
~15,000 cyc |
1.43x |
bit-identical |
The repro script below also shows it in plain wall-clock time_evaluator on the
same GPU (v0.24.0): reflect 7.37 → 4.73 us, replicate 7.14 → 4.31 us, while
circular / constant stay flat at ~4.5 us in both — i.e. the nested
if_then_else is exactly what makes reflect/replicate ~1.5x slower than the other
two modes. Same-seed output diff is 0.0 for all modes.
It's exact integer arithmetic, so the result is identical for any shape / config
(not a precision trade-off). constant and circular already use a single
expression and don't have this problem.
I have a working patch for python/tvm/topi/nn/pad.py that makes this swap and
verified it is bit-identical on the corpus models. I'll open a PR with the
fix — filing this issue first to track it.
Environment
- TVM: slowdown first appears in v0.21.0 (where the dedicated
reflect_pad /
replicate_pad compute was added) and is unchanged through v0.24.0.
- Target: CUDA, GPU sm_89, CUDA 11.8.
- OS: Linux x86_64.
Steps to reproduce
Minimal script. It builds a single nn.pad relax function per mode, lowers it
for CUDA, prints the lowered TIR (so you can see the nested if_then_else), and
times it. It measures reflect and replicate
together in one run (with circular / constant as flat baselines), and runs
standalone on one TVM version — it reports tvm.__version__ so the output is
self-labeling.
# repro_pad.py
import os
import numpy as np
import tvm
from tvm import relax
PAD_MODES = os.environ.get("PAD_MODES", "reflect,replicate,circular,constant").split(",")
SHAPE = (1, 64, 56, 56)
PAD_W = [0, 0, 0, 0, 1, 1, 1, 1] # pad H and W by 1 on each side
SHOW_TIR = os.environ.get("SHOW_TIR", "1") == "1"
print(f"tvm.__version__ = {tvm.__version__}")
print(f"tvm.__file__ = {tvm.__file__}")
target = tvm.target.Target("cuda")
dev = tvm.cuda()
# tvm.runtime.tensor (newer) or tvm.nd.array (classic) — both take (np, dev).
_to_nd = getattr(tvm.runtime, "tensor", None) or tvm.nd.array
rng = np.random.default_rng(0) # fixed seed: same input across versions/patches
data = _to_nd(rng.standard_normal(SHAPE).astype("float32"), dev)
def lower_and_build(mod):
# Prefer the official relax CUDA pipeline; fall back to the minimal
# LegalizeOps + DefaultGPUSchedule path on plain builds.
try:
from tvm.relax.backend.cuda.pipeline import get_default_pipeline
mod = get_default_pipeline(target)(mod)
except Exception:
mod = relax.transform.LegalizeOps()(mod)
with target:
mod = tvm.tir.transform.DefaultGPUSchedule()(mod)
compile_fn = getattr(tvm, "compile", None)
return compile_fn(mod, target=target) if compile_fn else relax.build(mod, target=target)
def bench(pad_mode):
bb = relax.BlockBuilder()
x = relax.Var("x", relax.TensorStructInfo(SHAPE, "float32"))
with bb.function("main", [x]):
with bb.dataflow():
out = bb.emit(relax.op.nn.pad(x, PAD_W, pad_mode=pad_mode))
gv = bb.emit_output(out)
bb.emit_func_output(gv)
mod = bb.get()
if SHOW_TIR:
# after LegalizeOps the pad PrimFunc shows the boundary-index expr
print(f"\n===== lowered TIR: pad_mode={pad_mode} =====")
print(relax.transform.LegalizeOps()(mod).script())
ex = lower_and_build(mod)
vm = relax.VirtualMachine(ex, dev)
out = vm["main"](data) # warmup
res = vm.time_evaluator("main", dev, number=200, repeat=20)(data)
return res.mean * 1e6, out.numpy() # us, output for correctness check
results = {m: bench(m) for m in PAD_MODES}
print("\n===== timing (us) =====")
for m, (us, _) in results.items():
print(f"{m:<10} {us:8.2f} us")
# Save outputs so a stock-vs-patched run can verify bit-identical results.
out_path = os.environ.get("SAVE_OUT")
if out_path:
np.savez(out_path, **{m: o for m, (_, o) in results.items()})
print(f"\nsaved outputs -> {out_path}")
Run it (both regressing modes + baselines in one shot):
python repro_pad.py
# or only the two regressing modes:
PAD_MODES=reflect,replicate python repro_pad.py
Example output on v0.24.0 (RTX 4060 Ti, sm_89) — stock nested if_then_else vs
the branchless patch, same fixed-seed input:
stock patched
reflect 7.37 us 4.73 us (circular/constant ~4.5 us in both)
replicate 7.14 us 4.31 us max|stock-patch| = 0.0 (bit-identical)
Reproducing across versions. The script does not pick a TVM version itself —
it uses whichever TVM is importable. To compare versions, run the same script
once per version, selecting the TVM build via an environment variable (a
per-version PYTHONPATH, or a per-version venv), e.g.:
for V in v0.21.0 v0.22.0 v0.23.0 v0.24.0; do
PYTHONPATH=/path/to/tvm-$V/python python repro_pad.py
done
Each run prints its own tvm.__version__ / tvm.__file__, so the outputs are
unambiguous. For clock-invariant numbers (instead of wall-clock
time_evaluator), wrap the same launch in ncu --metrics sm__cycles_elapsed.avg.
Triage
- needs-triage
- topi
- performance
repro_pad.py
Thanks for participating in the TVM community! We use https://discuss.tvm.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals discussion, roadmaps, and bug tracking. You are always welcomed to post on the forum first 😸
Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall at the bottom of the pile. Feel free to reopen a new one if you feel there is an additional problem that needs attention when an old one gets closed.
Expected behavior
nn.padwithpad_mode="reflect"/"replicate"on CUDA should run about asfast as the equivalent
constant/circularpaths — the source index is just amirror / edge-clamp, which is cheap integer math.
Actual behavior
The dedicated
reflect_pad/replicate_padTOPI compute was introduced inv0.21.0 (before that
nn.padignoredpad_modeand always emitted a constantpad), and it has been slower than necessary ever since. On CUDA,
reflectandreplicaterun ~1.4–1.5x slower than they need to on a[1,64,56,56]pad-by-1,while
constantandcircularare flat.The cost comes from how the boundary index is computed in
python/tvm/topi/nn/pad.py::reflect_pad(andreplicate_pad) — a nestedif_then_elseper output element:Rewriting the same index as a branchless integer formula makes it fast again,
with bit-identical output (
pad.pyalready doesfrom tvm import te, sote.abs/te.min/te.maxneed no new import):Measured on a reflect / replicate model (sm_89, CUDA 11.8, ncu
sm__cycles_elapsed.avg):The repro script below also shows it in plain wall-clock
time_evaluatoron thesame GPU (v0.24.0): reflect
7.37 → 4.73 us, replicate7.14 → 4.31 us, whilecircular/constantstay flat at ~4.5 us in both — i.e. the nestedif_then_elseis exactly what makes reflect/replicate ~1.5x slower than the othertwo modes. Same-seed output diff is
0.0for all modes.It's exact integer arithmetic, so the result is identical for any shape / config
(not a precision trade-off).
constantandcircularalready use a singleexpression and don't have this problem.
I have a working patch for
python/tvm/topi/nn/pad.pythat makes this swap andverified it is bit-identical on the corpus models. I'll open a PR with the
fix — filing this issue first to track it.
Environment
reflect_pad/replicate_padcompute was added) and is unchanged through v0.24.0.Steps to reproduce
Minimal script. It builds a single
nn.padrelax function per mode, lowers itfor CUDA, prints the lowered TIR (so you can see the nested
if_then_else), andtimes it. It measures
reflectandreplicatetogether in one run (with
circular/constantas flat baselines), and runsstandalone on one TVM version — it reports
tvm.__version__so the output isself-labeling.
Run it (both regressing modes + baselines in one shot):
python repro_pad.py # or only the two regressing modes: PAD_MODES=reflect,replicate python repro_pad.pyExample output on v0.24.0 (RTX 4060 Ti, sm_89) — stock nested
if_then_elsevsthe branchless patch, same fixed-seed input:
Reproducing across versions. The script does not pick a TVM version itself —
it uses whichever TVM is importable. To compare versions, run the same script
once per version, selecting the TVM build via an environment variable (a
per-version
PYTHONPATH, or a per-version venv), e.g.:Each run prints its own
tvm.__version__/tvm.__file__, so the outputs areunambiguous. For clock-invariant numbers (instead of wall-clock
time_evaluator), wrap the same launch inncu --metrics sm__cycles_elapsed.avg.Triage
repro_pad.py