# [Performance] `nn.pad` reflect/replicate ~1.5x slower than needed since v0.21 (nested `if_then_else` boundary index)

Thanks for participating in the TVM community! We use https://discuss.tvm.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals discussion, roadmaps, and bug tracking.  You are always welcomed to post on the forum first :smile_cat:

Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall at the bottom of the pile. Feel free to reopen a new one if you feel there is an additional problem that needs attention when an old one gets closed.

### Expected behavior

`nn.pad` with `pad_mode="reflect"` / `"replicate"` on CUDA should run about as
fast as the equivalent `constant` / `circular` paths — the source index is just a
mirror / edge-clamp, which is cheap integer math.

### Actual behavior

The dedicated `reflect_pad` / `replicate_pad` TOPI compute was introduced in
**v0.21.0** (before that `nn.pad` ignored `pad_mode` and always emitted a constant
pad), and it has been slower than necessary ever since. On CUDA, `reflect` and
`replicate` run ~1.4–1.5x slower than they need to on a `[1,64,56,56]` pad-by-1,
while `constant` and `circular` are flat.

The cost comes from how the boundary index is computed in
`python/tvm/topi/nn/pad.py::reflect_pad` (and `replicate_pad`) — a nested
`if_then_else` per output element:

```python
reflected_idx = if_then_else(
    orig_idx < 0,
    -orig_idx,
    if_then_else(orig_idx >= size, (2 * size - 2) - orig_idx, orig_idx),
)
```

Rewriting the *same* index as a branchless integer formula makes it fast again,
with **bit-identical** output (`pad.py` already does `from tvm import te`, so
`te.abs` / `te.min` / `te.max` need no new import):

```python
# reflect-101: reflected = (size-1) - |(size-1) - |orig_idx||
m = size - 1
reflected_idx = m - te.abs(m - te.abs(orig_idx))

# replicate (edge) = clamp(orig_idx, 0, size-1)
clamped_idx = te.max(0, te.min(size - 1, orig_idx))
```

Measured on a reflect / replicate model (sm_89, CUDA 11.8, ncu
`sm__cycles_elapsed.avg`):

| pad_mode  | nested if_then_else | branchless | speedup | output |
|-----------|---------------------|------------|---------|--------|
| reflect   | ~22,000 cyc         | ~14,700 cyc | **1.50x** | bit-identical (max diff 0.0) |
| replicate | ~21,400 cyc         | ~15,000 cyc | **1.43x** | bit-identical |

The repro script below also shows it in plain wall-clock `time_evaluator` on the
same GPU (v0.24.0): reflect `7.37 → 4.73 us`, replicate `7.14 → 4.31 us`, while
`circular` / `constant` stay flat at ~4.5 us in both — i.e. the nested
`if_then_else` is exactly what makes reflect/replicate ~1.5x slower than the other
two modes. Same-seed output diff is `0.0` for all modes.

It's exact integer arithmetic, so the result is identical for any shape / config
(not a precision trade-off). `constant` and `circular` already use a single
expression and don't have this problem.

I have a working patch for `python/tvm/topi/nn/pad.py` that makes this swap and
verified it is bit-identical on the corpus models. **I'll open a PR with the
fix** — filing this issue first to track it.

### Environment

- TVM: slowdown first appears in **v0.21.0** (where the dedicated `reflect_pad` /
  `replicate_pad` compute was added) and is unchanged through **v0.24.0**.
- Target: CUDA, GPU sm_89, CUDA 11.8.
- OS: Linux x86_64.

### Steps to reproduce

Minimal script. It builds a single `nn.pad` relax function per mode, lowers it
for CUDA, prints the lowered TIR (so you can see the nested `if_then_else`), and
times it. It measures **`reflect` and `replicate`
together in one run** (with `circular` / `constant` as flat baselines), and runs
**standalone on one TVM version** — it reports `tvm.__version__` so the output is
self-labeling.

```python
# repro_pad.py
import os
import numpy as np
import tvm
from tvm import relax

PAD_MODES = os.environ.get("PAD_MODES", "reflect,replicate,circular,constant").split(",")
SHAPE = (1, 64, 56, 56)
PAD_W = [0, 0, 0, 0, 1, 1, 1, 1]  # pad H and W by 1 on each side
SHOW_TIR = os.environ.get("SHOW_TIR", "1") == "1"

print(f"tvm.__version__ = {tvm.__version__}")
print(f"tvm.__file__    = {tvm.__file__}")

target = tvm.target.Target("cuda")
dev = tvm.cuda()
# tvm.runtime.tensor (newer) or tvm.nd.array (classic) — both take (np, dev).
_to_nd = getattr(tvm.runtime, "tensor", None) or tvm.nd.array
rng = np.random.default_rng(0)  # fixed seed: same input across versions/patches
data = _to_nd(rng.standard_normal(SHAPE).astype("float32"), dev)


def lower_and_build(mod):
    # Prefer the official relax CUDA pipeline; fall back to the minimal
    # LegalizeOps + DefaultGPUSchedule path on plain builds.
    try:
        from tvm.relax.backend.cuda.pipeline import get_default_pipeline

        mod = get_default_pipeline(target)(mod)
    except Exception:
        mod = relax.transform.LegalizeOps()(mod)
        with target:
            mod = tvm.tir.transform.DefaultGPUSchedule()(mod)
    compile_fn = getattr(tvm, "compile", None)
    return compile_fn(mod, target=target) if compile_fn else relax.build(mod, target=target)


def bench(pad_mode):
    bb = relax.BlockBuilder()
    x = relax.Var("x", relax.TensorStructInfo(SHAPE, "float32"))
    with bb.function("main", [x]):
        with bb.dataflow():
            out = bb.emit(relax.op.nn.pad(x, PAD_W, pad_mode=pad_mode))
            gv = bb.emit_output(out)
        bb.emit_func_output(gv)
    mod = bb.get()

    if SHOW_TIR:
        # after LegalizeOps the pad PrimFunc shows the boundary-index expr
        print(f"\n===== lowered TIR: pad_mode={pad_mode} =====")
        print(relax.transform.LegalizeOps()(mod).script())

    ex = lower_and_build(mod)
    vm = relax.VirtualMachine(ex, dev)
    out = vm["main"](data)  # warmup
    res = vm.time_evaluator("main", dev, number=200, repeat=20)(data)
    return res.mean * 1e6, out.numpy()  # us, output for correctness check


results = {m: bench(m) for m in PAD_MODES}
print("\n===== timing (us) =====")
for m, (us, _) in results.items():
    print(f"{m:<10} {us:8.2f} us")

# Save outputs so a stock-vs-patched run can verify bit-identical results.
out_path = os.environ.get("SAVE_OUT")
if out_path:
    np.savez(out_path, **{m: o for m, (_, o) in results.items()})
    print(f"\nsaved outputs -> {out_path}")
```

Run it (both regressing modes + baselines in one shot):

```bash
python repro_pad.py
# or only the two regressing modes:
PAD_MODES=reflect,replicate python repro_pad.py
```

Example output on v0.24.0 (RTX 4060 Ti, sm_89) — stock nested `if_then_else` vs
the branchless patch, same fixed-seed input:

```
                stock    patched
reflect      7.37 us    4.73 us      (circular/constant ~4.5 us in both)
replicate    7.14 us    4.31 us      max|stock-patch| = 0.0 (bit-identical)
```

**Reproducing across versions.** The script does not pick a TVM version itself —
it uses whichever TVM is importable. To compare versions, run the *same* script
once per version, selecting the TVM build via an environment variable (a
per-version `PYTHONPATH`, or a per-version venv), e.g.:

```bash
for V in v0.21.0 v0.22.0 v0.23.0 v0.24.0; do
  PYTHONPATH=/path/to/tvm-$V/python python repro_pad.py
done
```

Each run prints its own `tvm.__version__` / `tvm.__file__`, so the outputs are
unambiguous. For clock-invariant numbers (instead of wall-clock
`time_evaluator`), wrap the same launch in `ncu --metrics sm__cycles_elapsed.avg`.

### Triage

* needs-triage
* topi
* performance

[repro_pad.py](https://github.com/user-attachments/files/29156294/repro_pad.py)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

# [Performance] `nn.pad` reflect/replicate ~1.5x slower than needed since v0.21 (nested `if_then_else` boundary index) #19848

Expected behavior

Actual behavior

Environment

Steps to reproduce

Triage

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

pad_mode	nested if_then_else	branchless	speedup	output
reflect	~22,000 cyc	~14,700 cyc	1.50x	bit-identical (max diff 0.0)
replicate	~21,400 cyc	~15,000 cyc	1.43x	bit-identical

# [Performance] nn.pad reflect/replicate ~1.5x slower than needed since v0.21 (nested if_then_else boundary index) #19848

Description

Expected behavior

Actual behavior

Environment

Steps to reproduce

Triage

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

# [Performance] `nn.pad` reflect/replicate ~1.5x slower than needed since v0.21 (nested `if_then_else` boundary index) #19848