### 🐛 Describe the bug

**Description**

A small model exported through ExecuTorch lowers successfully, but fails at runtime when the input tensor uses the PyTorch `channels_last` memory format. The same model runs correctly with a contiguous input. The `channels_last` input is valid in PyTorch, and eager execution succeeds.
**Reproducer**

```python
import importlib.metadata

import torch
from torch.export import export

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower
from executorch.extension.pybindings.portable_lib import _load_for_executorch_from_buffer


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(1014)
        self.conv = torch.nn.Conv2d(5, 11, kernel_size=3, padding=1)

    def forward(self, x):
        y = torch.sigmoid(self.conv(x))
        return y.mean(dim=(2, 3))


print("torch:", torch.__version__)
print("executorch:", importlib.metadata.version("executorch"))

model = Model().eval()
x = torch.randn(2, 5, 13, 7).to(memory_format=torch.channels_last)
print("shape:", tuple(x.shape))
print("stride:", tuple(x.stride()))
print("is_contiguous:", x.is_contiguous())
print("is_channels_last:", x.is_contiguous(memory_format=torch.channels_last))

ref = model(x)
ep = export(model, (x,), strict=True)
et = to_edge_transform_and_lower(
    ep,
    partitioner=[XnnpackPartitioner()],
).to_executorch()
mod = _load_for_executorch_from_buffer(et.buffer)
out = mod.run_method("forward", (x,))[0]
print("max diff:", (out - ref).abs().max().item())
```
**Actual Behavior**

Lowering succeeds, but runtime execution fails:

```
[tensor_util_portable.cpp:130] Check failed (all_contiguous || all_channels_last): 2 input tensors have different dim orders
[op_mean.cpp:37] Check failed (tensors_have_same_dim_order(in, out)):
[method.cpp:1387] KernelCall failed at instruction 0:1 in operator aten::mean.out: 0x12
RuntimeError: Failed to execute method forward, error: 0x12
```
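For context on the first log line: a "dim order" is the permutation of a tensor's dimensions sorted by decreasing stride, i.e. its physical layout order. A minimal illustrative sketch (the `dim_order` helper below is my own, not an ExecuTorch API):

```python
import torch

def dim_order(t: torch.Tensor):
    # Physical layout order of the logical dims: sort dims by decreasing
    # stride (ignores ties from size-1 dims, which don't occur here).
    return tuple(sorted(range(t.dim()), key=lambda d: -t.stride(d)))

x = torch.randn(2, 5, 13, 7)
print(dim_order(x))                                        # (0, 1, 2, 3): contiguous (NCHW) order
print(dim_order(x.to(memory_format=torch.channels_last)))  # (0, 2, 3, 1): channels_last (NHWC) order
```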
**Expected Behavior**

The program should either:

- run successfully and match PyTorch eager output, or
- fail during lowering / validation with a clear unsupported-layout diagnostic.

It seems undesirable for the official XNNPACK lowering path to successfully produce a program that later fails at runtime due to a dim order mismatch.
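To illustrate the second option, a hedged sketch of the kind of caller-side pre-lowering guard meant here (the helper name and message are hypothetical, not an existing ExecuTorch API):

```python
import torch

def assert_supported_input_layout(x: torch.Tensor) -> None:
    # Hypothetical guard: reject channels_last example inputs up front
    # instead of letting the lowered program fail inside the runtime.
    if x.dim() == 4 and not x.is_contiguous() and x.is_contiguous(
        memory_format=torch.channels_last
    ):
        raise ValueError(
            "channels_last inputs currently trip a dim-order mismatch at "
            "runtime; pass x.contiguous() or keep the input contiguous"
        )
```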
**Control Case**

Changing only the input layout to contiguous makes the same model pass:

```python
x = torch.randn(2, 5, 13, 7)
```

Observed result:

```
runtime: ok
max_diff: ~2.4e-07
```
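Relatedly, a caller-side workaround (not a fix) that I'd expect to behave like the control case is to copy the `channels_last` input to contiguous layout at call time; a minimal sketch reusing `model`, `mod`, and `x` from the reproducer:

```python
# Workaround sketch (assumes the names from the reproducer above):
# pay one layout copy so the runtime only ever sees contiguous inputs.
x_contig = x.contiguous()
out = mod.run_method("forward", (x_contig,))[0]
print("max diff vs eager:", (out - model(x)).abs().max().item())
```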
**Notes**

The failing input is a valid PyTorch `channels_last` tensor:

```
shape: (2, 5, 13, 7)
stride: (455, 1, 35, 5)
is_contiguous: False
is_channels_last: True
```

The model is ordinary inference code: `Conv2d -> sigmoid -> mean(dim=(2, 3))`.

The failure looks related to dim order propagation across XNNPACK delegated ops and the remaining portable `aten::mean.out` op, but I may be misreading the lowering boundary.

The same basic failure also reproduces through the portable lowering path (`executorch.exir.to_edge(...).to_executorch()`), so this may be broader than an XNNPACK-only issue.
Additional targeted scouting found similar channels_last-only runtime failures for `sum`, `softmax`, `slice_copy`, `permute_copy`, and `add` consumers after `Conv2d`; contiguous controls passed in these cases (see the sweep sketch below).
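To make that scouting concrete, here is a hedged sketch of the sweep harness; the consumer lambdas are my reconstruction of the ops named above (not verbatim from the original runs), and each variant goes through the same export/lower/execute pipeline as the main reproducer:

```python
import torch

# Sweep sketch (assumed reconstruction): wrap Conv2d with each consumer op
# and run it through the same export/lower/execute pipeline as above,
# once with a channels_last input and once with a contiguous control.
consumers = {
    "sum": lambda y: y.sum(dim=(2, 3)),
    "softmax": lambda y: torch.softmax(y, dim=1),
    "slice_copy": lambda y: y[:, :, 1:-1, :],
    "permute_copy": lambda y: y.permute(0, 2, 3, 1),
    "add": lambda y: y + 1.0,
}

class ConvThen(torch.nn.Module):
    def __init__(self, op):
        super().__init__()
        self.conv = torch.nn.Conv2d(5, 11, kernel_size=3, padding=1)
        self.op = op

    def forward(self, x):
        return self.op(self.conv(x))

# for name, op in consumers.items():
#     export/lower ConvThen(op) as in the reproducer, then execute with
#     channels_last x and with a contiguous control; only the
#     channels_last runs failed.
```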
**Local Triage**

- Stable repro: yes, reproduced on torch 2.11.0 + executorch 1.2.0.
- Nightly repro: not verified locally. In `/data1/tzh/pt_nightly_env` (torch 2.13.0.dev20260425 + executorch 1.2.0), even contiguous control cases fail/abort in the runtime with unrelated-looking metadata/type errors, so that environment is not a clean validation target.
- Contiguous control: passes.
- Invalid-argument check: the input is a valid PyTorch `channels_last` tensor, and eager execution succeeds.
- Existing reports: searched the exact error strings below and did not find a matching public issue:
  - `tensors_have_same_dim_order`
  - `op_mean.cpp`
  - `Attempted to resize a static tensor`
  - `XNNExecutor.cpp`
  - `all_contiguous || all_channels_last`
**Versions**

- torch: 2.11.0+cu130
- executorch: 1.2.0
- Python: 3.11
- Platform: Linux x86_64
- Backend: XNNPACK and portable runtime
- API path: `torch.export.export` -> `executorch.exir.to_edge_transform_and_lower(..., partitioner=[XnnpackPartitioner()]).to_executorch()`
- Python runtime: `_load_for_executorch_from_buffer`
cc @GregoryComer @digantdesai @cbilgin
2.11.0+cu1301.2.03.11torch.export.exportexecutorch.exir.to_edge_transform_and_lower(..., partitioner=[XnnpackPartitioner()]).to_executorch()_load_for_executorch_from_buffercc @GregoryComer @digantdesai @cbilgin