Merge latest develop into meta/fmha-fwd-block-sparsity (resolve conflicts) by poyenc · Pull Request #1 · goldcoderZ/rocm-libraries

poyenc · 2026-06-19T19:23:22Z

Summary

Resolves the merge conflicts blocking PR ROCm#6534 by merging the latest ROCm/rocm-libraries:develop into this branch.

The branch had diverged: it was 39 commits ahead of its merge base, while develop had moved ~5455 commits past that base. Three files conflicted, all in the FMHA forward example/test:

example/ck_tile/01_fmha/example_fmha_fwd.cpp
example/ck_tile/01_fmha/fmha_fwd_runner.hpp
test/ck_tile/fmha/test_fmha_fwd.cpp

Resolution

All conflicts were the same "both-added" shape: this branch added a block_mask parameter while develop independently added pack_gqa. Both were kept, with two correctness details:

In the .insert(...) fluent arg-parser chain, dropped the statement-terminating ; after block_mask so the chain continues into pack_gqa.
Ordered arguments block_mask_str -> pack_gqa at every call site to match the runner signature in fmha_fwd_runner.hpp (positional args).

Formatting

C++ files formatted with clang-format-18 using projects/composablekernel/.clang-format.
No Python files were touched.

Verification

Zero conflict markers remain.
git diff of the merge result vs origin/develop shows only this branch's block_mask additions; develop's pack_gqa is preserved.
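
The "zero conflict markers" check can be reproduced with a small scan. This is only a sketch, not the script used for this PR: the helper name find_conflict_markers is hypothetical, and it assumes the standard 7-character git markers (including the ||||||| line written by the diff3 conflict style).

```python
import re

# A conflict marker is exactly seven '<', '=', '>' or '|' characters at the
# start of a line, followed by a space or the end of the line.
MARKER = re.compile(r"^(<{7}|={7}|>{7}|\|{7})( |$)")


def find_conflict_markers(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines that look like git conflict markers."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if MARKER.match(line)
    ]


if __name__ == "__main__":
    sample = "int a;\n<<<<<<< HEAD\nblock_mask\n=======\npack_gqa\n>>>>>>> develop\n"
    for number, line in find_conflict_markers(sample):
        print(f"{number}: {line}")
```

A line of exactly seven = characters can also be a legitimate heading underline in reStructuredText or Markdown, so hits in documentation files need a manual look; in the three C++ files listed above, any hit would be a leftover marker.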

How to land

Merge this into meta/fmha-fwd-block-sparsity; that advances the head branch of ROCm#6534 and clears its conflict state against develop.

…Cm#8282)

## Motivation

When using the `invoke` workflow (`invoke rocisa` → edit rocisa C++ → `invoke build-client` → `Tensile/bin/Tensile`), rocisa bindings could become stale:

1. **Stale bindings bug:** `invoke build-client` only rebuilds a bundled `_rocisa.so` in `build_tmp/`, but `Tensile/bin/Tensile` imports from the *editable* `_rocisa.so` in site-packages. Editing rocisa C++ and running only `build-client` left the editable bindings stale, surfacing as an `ImportError` if `_build_info.py` was present, or silently using wrong bindings otherwise.
2. **Wasted compilation:** The bundled `_rocisa.so` in `build_tmp/` was always built (due to a flag bug) even though it's unused in the `tensilelite` preset configuration (`HIPBLASLT_ENABLE_DEVICE=OFF` means no device-library codegen).

## Technical Details

### Commit 1: Fix `--bundle-python-deps` flag

The `bundle_python_deps` parameter defaulted to `False`, but the code only passed `-DHIPBLASLT_BUNDLE_PYTHON_DEPS=ON` when `True` and nothing when `False`. Since the CMake option defaults to `ON`, the flag was effectively a no-op.

Fix: Always pass the value explicitly so `False` → `=OFF`. This eliminates the unnecessary compilation of the bundled `_rocisa.so` during `invoke build-client`.

### Commit 2: Refresh editable rocisa in `build-client`

`build-client` now re-runs the editable install (`pip install -e rocisa/`) before building the client. This is safe and fast because:

- **Conditional:** Only triggers when rocisa is installed *editable* (detected via PEP 610 `direct_url.json`). Never clobbers tox's non-editable install or attempts to install rocisa where absent.
- **Graceful degradation:** Warns (doesn't fail) when build backend (`scikit-build-core`/`nanobind`) is unavailable.
- **Incremental:** New persistent `build-dir` in `pyproject.toml` makes reinstalls a cmake/make no-op when nothing changed.

New flag `--no-rebuild-rocisa` to opt out.

### Commit 3: Add rocisa build dependencies to tox.ini

Add `scikit-build-core` to tox deps (alongside existing `nanobind`) so the rocisa build toolchain is available in tox environments.

## Test Plan

- [x] `invoke rocisa` still works (initial editable install)
- [x] `invoke build-client` works when rocisa is NOT installed (no-op rebuild, no failure)
- [x] `invoke build-client` works when rocisa IS installed editable (triggers rebuild)
- [x] After editing a rocisa `.cpp` file, `invoke build-client` followed by `import rocisa` works (no stale `ImportError`)
- [x] `invoke build-client --no-rebuild-rocisa` skips the rocisa rebuild
- [x] `invoke build-client --bundle-python-deps` still builds the bundled `_rocisa.so` when explicitly requested
- [x] Default `invoke build-client` passes `HIPBLASLT_BUNDLE_PYTHON_DEPS=OFF` (verified in CMakeCache.txt)
- [x] tox workflows (`tox -e rocisa`: 48 tests, 0 failures)

## Test Results

| Test | Result |
|------|--------|
| `invoke rocisa` (initial editable install) | ✅ Pass |
| `invoke build-client` when rocisa NOT installed | ✅ Pass (no-op, no failure) |
| `invoke build-client` when rocisa IS installed editable | ✅ Pass (triggers rebuild) |
| Edit `.cpp` → `invoke build-client` → `import rocisa` | ✅ Pass (no stale ImportError) |
| `--no-rebuild-rocisa` skips rebuild | ✅ Pass |
| `--bundle-python-deps` enables bundling | ✅ Pass (`BUNDLE=ON` in cache) |
| Default passes `BUNDLE=OFF` | ✅ Pass (no wasted `_rocisa` compilation) |
| Full cmake configure | ✅ Pass |
| `tox -e rocisa` | ✅ Pass (48 tests, 0 failures, 0 errors) |

## Checklist

- [x] Code compiles without errors
- [x] New helper functions have docstrings explaining behavior and edge cases
- [x] Changes are backwards compatible (new behavior is opt-out via `--no-rebuild-rocisa`)
- [x] Tested on a system with rocisa build toolchain
- [x] tox workflows verified

Co-authored-by: Nathan Henderson <nathan.henderson@amd.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
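
The two mechanisms described in Commits 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the function names `bundle_flag`, `is_editable_direct_url`, and `is_installed_editable` are hypothetical, and the only facts taken from the message are the CMake option name and the use of PEP 610 `direct_url.json`.

```python
import json
from importlib import metadata
from typing import Optional


def bundle_flag(bundle_python_deps: bool) -> str:
    """Always emit the CMake option, so False becomes an explicit OFF
    instead of silently falling back to the option's ON default."""
    value = "ON" if bundle_python_deps else "OFF"
    return f"-DHIPBLASLT_BUNDLE_PYTHON_DEPS={value}"


def is_editable_direct_url(direct_url_text: Optional[str]) -> bool:
    """Interpret the contents of a PEP 610 direct_url.json file.

    Editable installs record {"dir_info": {"editable": true}}; anything
    else (missing file, malformed JSON, other origin) counts as not editable.
    """
    if not direct_url_text:
        return False
    try:
        info = json.loads(direct_url_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(info, dict):
        return False
    dir_info = info.get("dir_info")
    return isinstance(dir_info, dict) and dir_info.get("editable") is True


def is_installed_editable(dist_name: str) -> bool:
    """Return True only if dist_name is installed and its metadata marks it editable."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return False  # not installed at all: nothing to refresh
    return is_editable_direct_url(dist.read_text("direct_url.json"))


if __name__ == "__main__":
    print(bundle_flag(False))
    print(is_installed_editable("rocisa"))
```

Separating the JSON interpretation from the metadata lookup keeps the "absent", "non-editable", and "editable" cases easy to test without installing anything.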

…y for MGX" (ROCm#8333)

Reverts ROCm#6086

Need to revert as the codegen test for fmha is failing due to including a std header:

[2026-06-11T22:36:03.673Z] In file included from /tmp/comgr-953928-0-473822/include/ck/host/device_fmha_fwd/fmha_fwd_wrapper.hpp:8:
[2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/cmath:49:
[2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_abs.h:38:
[2026-06-11T22:36:03.673Z] /usr/include/stdlib.h:32:10: fatal error: 'stddef.h' file not found
[2026-06-11T22:36:03.673Z]    32 | #include <stddef.h>
[2026-06-11T22:36:03.673Z]       |          ^~~~~~~~~~

The ck_tile headers were never prepped for hiprtc compilation.

…inx in the tensile-docs-dependencies group (ROCm#8364) Bumps the tensile-docs-dependencies group in /shared/tensile/docs/sphinx with 1 update: [rocm-docs-core](https://github.com/ROCm/rocm-docs-core). Updates `rocm-docs-core` from 1.33.1 to 1.35.0 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/ROCm/rocm-docs-core/releases">rocm-docs-core's releases</a>.</em></p> <blockquote> <h2>v1.35.0 (2026-06-10)</h2> <h3>Feat</h3> <ul> <li>recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li>Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li>Add llms-full.txt generation and LLM friendly guide</li> <li>Ignore .cline_storage and .vscode directories</li> </ul> <h3>Fix</h3> <ul> <li>add new repo sync pair</li> </ul> <p>[main b623885] bump: version 1.34.0 → 1.35.0 3 files changed, 17 insertions(+), 4 deletions(-)</p> <h2>v1.34.0 (2026-04-28)</h2> <h3>Feat</h3> <ul> <li>add hipFile</li> <li>Add CUID project</li> </ul> <h3>Fix</h3> <ul> <li>fix typo</li> </ul> <p>[main 98ca6ee] bump: version 1.33.1 → 1.34.0 3 files changed, 15 insertions(+), 4 deletions(-)</p> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md">rocm-docs-core's changelog</a>.</em></p> <blockquote> <h2>v1.35.0 (2026-06-10)</h2> <h3>Feat</h3> <ul> <li>recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li>Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li>Add llms-full.txt generation and LLM friendly guide</li> <li>Ignore .cline_storage and .vscode directories</li> </ul> <h3>Fix</h3> <ul> <li>add new repo sync pair</li> </ul> <h2>v1.34.0 (2026-04-28)</h2> <h3>Feat</h3> <ul> <li>add hipFile</li> <li>Add CUID project</li> </ul> <h3>Fix</h3> <ul> <li>fix typo</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a 
href="https://github.com/ROCm/rocm-docs-core/commit/b623885e7e4fe2f87bd2898d8ea0e2c4ded2eca1"><code>b623885</code></a> bump: version 1.34.0 → 1.35.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/d237fdefe1b60214be99e05e0be78e04b09684be"><code>d237fde</code></a> Merge pull request <a href="https://redirect.github.com/ROCm/rocm-docs-core/issues/1543">#1543</a> from ROCm/intersphinx-preview-version-pattern</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/a2eb21e60b3ff8fc22dfec3ea0d5261ea7af0444"><code>a2eb21e</code></a> Merge pull request <a href="https://redirect.github.com/ROCm/rocm-docs-core/issues/1531">#1531</a> from ROCm/llms_full_generation</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/1d09ac5a83f44f81abd9fba6fbac13deb1a57855"><code>1d09ac5</code></a> build: bump myst-nb from 1.3.0 to 1.4.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/92aa649c8df2450600a0ffb694b2005bc9e06c35"><code>92aa649</code></a> build: bump gitpython from 3.1.45 to 3.1.50</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/d06450c2a2ea0a2adb87f9d2c6284c0f5e7bb6ec"><code>d06450c</code></a> build: bump pyjwt from 2.10.1 to 2.13.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/bdb2eeb4e9d1c343cd6c2ede949a1d480ff87ffd"><code>bdb2eeb</code></a> feat: Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/45d3e9ac19b5d55542c9fde28246767f8cab1fd7"><code>45d3e9a</code></a> feat: Add llms-full.txt generation and LLM friendly guide</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/eabde61dfe27908776c13f08670cb3fa69929ec2"><code>eabde61</code></a> feat: recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/000057c2d597ae14e3997738e52b7ca99011b2a9"><code>000057c</code></a> build: bump idna from 3.11 to 3.17</li> <li>Additional commits viewable in 
<a href="https://github.com/ROCm/rocm-docs-core/compare/v1.33.1...v1.35.0">compare view</a></li> </ul> </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…phinx in the rocblas-docs-dependencies group (ROCm#8360) Bumps the rocblas-docs-dependencies group in /projects/rocblas/docs/sphinx with 1 update: [rocm-docs-core](https://github.com/ROCm/rocm-docs-core). Updates `rocm-docs-core` from 1.33.1 to 1.35.0 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/ROCm/rocm-docs-core/releases">rocm-docs-core's releases</a>.</em></p> <blockquote> <h2>v1.35.0 (2026-06-10)</h2> <h3>Feat</h3> <ul> <li>recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li>Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li>Add llms-full.txt generation and LLM friendly guide</li> <li>Ignore .cline_storage and .vscode directories</li> </ul> <h3>Fix</h3> <ul> <li>add new repo sync pair</li> </ul> <p>[main b623885] bump: version 1.34.0 → 1.35.0 3 files changed, 17 insertions(+), 4 deletions(-)</p> <h2>v1.34.0 (2026-04-28)</h2> <h3>Feat</h3> <ul> <li>add hipFile</li> <li>Add CUID project</li> </ul> <h3>Fix</h3> <ul> <li>fix typo</li> </ul> <p>[main 98ca6ee] bump: version 1.33.1 → 1.34.0 3 files changed, 15 insertions(+), 4 deletions(-)</p> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md">rocm-docs-core's changelog</a>.</em></p> <blockquote> <h2>v1.35.0 (2026-06-10)</h2> <h3>Feat</h3> <ul> <li>recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li>Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li>Add llms-full.txt generation and LLM friendly guide</li> <li>Ignore .cline_storage and .vscode directories</li> </ul> <h3>Fix</h3> <ul> <li>add new repo sync pair</li> </ul> <h2>v1.34.0 (2026-04-28)</h2> <h3>Feat</h3> <ul> <li>add hipFile</li> <li>Add CUID project</li> </ul> <h3>Fix</h3> <ul> <li>fix typo</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a 
href="https://github.com/ROCm/rocm-docs-core/commit/b623885e7e4fe2f87bd2898d8ea0e2c4ded2eca1"><code>b623885</code></a> bump: version 1.34.0 → 1.35.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/d237fdefe1b60214be99e05e0be78e04b09684be"><code>d237fde</code></a> Merge pull request <a href="https://redirect.github.com/ROCm/rocm-docs-core/issues/1543">#1543</a> from ROCm/intersphinx-preview-version-pattern</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/a2eb21e60b3ff8fc22dfec3ea0d5261ea7af0444"><code>a2eb21e</code></a> Merge pull request <a href="https://redirect.github.com/ROCm/rocm-docs-core/issues/1531">#1531</a> from ROCm/llms_full_generation</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/1d09ac5a83f44f81abd9fba6fbac13deb1a57855"><code>1d09ac5</code></a> build: bump myst-nb from 1.3.0 to 1.4.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/92aa649c8df2450600a0ffb694b2005bc9e06c35"><code>92aa649</code></a> build: bump gitpython from 3.1.45 to 3.1.50</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/d06450c2a2ea0a2adb87f9d2c6284c0f5e7bb6ec"><code>d06450c</code></a> build: bump pyjwt from 2.10.1 to 2.13.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/bdb2eeb4e9d1c343cd6c2ede949a1d480ff87ffd"><code>bdb2eeb</code></a> feat: Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/45d3e9ac19b5d55542c9fde28246767f8cab1fd7"><code>45d3e9a</code></a> feat: Add llms-full.txt generation and LLM friendly guide</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/eabde61dfe27908776c13f08670cb3fa69929ec2"><code>eabde61</code></a> feat: recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/000057c2d597ae14e3997738e52b7ca99011b2a9"><code>000057c</code></a> build: bump idna from 3.11 to 3.17</li> <li>Additional commits viewable in 
<a href="https://github.com/ROCm/rocm-docs-core/compare/v1.33.1...v1.35.0">compare view</a></li> </ul> </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…phinx in the hipblas-docs-dependencies group (ROCm#8358) Bumps the hipblas-docs-dependencies group in /projects/hipblas/docs/sphinx with 1 update: [rocm-docs-core](https://github.com/ROCm/rocm-docs-core). Updates `rocm-docs-core` from 1.33.1 to 1.35.0 <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/ROCm/rocm-docs-core/releases">rocm-docs-core's releases</a>.</em></p> <blockquote> <h2>v1.35.0 (2026-06-10)</h2> <h3>Feat</h3> <ul> <li>recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li>Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li>Add llms-full.txt generation and LLM friendly guide</li> <li>Ignore .cline_storage and .vscode directories</li> </ul> <h3>Fix</h3> <ul> <li>add new repo sync pair</li> </ul> <p>[main b623885] bump: version 1.34.0 → 1.35.0 3 files changed, 17 insertions(+), 4 deletions(-)</p> <h2>v1.34.0 (2026-04-28)</h2> <h3>Feat</h3> <ul> <li>add hipFile</li> <li>Add CUID project</li> </ul> <h3>Fix</h3> <ul> <li>fix typo</li> </ul> <p>[main 98ca6ee] bump: version 1.33.1 → 1.34.0 3 files changed, 15 insertions(+), 4 deletions(-)</p> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md">rocm-docs-core's changelog</a>.</em></p> <blockquote> <h2>v1.35.0 (2026-06-10)</h2> <h3>Feat</h3> <ul> <li>recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li>Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li>Add llms-full.txt generation and LLM friendly guide</li> <li>Ignore .cline_storage and .vscode directories</li> </ul> <h3>Fix</h3> <ul> <li>add new repo sync pair</li> </ul> <h2>v1.34.0 (2026-04-28)</h2> <h3>Feat</h3> <ul> <li>add hipFile</li> <li>Add CUID project</li> </ul> <h3>Fix</h3> <ul> <li>fix typo</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a 
href="https://github.com/ROCm/rocm-docs-core/commit/b623885e7e4fe2f87bd2898d8ea0e2c4ded2eca1"><code>b623885</code></a> bump: version 1.34.0 → 1.35.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/d237fdefe1b60214be99e05e0be78e04b09684be"><code>d237fde</code></a> Merge pull request <a href="https://redirect.github.com/ROCm/rocm-docs-core/issues/1543">#1543</a> from ROCm/intersphinx-preview-version-pattern</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/a2eb21e60b3ff8fc22dfec3ea0d5261ea7af0444"><code>a2eb21e</code></a> Merge pull request <a href="https://redirect.github.com/ROCm/rocm-docs-core/issues/1531">#1531</a> from ROCm/llms_full_generation</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/1d09ac5a83f44f81abd9fba6fbac13deb1a57855"><code>1d09ac5</code></a> build: bump myst-nb from 1.3.0 to 1.4.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/92aa649c8df2450600a0ffb694b2005bc9e06c35"><code>92aa649</code></a> build: bump gitpython from 3.1.45 to 3.1.50</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/d06450c2a2ea0a2adb87f9d2c6284c0f5e7bb6ec"><code>d06450c</code></a> build: bump pyjwt from 2.10.1 to 2.13.0</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/bdb2eeb4e9d1c343cd6c2ede949a1d480ff87ffd"><code>bdb2eeb</code></a> feat: Fix gate/emission mismatch and improve headings in llms-full.txt</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/45d3e9ac19b5d55542c9fde28246767f8cab1fd7"><code>45d3e9a</code></a> feat: Add llms-full.txt generation and LLM friendly guide</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/eabde61dfe27908776c13f08670cb3fa69929ec2"><code>eabde61</code></a> feat: recognize X.Y.Z-preview slugs as static intersphinx versions</li> <li><a href="https://github.com/ROCm/rocm-docs-core/commit/000057c2d597ae14e3997738e52b7ca99011b2a9"><code>000057c</code></a> build: bump idna from 3.11 to 3.17</li> <li>Additional commits viewable in 
<a href="https://github.com/ROCm/rocm-docs-core/compare/v1.33.1...v1.35.0">compare view</a></li> </ul> </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…Cm#8386) Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.5 to 6.5.6. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst">tornado's changelog</a>.</em></p> <blockquote> <h1>Release notes</h1> <p>.. toctree:: :maxdepth: 2</p> <p>releases/v6.5.7 releases/v6.5.6 releases/v6.5.5 releases/v6.5.4 releases/v6.5.3 releases/v6.5.2 releases/v6.5.1 releases/v6.5.0 releases/v6.4.2 releases/v6.4.1 releases/v6.4.0 releases/v6.3.3 releases/v6.3.2 releases/v6.3.1 releases/v6.3.0 releases/v6.2.0 releases/v6.1.0 releases/v6.0.4 releases/v6.0.3 releases/v6.0.2 releases/v6.0.1 releases/v6.0.0 releases/v5.1.1 releases/v5.1.0 releases/v5.0.2 releases/v5.0.1 releases/v5.0.0 releases/v4.5.3 releases/v4.5.2 releases/v4.5.1 releases/v4.5.0 releases/v4.4.3 releases/v4.4.2 releases/v4.4.1 releases/v4.4.0 releases/v4.3.0 releases/v4.2.1 releases/v4.2.0 releases/v4.1.0 releases/v4.0.2 releases/v4.0.1 releases/v4.0.0 releases/v3.2.2 releases/v3.2.1</p>  </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/tornadoweb/tornado/commit/aba2569f7ed7a6bdbef816658fb6b7182531b751"><code>aba2569</code></a> Merge pull request <a href="https://redirect.github.com/tornadoweb/tornado/issues/3626">#3626</a> from bdarnell/fixes-656</li> <li><a href="https://github.com/tornadoweb/tornado/commit/a24b260e0d22fd48acea1a2635526c1700e7ac09"><code>a24b260</code></a> httpclient_test: Accept an additional error message variant</li> <li><a href="https://github.com/tornadoweb/tornado/commit/a74240a70268fe5cb40a127951cb21549ab9ff24"><code>a74240a</code></a> Release notes and version bump for 6.5.6.</li> <li><a href="https://github.com/tornadoweb/tornado/commit/e8fc7edb238f1022e39f9d0b9d297fc7c21fb0a5"><code>e8fc7ed</code></a> simple_httpclient: Strip auth headers on cross-origin redirects</li> <li><a href="https://github.com/tornadoweb/tornado/commit/96dc88c2a05705287856b2cd6b4b4034f9a6aaac"><code>96dc88c</code></a> speedups: validate mask length</li> <li><a href="https://github.com/tornadoweb/tornado/commit/ff808b33adc52d89a549376a5e3628e92abbc8ff"><code>ff808b3</code></a> http1connection: Enforce max_body_size in _GzipMessageDelegate</li> <li><a href="https://github.com/tornadoweb/tornado/commit/ede4e37f93c1edbc0bf749e9a57c9db2501cd54b"><code>ede4e37</code></a> auth: Correctly parse check_authentication response</li> <li><a href="https://github.com/tornadoweb/tornado/commit/1c178bef88bbd29907eb94a2a649a4a6675681de"><code>1c178be</code></a> Remove obsolete curl force_timeout workaround</li> <li><a href="https://github.com/tornadoweb/tornado/commit/c99d55bb6cc0c9da2c6696545ed4ee1d20b7fcf0"><code>c99d55b</code></a> Replace deprecated pycurl IOCTLFUNCTION callback with SEEKFUNCTION</li> <li><a href="https://github.com/tornadoweb/tornado/commit/27614316ef8ad125fe18725cf96e384560ba0e14"><code>2761431</code></a> Merge pull request <a 
href="https://redirect.github.com/tornadoweb/tornado/issues/3587">#3587</a> from bdarnell/fix-link</li> <li>Additional commits viewable in <a href="https://github.com/tornadoweb/tornado/compare/v6.5.5...v6.5.6">compare view</a></li> </ul> </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…#8385) Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.5 to 6.5.6. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst">tornado's changelog</a>.</em></p> <blockquote> <h1>Release notes</h1> <p>.. toctree:: :maxdepth: 2</p> <p>releases/v6.5.7 releases/v6.5.6 releases/v6.5.5 releases/v6.5.4 releases/v6.5.3 releases/v6.5.2 releases/v6.5.1 releases/v6.5.0 releases/v6.4.2 releases/v6.4.1 releases/v6.4.0 releases/v6.3.3 releases/v6.3.2 releases/v6.3.1 releases/v6.3.0 releases/v6.2.0 releases/v6.1.0 releases/v6.0.4 releases/v6.0.3 releases/v6.0.2 releases/v6.0.1 releases/v6.0.0 releases/v5.1.1 releases/v5.1.0 releases/v5.0.2 releases/v5.0.1 releases/v5.0.0 releases/v4.5.3 releases/v4.5.2 releases/v4.5.1 releases/v4.5.0 releases/v4.4.3 releases/v4.4.2 releases/v4.4.1 releases/v4.4.0 releases/v4.3.0 releases/v4.2.1 releases/v4.2.0 releases/v4.1.0 releases/v4.0.2 releases/v4.0.1 releases/v4.0.0 releases/v3.2.2 releases/v3.2.1</p>  </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/tornadoweb/tornado/commit/aba2569f7ed7a6bdbef816658fb6b7182531b751"><code>aba2569</code></a> Merge pull request <a href="https://redirect.github.com/tornadoweb/tornado/issues/3626">#3626</a> from bdarnell/fixes-656</li> <li><a href="https://github.com/tornadoweb/tornado/commit/a24b260e0d22fd48acea1a2635526c1700e7ac09"><code>a24b260</code></a> httpclient_test: Accept an additional error message variant</li> <li><a href="https://github.com/tornadoweb/tornado/commit/a74240a70268fe5cb40a127951cb21549ab9ff24"><code>a74240a</code></a> Release notes and version bump for 6.5.6.</li> <li><a href="https://github.com/tornadoweb/tornado/commit/e8fc7edb238f1022e39f9d0b9d297fc7c21fb0a5"><code>e8fc7ed</code></a> simple_httpclient: Strip auth headers on cross-origin redirects</li> <li><a href="https://github.com/tornadoweb/tornado/commit/96dc88c2a05705287856b2cd6b4b4034f9a6aaac"><code>96dc88c</code></a> speedups: validate mask length</li> <li><a href="https://github.com/tornadoweb/tornado/commit/ff808b33adc52d89a549376a5e3628e92abbc8ff"><code>ff808b3</code></a> http1connection: Enforce max_body_size in _GzipMessageDelegate</li> <li><a href="https://github.com/tornadoweb/tornado/commit/ede4e37f93c1edbc0bf749e9a57c9db2501cd54b"><code>ede4e37</code></a> auth: Correctly parse check_authentication response</li> <li><a href="https://github.com/tornadoweb/tornado/commit/1c178bef88bbd29907eb94a2a649a4a6675681de"><code>1c178be</code></a> Remove obsolete curl force_timeout workaround</li> <li><a href="https://github.com/tornadoweb/tornado/commit/c99d55bb6cc0c9da2c6696545ed4ee1d20b7fcf0"><code>c99d55b</code></a> Replace deprecated pycurl IOCTLFUNCTION callback with SEEKFUNCTION</li> <li><a href="https://github.com/tornadoweb/tornado/commit/27614316ef8ad125fe18725cf96e384560ba0e14"><code>2761431</code></a> Merge pull request <a 
href="https://redirect.github.com/tornadoweb/tornado/issues/3587">#3587</a> from bdarnell/fix-link</li> <li>Additional commits viewable in <a href="https://github.com/tornadoweb/tornado/compare/v6.5.5...v6.5.6">compare view</a></li> </ul> </details>
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

## Motivation When we introduced the smart test filter, which builds and runs only the tests affected by a PR's changes, we disabled the client examples because they require a full CK build. The hiprtc tests were grouped with the client examples, so they were disabled as well. As a result, a few PRs that broke hiprtc compilation were merged undetected. Restoring the hiprtc tests for all PRs should close this gap. ## Technical Details  ## Test Plan  ## Test Result  ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

ROCm#8006) Commit ROCm#7644 ("Exclude F6 from data initialization") renamed TensileLite::Client::isMXProblem to isMXProblemExceptF6 across the header and client sources, but DataInit_test.cpp was not updated, breaking the test build with "use of undeclared identifier 'isMXProblem'". Update the using-declaration, all call sites, and the Section 2 contract comment to the new name. isMXProblemExceptF6 (not the compiler-suggested isMXFP4Problem) preserves the original test semantics, since the suite expects FP8/BFloat8/mixed MX problems to be true as well as FP4.

…OCm#8390) Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.5 to 6.5.6. <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/tornadoweb/tornado/commit/aba2569f7ed7a6bdbef816658fb6b7182531b751"><code>aba2569</code></a> Merge pull request <a href="https://redirect.github.com/tornadoweb/tornado/issues/3626">#3626</a> from bdarnell/fixes-656</li> <li><a href="https://github.com/tornadoweb/tornado/commit/a24b260e0d22fd48acea1a2635526c1700e7ac09"><code>a24b260</code></a> httpclient_test: Accept an additional error message variant</li> <li><a href="https://github.com/tornadoweb/tornado/commit/a74240a70268fe5cb40a127951cb21549ab9ff24"><code>a74240a</code></a> Release notes and version bump for 6.5.6.</li> <li><a href="https://github.com/tornadoweb/tornado/commit/e8fc7edb238f1022e39f9d0b9d297fc7c21fb0a5"><code>e8fc7ed</code></a> simple_httpclient: Strip auth headers on cross-origin redirects</li> <li><a href="https://github.com/tornadoweb/tornado/commit/96dc88c2a05705287856b2cd6b4b4034f9a6aaac"><code>96dc88c</code></a> speedups: validate mask length</li> <li><a href="https://github.com/tornadoweb/tornado/commit/ff808b33adc52d89a549376a5e3628e92abbc8ff"><code>ff808b3</code></a> http1connection: Enforce max_body_size in _GzipMessageDelegate</li> <li><a href="https://github.com/tornadoweb/tornado/commit/ede4e37f93c1edbc0bf749e9a57c9db2501cd54b"><code>ede4e37</code></a> auth: Correctly parse check_authentication response</li> <li><a href="https://github.com/tornadoweb/tornado/commit/1c178bef88bbd29907eb94a2a649a4a6675681de"><code>1c178be</code></a> Remove obsolete curl force_timeout workaround</li> <li><a href="https://github.com/tornadoweb/tornado/commit/c99d55bb6cc0c9da2c6696545ed4ee1d20b7fcf0"><code>c99d55b</code></a> Replace deprecated pycurl IOCTLFUNCTION callback with SEEKFUNCTION</li> <li><a href="https://github.com/tornadoweb/tornado/commit/27614316ef8ad125fe18725cf96e384560ba0e14"><code>2761431</code></a> Merge pull request <a 
href="https://redirect.github.com/tornadoweb/tornado/issues/3587">#3587</a> from bdarnell/fix-link</li> <li>Additional commits viewable in <a href="https://github.com/tornadoweb/tornado/compare/v6.5.5...v6.5.6">compare view</a></li> </ul> </details>
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…m#8378) ## Motivation The CK Groovy library is growing and will be reorganized into a self-describing `groovy/` folder rather than living under `src/` and `vars/`. This PR creates that folder pre-emptively and adds it to the TheRock CI skip-list so that future Groovy additions do not unnecessarily trigger TheRock builds. ## Technical Details - Added `projects/composablekernel/groovy/` with a `.gitkeep` to establish the directory in the repo. - Added `"projects/composablekernel/groovy/*"` to `SKIPPABLE_PATH_PATTERNS` in `.github/scripts/therock_configure_ci.py` alongside the existing `vars/*` entry, ensuring changes confined to Groovy pipeline code are recognized as non-therock-relevant and skip the TheRock CI pipeline. ## Test Plan No code logic was changed. Verified that `therock_configure_ci.py` pattern list is consistent with the existing `vars/*` skip entry and that the new pattern follows the same glob convention. ## Test Result N/A — directory scaffolding and CI filter only; no functional code affected. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
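
The skip-list behavior described above can be sketched as follows. This is only an illustration, assuming glob matching via Python's `fnmatch`; the helpers `is_skippable` and `should_run_ci` and the full path of the `vars/*` entry are hypothetical, and the real `therock_configure_ci.py` may match paths differently.

```python
from fnmatch import fnmatch

# Hypothetical excerpt: the real list holds more entries, and the exact
# spelling of the existing vars/* entry is an assumption.
SKIPPABLE_PATH_PATTERNS = [
    "projects/composablekernel/vars/*",
    "projects/composablekernel/groovy/*",
]


def is_skippable(path):
    """Return True if the path matches any skip pattern."""
    return any(fnmatch(path, pattern) for pattern in SKIPPABLE_PATH_PATTERNS)


def should_run_ci(changed_paths):
    """CI is skipped only when every changed path is skippable."""
    return not all(is_skippable(path) for path in changed_paths)
```

A change set confined to `groovy/` is skipped, while a single non-matching path is enough to trigger the pipeline.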

## Motivation This pull request makes a small update to the `logging_mode_gtest.yaml` test configuration. The change narrows the GPU architecture pattern from `'12??'` to `'120?'`, making the test more specific in targeting GPU architectures. ROCM-25049
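
The effect of the narrower pattern can be checked with a small sketch. It assumes the YAML pattern uses glob-style wildcards in which `?` matches exactly one character; the architecture strings are hypothetical sample values, not taken from the test configuration.

```python
from fnmatch import fnmatchcase

# Hypothetical architecture identifiers used only for illustration.
archs = ["1100", "1200", "1201", "1250"]

# Old pattern: any four-character identifier starting with "12".
matched_old = [a for a in archs if fnmatchcase(a, "12??")]

# New pattern: only identifiers starting with "120".
matched_new = [a for a in archs if fnmatchcase(a, "120?")]
```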

…new tests from test filter standardization (ROCm#8188) ## Motivation  New tests were added to hipcub, rocprim, and rocthrust without setting their RESOURCE_GROUPS properties. This can cause test failures when ctest runs in parallel, e.g., `--parallel 8 --resource-spec-file resources.json`. This PR adds the missing RESOURCE_GROUPS properties. ## Technical Details  Without RESOURCE_GROUPS properties, ctest may launch all of these new tests at once, so that they compete for GPU resources. This often leads to failed, crashed, or aborted tests and to CI failures. ## Test Plan  Run `ctest --parallel 8 --resource-spec-file resources.json`; all tests should pass. ## Test Result  All tests pass in local testing. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Song <hsong@ctr2-alola-ctrl-01.amd.com>
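
For context, a resource specification file of the kind passed to `--resource-spec-file` could be generated as below. This is a sketch under assumptions: the resource type name `gpus`, the GPU count, and the helper `make_resource_spec` are illustrative and not taken from the PR; only the overall JSON layout follows CTest's documented resource-spec format.

```python
import json


def make_resource_spec(num_gpus, slots_per_gpu=1):
    """Build a CTest resource spec with one entry per GPU."""
    return {
        "version": {"major": 1, "minor": 0},
        "local": [
            {
                "gpus": [
                    {"id": str(index), "slots": slots_per_gpu}
                    for index in range(num_gpus)
                ]
            }
        ],
    }


spec = make_resource_spec(2)
spec_text = json.dumps(spec, indent=2)  # contents for resources.json
```

A test that declares `RESOURCE_GROUPS` (for example `"gpus:1"`, again assuming that resource name) is then only scheduled when a matching slot is free, which is what keeps parallel tests from oversubscribing the GPUs.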

## Motivation  Add FFM test filters. ## Technical Details  This leverages the test filter standardization work. ## Test Plan  Run `ctest --print-labels` to see the newly added test categories/labels, then `ctest -L ffm-quick` to run the tests in the new category. ## Test Result  Tests in the specified category are executed. Tests outside the specified category are not executed; they report something like "Running 0 tests from 0 test suites". ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Song <hsong@ctr2-alola-ctrl-01.amd.com>

## Motivation Some rocSPARSE pre_checkin tests significantly exceeded our time constraints. This PR reduces the pre_checkin test times for the following routines: bsrsm, bsrgeam, bsrilu0, spsm_csr, spsm_coo, spmm_csr, and gtsv.

## Summary Several CK source files carry Windows **CRLF** line endings (a trailing carriage return on each line), introduced by editors configured for Windows endings or copy/paste from Windows tooling. These are purely cosmetic but they pollute diffs (whole-file churn the first time someone makes an LF edit), confuse `clang-format`, and are inconsistent with the LF-only convention used across the rest of the tree. This PR (a) normalizes every existing CRLF file (6 files) to LF and (b) adds a pre-checkin gate so new CRLF leaks are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate as the adjacent `ASCII Only Check` stage: ``` *.h *.hpp *.cpp *.h.in *.hpp.in *.cpp.in *.inc *.cl ``` (excluding `*/build/*` and `*/include/rapidjson/*`). The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The **authoritative gate is therefore the new Jenkins stage** in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `[ck] Normalize CRLF line endings to LF in C/C++ sources` Mechanical line-ending cleanup across 6 files. No content change: every edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol` reporting an empty diff. 2. `[ck] Enforce LF-only line endings in C/C++ sources` - New `projects/composablekernel/script/check_no_crlf.sh` (modeled on `check_ascii_only.sh`). - New `crlf-checker` entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). - New `CRLF Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the adjacent `ASCII Only Check` stage. Always-on, no `RUN_CPPCHECK` gate. 
The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF violations; commit 2 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf 'int main() {}\r\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings:
  1:int main() {}<CR>
Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp'
$ echo $?
1
```

Full repo scan after the cleanup commit:

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
    | xargs -0 -P 8 -n 64 script/check_no_crlf.sh
$ echo $?
0
```

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage runs green over the full predicate and the existing `ASCII Only Check` / `Clang Format` stages are unaffected.
- [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly after installing CK pre-commit hooks.
- [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error.

🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
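
The check itself is simple enough to sketch. The snippet below is not the PR's `check_no_crlf.sh` (which is a shell script); it is a minimal Python illustration of the same idea, and the function names `find_crlf_lines` and `normalize_to_lf` are hypothetical.

```python
def find_crlf_lines(data: bytes):
    """Return the 1-based numbers of lines that end in CRLF."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")
    ]


def normalize_to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")
```

Working on bytes rather than text avoids Python's universal-newline translation, which would hide the carriage returns the check is looking for.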

## Motivation Look-back scan uses flags to communicate the state of blocks. Threads actively poll until a block's flag is in either a partial or a complete state. In the current implementation, however, this is modeled as a per-thread process: each thread keeps polling global memory only until its own flag is valid, yet the wavefront as a whole leaves the loop only once all threads have found a valid state. This is potentially inefficient. A block that presents itself as partial may still be promoted to complete while the wavefront is waiting, but the current implementation never sees that promotion. And since the entire wavefront keeps working as long as any invalid state exists, it may unnecessarily look back onto other previous states. ## Technical Details Flag retrieval is now wave-cooperative: all lanes poll together, and the active polling loop exits only once every lane has found a valid flag (checked via the `warp_any()` cross-lane operator). In addition, the `get()` utility retrieved the flag a second time, so flag retrieval is now separated from the other operations in that function. ## Test Result <img width="1200" height="540" alt="device_scan_942" src="https://github.com/user-attachments/assets/6bd4695a-96b3-4a67-aa7d-ee8b03034187" /> <img width="1200" height="420" alt="device_scan_deterministic_942" src="https://github.com/user-attachments/assets/7af78c61-24d6-4463-a6b1-5d1e291b3fab" /> [lookback_scan.zip](https://github.com/user-attachments/files/26824239/lookback_scan.zip) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Sander Bos <sander@streamhpc.com>
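
The difference between the two polling schemes can be modeled on the CPU. The sketch below is a simplified reading of the description, not the rocPRIM implementation: the lane and flag model, the `read_flag(lane, step)` callback, and both function names are assumptions, and Python's `any` stands in for the `warp_any()` cross-lane operator.

```python
INVALID, PARTIAL, COMPLETE = 0, 1, 2


def poll_per_lane(read_flag, lanes):
    """Each lane stops re-reading once its own flag is valid."""
    seen = {lane: INVALID for lane in lanes}
    step = 0
    while any(flag == INVALID for flag in seen.values()):
        for lane in lanes:
            if seen[lane] == INVALID:
                seen[lane] = read_flag(lane, step)
        step += 1
    return seen


def poll_cooperative(read_flag, lanes):
    """All lanes re-read every iteration until no lane sees INVALID."""
    step = 0
    while True:
        seen = {lane: read_flag(lane, step) for lane in lanes}
        if not any(flag == INVALID for flag in seen.values()):
            return seen
        step += 1


def read_flag(lane, step):
    """Toy timeline: lane 0 is PARTIAL first, lane 1 is INVALID first."""
    if step >= 1:
        return COMPLETE
    return PARTIAL if lane == 0 else INVALID


per_lane = poll_per_lane(read_flag, [0, 1])
cooperative = poll_cooperative(read_flag, [0, 1])
```

In this toy timeline the per-lane loop leaves lane 0 with its stale PARTIAL reading, while the cooperative loop observes the promotion to COMPLETE on the iteration it had to run anyway.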

## Motivation Ragged tensors are tensors with variable per-batch dimensions derived from an associated ragged offset tensor. This RFC outlines a plan for integrating them into the hipdnn frontend and flatbuffers, and presents a design for utility classes that let users allocate ragged tensors with proper indexing schemes. It also details how these classes fit into the CPU reference validation integration testing framework. The immediate need for this work is expanding support for the SDPA forward CPU references so that they can match the functionality of AITER's SDPA forward kernels. With that in mind, this RFC uses those operations as an example of how this integration fits into the process. ## Technical Details Technical details are included in the RFC. ## Test Plan Docs only, no testing needed ## Test Result Docs only, no testing needed ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ## Risk level None, this PR only adds documentation. **Associated ticket**: ALMIOPEN-1958
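
As a rough illustration of the indexing idea, the sketch below assumes a packed one-dimensional buffer and an offset list of length `batch + 1` counted in elements. The RFC may define offsets and layouts differently; `ragged_slice` and `batch_lengths` are hypothetical names.

```python
def ragged_slice(data, offsets, batch):
    """Return the elements that belong to one batch entry."""
    start, end = offsets[batch], offsets[batch + 1]
    return data[start:end]


def batch_lengths(offsets):
    """Derive each batch entry's length from consecutive offsets."""
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]


# Three batch entries of lengths 3, 2, and 4 packed back to back.
offsets = [0, 3, 5, 9]
data = list(range(9))
second_entry = ragged_slice(data, offsets, 1)
lengths = batch_lengths(offsets)
```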

…m#8220) ## Motivation On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was ~18% behind classic CK, with the gap concentrated in the K=2376 3x3 detection-head family where bwd_data spends most of its time. The WAVELET GEMM pipeline already gives uplift for forward and backward-weight conv; this ports it to backward-data and consolidates the now-shared machinery across all three directions. ## Technical Details - Backward-data wavelet support in the tile kernel: launch extra load waves when the pipeline exposes `LaunchBlockSize`, and split the epilogue into math waves (run the CShuffle epilogue) and load waves (`RunBarrierStub`). - Register 7 WAVELET instances (fp16 and bf16), tuned for backward-data's tall-skinny GEMM rather than the forward tile shapes: a big-M `256/128/64` workhorse, a `VecA=4` variant for the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped (depthwise-style) shapes. - Implement the native backward-data instance parser in `generate_instances.py`. - Deduplicate the wavelet machinery shared by forward, backward-data, and backward-weight: `GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in `grouped_convolution_utils.hpp`; the three native instance parsers collapse to one parameterized parser. The three kernels now call the shared helpers. ## Test Plan - Rebuild the full profiler instance pools for all three directions (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across every instantiation. - Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`, `test_grouped_convnd_bwd_data_tile`, `test_grouped_convnd_bwd_weight_tile`. - Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness spot-checked with GPU-reference verification on the new big-M and NumGroupsToMerge instances. ## Test Result - GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6. 
- Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 -> 1.11, i.e. the tile path goes from ~12% slower than classic to ~8% faster. The largest single backward-data shape (256x100x100->2376) moves from 11% slower than classic to 12.5% faster. - The dedup refactor preserves behavior (net -174 lines across the kernels/generator), confirmed by the full rebuild and the GTests above. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

## Motivation Dependency parser (dapper) requires the ninja build system. ## Technical Details This changes all CI builds over to ninja. Risk is low since it is very nearly a drop-in replacement for `make`. Build times are typically faster than `make`. ## Test Plan MICI. ## Test Result <TBD, should pass> ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: SreecharanGundaboluAMD <sgundabo@amd.com>

…m expected (ROCm#8161)

## Motivation Fix a parameter type mismatch in TensileLite StreamK metadata handling so validation and serialization stay consistent with `validParameters`, eliminating avoidable YAML parameter type mismatch warnings in StreamK runs. ## Technical Details - Updated `tensilelite/Tensile/SolutionStructs/Solution.py` so `DirectToLdsMetadata` assignments use integer `0`/`1` instead of boolean `False`/`True`. - This aligns Solution state updates with the existing `validParameters` definition (`DirectToLdsMetadata: [0, 1]`) and avoids bool/int drift during Python/msgpack serialization. - Updated `tensilelite/Tensile/Tests/common/streamk/sk_dynamic.yaml` typing to match expected schemas for remaining mismatch fields (`MIArchVgpr: [False]`, `GlobalReadPerMfma: [1.0]`; commit `b7aeb1b83c`). - `sk_hybrid.yaml` was not validated on this branch because `StreamK=5` is unsupported on `develop`. ## Test Plan - Run unit validation test focused on parameter typing: - `tox -e unit -- Tensile/Tests/unit/test_validateParameterTypes.py -v` - Run StreamK dynamic client flow with `sk_dynamic.yaml` and inspect logs for YAML parameter type mismatch warnings. ## Test Result - `tox -e unit -- Tensile/Tests/unit/test_validateParameterTypes.py -v` -> `60 passed`. - `sk_dynamic.yaml` client run completed with exit code `0`. - No YAML parameter type mismatch warnings for `DirectToLdsMetadata`, `GlobalReadPerMfma`, or `MIArchVgpr` after this fix. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Cursor <cursoragent@cursor.com>
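
Why a boolean slips past a value check but still counts as a type mismatch can be shown in a few lines. This is a sketch, not Tensile's validator: `type_mismatches` is a hypothetical helper, and only the `DirectToLdsMetadata: [0, 1]` entry comes from the description above.

```python
def type_mismatches(params, valid_parameters):
    """Return parameter names whose value type differs from the allowed values' types."""
    mismatched = []
    for name, value in params.items():
        allowed_types = {type(v) for v in valid_parameters[name]}
        if type(value) not in allowed_types:
            mismatched.append(name)
    return mismatched


valid_parameters = {"DirectToLdsMetadata": [0, 1]}

# True == 1 in Python, so a plain membership test accepts the boolean...
passes_membership = True in valid_parameters["DirectToLdsMetadata"]

# ...while a strict type comparison flags it.
with_bool = type_mismatches({"DirectToLdsMetadata": True}, valid_parameters)
with_int = type_mismatches({"DirectToLdsMetadata": 1}, valid_parameters)
```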

## Motivation Port the `ConvOclDirectFwd` direct convolution solver from OpenCL to HIP, as part of the ongoing effort to remove the OpenCL backend dependency from MIOpen. ## Technical Details Add a new `ConvHipDirectFwd` solver and `MIOpenConvDirUniHip.cpp` HIP kernel that replaces the OpenCL-based `ConvOclDirectFwd` solver and `MIOpenConvDirUni.cl` kernel. The old solver and kernel are removed. Databases are updated by just replacing the solver name without updating the measurements because we need to run the tuning and DB updating round after all OCL-to-HIP conversions are completed. **New files:** - `src/kernels/MIOpenConvDirUniHip.cpp` — HIP kernel ported from the OCL `MIOpenConvDirUni.cl` kernel - `src/solver/conv/conv_hip_dir2Dfwd.cpp` — HIP solver ported from `conv_ocl_dir2Dfwd.cpp` - `src/solver/conv/conv_hip_dir2Dfwd_exhaustive_search.cpp` — exhaustive search ported from `conv_ocl_dir2Dfwd_exhaustive_search.cpp` **Removed files:** - `src/kernels/MIOpenConvDirUni.cl` — replaced by `MIOpenConvDirUniHip.cpp` - `src/solver/conv/conv_ocl_dir2Dfwd.cpp` — replaced by `conv_hip_dir2Dfwd.cpp` - `src/solver/conv/conv_ocl_dir2Dfwd_exhaustive_search.cpp` — replaced by `conv_hip_dir2Dfwd_exhaustive_search.cpp` **Modified files:** - `src/include/miopen/conv/solvers.hpp` — rename `ConvOclDirectFwd` → `ConvHipDirectFwd` and `ConvOclDirectFwdLegacyExhaustiveSearch` → `ConvHipDirectFwdLegacyExhaustiveSearch`; add backward-compat aliases for the fused solver - `src/solver.cpp` — register `ConvHipDirectFwd` with a new ID at end; replace old `ConvOclDirectFwd` registration with `++id` placeholder - `src/mlo_dir_conv.cpp` — replace `ConvOclDirectFwd` with `ConvHipDirectFwd` in the direct solver container - `src/fin/fin_interface.cpp` — update legacy ID 11 mapping to use `ConvHipDirectFwd` - `src/CMakeLists.txt` — add new source files, remove old ones - `docs/how-to/debug-log.rst`, `docs/reference/env_variables.rst` — update env var name `MIOPEN_DEBUG_CONV_DIRECT_OCL_FWD` → 
`MIOPEN_DEBUG_CONV_DIRECT_HIP_FWD`

**Tests:**
- `test/gtest/unit_conv_solver_ConvHipDirectFwd.cpp`: renamed from `unit_conv_solver_ConvOclDirectFwd.cpp`, updated solver references
- `test/gtest/unit_FinInterface.cpp`: updated solver name and ID

## Test Plan

- Ran unit tests verifying solver applicability and correctness (`unit_conv_solver_ConvHipDirectFwd`)
- OCL-vs-HIP performance comparison benchmarks across ~14k shapes from FDB on gfx90a (MI-210) covering FP32, FP16, and BF16: **results are still being collected**

## Test Result

Benchmark results for gfx90a. The OCL-to-HIP conversion gives a noticeable performance gain for FP16.

### FP16

| Ratio (OCL/HIP) | Count | % |
|------------------|------:|--:|
| < 0.5 | 0 | 0.0% |
| 0.5 - 0.7 | 36 | 0.6% |
| 0.7 - 0.9 | 189 | 3.3% |
| 0.9 - 1.0 | 725 | 12.8% |
| 1.0 - 1.1 | 707 | 12.5% |
| 1.1 - 1.3 | 991 | 17.6% |
| 1.3 - 1.5 | 835 | 14.8% |
| 1.5 - 2.0 | 559 | 9.9% |
| 2.0 - 3.0 | 967 | 17.1% |
| > 3.0 | 636 | 11.3% |

| Metric | Value |
|--------|-------|
| Total shapes | 5645 |
| Mean ratio | 1.6784 |
| Min ratio | 0.5051 |
| Max ratio | 3.3972 |
| HIP regression (< 0.9) | 225 (4.0%) |
| Parity (0.9 - 1.1) | 1432 (25.4%) |
| HIP faster (> 1.1) | 3988 (70.6%) |
| Shapes with nonzero max_abs_err | 0 (0.0%) |

### BF16 - to be added

| Ratio (OCL/HIP) | Count | % |
|------------------|------:|--:|
| < 0.5 | 185 | 4.7% |
| 0.5 - 0.7 | 65 | 1.6% |
| 0.7 - 0.9 | 686 | 17.3% |
| 0.9 - 1.0 | 1242 | 31.3% |
| 1.0 - 1.1 | 832 | 21.0% |
| 1.1 - 1.3 | 487 | 12.3% |
| 1.3 - 1.5 | 345 | 8.7% |
| 1.5 - 2.0 | 78 | 2.0% |
| 2.0 - 3.0 | 23 | 0.6% |
| > 3.0 | 24 | 0.6% |

| Metric | Value |
|--------|-------|
| Total shapes | 3967 |
| Mean ratio | 1.0279 |
| Min ratio | 0.2912 |
| Max ratio | 7.8593 |
| HIP regression (< 0.9) | 936 (23.6%) |
| Parity (0.9 - 1.1) | 2076 (52.3%) |
| HIP faster (> 1.1) | 955 (24.1%) |
| Shapes with nonzero max_abs_err | 0 (0.0%) |

### FP32 - to be added

## Submission Checklist

- [x] Look over the contributing
guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: BradPepersAMD <Brad.Pepers@amd.com> Co-authored-by: Daming Feng <dmfeng8898@gmail.com>
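
The `++id` placeholder mentioned for `src/solver.cpp` keeps the numeric IDs of later solvers stable after a registration is removed. The Python sketch below models that idea only; the real registry is C++, and `build_registry`, `SolverA`, and `SolverB` are hypothetical stand-ins.

```python
def build_registry(entries):
    """Assign sequential IDs in registration order; None reserves an ID."""
    registry = {}
    next_id = 0
    for name in entries:
        next_id += 1  # every slot consumes an ID, even a placeholder
        if name is not None:
            registry[name] = next_id
    return registry


before = build_registry(["SolverA", "ConvOclDirectFwd", "SolverB"])

# The removed solver's slot becomes a placeholder; the new solver is appended.
after = build_registry(["SolverA", None, "SolverB", "ConvHipDirectFwd"])
```

Because the old slot is reserved rather than deleted, `SolverB` keeps its ID and the new solver receives a fresh one at the end.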

## Summary This PR relaxes BatchNorm frontend validation to allow broadcastable shapes for scale and bias tensors, addressing issue ROCm#6409. Previously, BatchNorm required scale and bias to strictly follow the shape `{1, C, 1, 1}`, enforced via `validateChannelOnlyTensorShape`. This restriction is inconsistent with other normalization operators such as LayerNorm and the recently updated RMSNorm (PR ROCm#6359), both of which allow more flexible, broadcastable shapes. This change aligns BatchNorm with the same design philosophy, making the frontend more permissive while leaving backend-specific constraints to the plugin layer. --- ## Changes ### Validation Updates - Removed strict `{1, C, 1, 1}` requirement for scale and bias tensors - Replaced channel-only validation with: - Minimum dimension validation (`>= 1`) - Shape consistency check between scale and bias - Now supports broadcastable shapes such as: - `{C}` - `{1, C}` - `{1, C, 1, 1}` (previously required format) ### Preserved Constraints - Scale and bias must still have **matching shapes** - Mean and inverse variance tensors remain **strictly channel-shaped** - Input/output tensor shape consistency is unchanged - Spatial validation (`N * spatial > 1`) remains intact - Running statistics validation logic is unchanged --- ## Tests - Added tests to cover broadcastable scale/bias shapes: - Rank-1 (`{C}`) - Rank-2 (`{1, C}`) - Ensured existing tests reflect relaxed frontend validation - Removed outdated test enforcing strict channel-dimension matching - Verified all edge cases: - Shape mismatches - Missing attributes - Spatial constraints - Running statistics - 4D and 5D inputs All tests pass locally (22/22) --- ## Design Considerations - Frontend validation is intentionally permissive: - Accepts broadcastable shapes - Does **not** enforce backend-specific constraints - Backend providers (e.g. 
MIOpen) remain responsible for: - Validating supported tensor formats - Rejecting unsupported broadcast cases This separation ensures: - Consistency across normalization operators - Flexibility for future backend improvements --- ## Notes - MIOpen currently supports only `{1, C, 1, 1}` scale/bias tensors - Broader shape support will require future updates in the miopen-provider and MIOpen itself --------- Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>
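
The relaxed scale/bias rules listed under "Validation Updates" can be summarized in a short sketch. It is written in Python for brevity and is not the hipDNN frontend code; `validate_scale_bias` and its error strings are hypothetical.

```python
def validate_scale_bias(scale_shape, bias_shape):
    """Return a list of validation errors; an empty list means the shapes are accepted."""
    errors = []
    if len(scale_shape) < 1:
        errors.append("scale must have at least one dimension")
    if len(bias_shape) < 1:
        errors.append("bias must have at least one dimension")
    if tuple(scale_shape) != tuple(bias_shape):
        errors.append("scale and bias shapes must match")
    return errors
```

Under these rules `{C}`, `{1, C}`, and `{1, C, 1, 1}` are all accepted as long as scale and bias agree; whether a backend can execute a given shape is left to the provider.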

…OCm#7754)" (ROCm#8356) This reverts commit 4807f77. ## Motivation The original changes require a refactor before they can be reintroduced, together with matching unit tests. ROCM-25647

…OCm#8262) ## Motivation Add HIP graph capture support for FMHA backward operations. The original implementation only supported normal execution mode and would cause use-after-free crashes when used with graph capture replay. When FMHA backward is captured into a HIP graph: - First replay: host callback executes and deletes the closure (as designed for normal mode) - Subsequent replays: use-after-free crash because the closure was already freed This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work correctly in both normal execution and graph capture modes.
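
The failure mode can be modeled without a GPU. The sketch below only illustrates the replay problem described above; it says nothing about how the PR fixes it. `OneShotClosure` and `replay` are hypothetical, and Python raises a clean error where the C++ code would touch freed memory.

```python
class OneShotClosure:
    """Host callback that frees its state after the first invocation."""

    def __init__(self, action):
        self._action = action

    def __call__(self):
        if self._action is None:
            # In C++ this would be a use-after-free.
            raise RuntimeError("closure used after it was freed")
        action, self._action = self._action, None
        action()


def replay(graph, times):
    """Invoke every captured node `times` times, like replaying a graph."""
    for _ in range(times):
        for node in graph:
            node()


log = []
graph = [OneShotClosure(lambda: log.append("workspace prepared"))]

replay(graph, 1)  # first replay works
try:
    replay(graph, 1)  # second replay reaches the freed closure
    second_replay_failed = False
except RuntimeError:
    second_replay_failed = True
```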

…neLibrary and TensileUpdateLibrary (ROCm#7703) ## Motivation Continuation of the TensileLite Python unit-test coverage uplift: this PR adds pytest suites for four library-management modules (`TensileLogic/Run`, `TensileMergeLibrary`, `TensileRetuneLibrary`, `TensileUpdateLibrary`), all at 0% coverage on `develop`. **In addition**, this PR makes three product-code changes in `TensileRetuneLibrary.py` and `TensileUpdateLibrary.py` that were either prerequisites for unit-testing the modules in isolation (the import refactor) or bugs/dead code uncovered while writing the tests. All three are detailed below. Tracker (internal): https://amd-hub.atlassian.net/browse/AIHPBLAS-3405. Pairs with ROCm#7722 — see "Cross-PR coupling" below. ## Product code changes This PR is **not** tests-only. Three product-code changes are bundled: 1. **Granular import refactor** (`TensileRetuneLibrary.py`, `TensileUpdateLibrary.py`). Replaced the umbrella `from .Common import …` lines with specific submodule imports — `from .Common.GlobalParameters import …`, `from .Common.Utilities import …`, `from .Common.Constants import …`, etc. **Rationale:** lets the modules be unit-tested without pulling the full `Common` package and its hardware-side initialization. **Safety:** every name removed from the umbrella is either reachable via the new granular imports or still re-exported from `Common/__init__.py` via existing `from .Constants/.Utilities/.Parallel import *` lines (and ROCm#7722 explicitly re-exports the four `GlobalParameters` names — see Cross-PR coupling). 2. **Dead-code removal** (`TensileRetuneLibrary.py`). Removed an exact-duplicate copy of `pushWorkingPath`, `popWorkingPath`, `ensurePath`, and `setWorkingPath` (defined twice in the file on `develop`, lines 45–66 and again at 69–90 with identical bodies — the second copy was unreachable shadowing). **Safety:** the surviving copies are byte-for-byte identical to the removed ones. 3. 
**Bug fix** (`TensileRetuneLibrary.py::parseCurrentLibrary`). `GlobalParameters.globalParameters["PerformanceMetric"] = libYaml[10]` → `globalParameters["PerformanceMetric"] = libYaml[10]`. The qualified form referenced a name (`GlobalParameters`) that is never imported in this module — it would have raised `NameError` if the optional 11th element of `libYaml` ever appeared in a real config. **Locked by test:** `Tests/unit/test_TensileRetuneLibrary.py::TestParseCurrentLibrary::test_parses_library_without_size_file` mocks `LibraryIO.read` to return an 11-element list, forcing the `if len(libYaml) > 10:` branch. ## Cross-PR coupling This PR pairs with **ROCm#7722**, which adds `from .GlobalParameters import globalParameters, assignGlobalParameters, restoreDefaultGlobalParameters, __version__` to `Tensile/Common/__init__.py`. That re-export preserves the historical `from Tensile.Common import globalParameters` path for any other internal or external caller after this PR's refactor lands. The two PRs are reviewable independently but ideally land together; if landed in either order, no caller is broken (existing `*`-imports in `Common/__init__.py` already cover the other names). ## Technical Details **Modules covered in this PR:** | Module | Before | After | |--------|-------:|------:| | `Tensile/TensileLogic/Run.py` | 0.00% | **94.86%** | | `Tensile/TensileMergeLibrary.py` | 0.00% | **96.40%** | | `Tensile/TensileRetuneLibrary.py` | 0.00% | **94.87%** | | `Tensile/TensileUpdateLibrary.py` | 0.00% | **95.65%** | Project-wide TensileLite total: 24.66% → **25.68%**. **Testing approach:** - Pure-Python unit tests, no code-generator invocation. - Extensive mocking with `unittest.mock` for all external dependencies (file I/O, YAML read/write, `validateToolchain`, `ClientWriter`, `LibraryLogic`, `ProblemType`, `Solution`, `ProblemSizes`). - All tests marked `@pytest.mark.unit`. 
- Coverage targeted at normal execution paths, error conditions, edge cases, state changes, and CLI-argument parsing. Reference commit for the test additions: ROCm@a62795d ## Test Plan Pure-Python unit tests, CPU only. Run via the existing tensilelite tox `unit` env: ```bash cd projects/hipblaslt/tensilelite tox -e unit -- \ Tensile/Tests/unit/test_TensileLogic_Run.py \ Tensile/Tests/unit/test_TensileMergeLibrary.py \ Tensile/Tests/unit/test_TensileRetuneLibrary.py \ Tensile/Tests/unit/test_TensileUpdateLibrary.py ``` The bug fix in `parseCurrentLibrary` is exercised specifically by `test_TensileRetuneLibrary.py::TestParseCurrentLibrary::test_parses_library_without_size_file`. ## Test Result Tests pass; new modules at the coverage levels listed in Technical Details. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
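The `parseCurrentLibrary` fix described above can be illustrated with a stripped-down stand-in. This is a sketch, not the Tensile source: `parse_buggy`, `parse_fixed`, the local `globalParameters` dict and the `"placeholder-metric"` value are all hypothetical; only the `libYaml[10]` assignment and the `len(libYaml) > 10` guard come from the description.

```python
# Hypothetical stand-in for the dict that the module imports by name.
globalParameters = {}


def parse_buggy(libYaml):
    # Pre-fix shape: `GlobalParameters` is never imported, so this line
    # raises NameError, but only once the optional branch is taken.
    if len(libYaml) > 10:
        GlobalParameters.globalParameters["PerformanceMetric"] = libYaml[10]  # noqa: F821


def parse_fixed(libYaml):
    # Post-fix shape: use the name that is actually in scope.
    if len(libYaml) > 10:
        globalParameters["PerformanceMetric"] = libYaml[10]


ten_elements = list(range(10))
eleven_elements = ten_elements + ["placeholder-metric"]

parse_buggy(ten_elements)  # branch not taken, so the bug stays hidden

try:
    parse_buggy(eleven_elements)
    buggy_raised = False
except NameError:
    buggy_raised = True

parse_fixed(eleven_elements)
```

The sketch also shows why the bug went unnoticed at 0% coverage: a 10-element input never reaches the faulty line, which is why the locking test feeds an 11-element list.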

## Motivation

Add FFM test filters.

## Technical Details

This leverages the test filter standardization work.

## Test Plan

Run `ctest --print-labels` to see the new test categories/labels, then `ctest -L ffm-quick` to run the tests in the new category.

## Test Result

Tests in the specified category are executed. Tests outside the specified category report something like "Running 0 tests from 0 test suites" and are not executed.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Song <hsong@ctr2-alola-ctrl-01.amd.com>

…#8204)

## Motivation

SIA0+PGR2+PLR0 TDM kernels gave wrong results because the cross-wave TDM buffer swap barrier didn't wait for in-flight local reads to drain.

## Technical Details

Insert a `dscnt=0` wait before the TDM swap `_syncThreads` barrier when `numItersPLR==0`, ensuring all local reads finish before the buffer is reused.

## Test Plan

Add a test case.

## Test Result

The tox tests pass; the cases that still fail are known issues (SK: no solution).

<img width="2557" height="215" alt="image" src="https://github.com/user-attachments/assets/1e0fb8b2-2513-4c7a-844c-dba6fb97898f" />

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation Failed CI stages (e.g. Static checks) were left stuck on a `pending` GitHub status instead of reporting `failure`, so PRs showed an overall failure with no indication of which check actually failed. ## Technical Details `buildAndTest` posted `pending`/`success` statuses but its catch only rethrew, deferring failure reporting to `runOnHealthyNode` — which deferred right back. Neither posted `failure`. This adds a `failure` status post for real build errors in `buildAndTest`, while letting node-reroute signals (`NodeFault`/`TransientFault`) and aborts (`FlowInterruptedException`) propagate untouched so retries still work. Since every stage routes through `buildAndTest`, this fixes both the directly-called `Static checks` stage and the `runOnHealthyNode`-wrapped per-arch build stages. ## Test Plan Trigger a stage failure (e.g. introduce a clang-format violation) and confirm the corresponding GitHub status context transitions `pending` → `failure` rather than remaining `pending`. ## Test Result Pending CI run on a branch with a deliberate failure to confirm the status transition. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
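The status-reporting rule above can be sketched outside the pipeline. This is a Python model under stated assumptions, not the Groovy shared-library code: the exception classes, `build_and_test`, `statuses_for` and the `post_status` callback are hypothetical stand-ins for `buildAndTest`, `NodeFault`, `TransientFault` and `FlowInterruptedException`.

```python
class NodeFault(Exception):
    """Stand-in for a node-reroute signal."""


class TransientFault(Exception):
    """Stand-in for a transient-fault reroute signal."""


class FlowInterrupted(Exception):
    """Stand-in for an abort (FlowInterruptedException in the pipeline)."""


PASS_THROUGH = (NodeFault, TransientFault, FlowInterrupted)


def build_and_test(context, body, post_status):
    post_status(context, "pending")
    try:
        body()
    except PASS_THROUGH:
        # Reroute signals and aborts propagate untouched so retries still work.
        raise
    except Exception:
        # A real build error: report it before rethrowing.
        post_status(context, "failure")
        raise
    post_status(context, "success")


def statuses_for(body):
    """Run one stage and return the list of statuses it posted."""
    posted = []
    try:
        build_and_test("Static checks", body, lambda ctx, state: posted.append(state))
    except Exception:
        pass
    return posted


def fail_build():
    raise RuntimeError("clang-format violation")


def reroute():
    raise NodeFault("node went away")
```

In this model a real error leaves the context at `failure`, while a reroute leaves it at `pending` for the retry to overwrite.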

)

## Motivation

CI builds intermittently fail on transient git DNS blips (e.g. `Could not resolve host: github.com`). These surface as an untyped `exit code 1`, which the existing node/transient-fault retry doesn't catch, so a momentary glitch fails the whole build.

## Technical Details

Added `gitNetRetry(label, body)` (3 attempts, 15s backoff) and wrapped every github.com-touching git step: ref-repo clone/update, `checkout scm`, and the hipTensor clone. All are idempotent on retry. Docker pulls are left to the existing `pullImage()` path.

## Test Plan

- Mapped the failing build's `git remote update` DNS error to a now-wrapped call.
- Confirmed no existing code retries git host-resolution failures.

## Test Result

This is a Groovy shared library and cannot be executed locally; it needs a pipeline run to fully validate. Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
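The retry wrapper described above is a fixed-backoff loop. The following is a Python sketch of that pattern, not the Groovy `gitNetRetry` itself: the function name `git_net_retry`, the injectable `sleep` parameter, and the choice to retry on any exception are assumptions; only the 3 attempts and the 15s backoff come from the description.

```python
import time


def git_net_retry(label, body, attempts=3, backoff_s=15, sleep=time.sleep):
    """Run body(), retrying on failure with a fixed delay between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return body()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"[{label}] attempt {attempt}/{attempts} failed: {exc}; "
                  f"retrying in {backoff_s}s")
            sleep(backoff_s)


# Demo with a body that fails twice, then succeeds. The fake sleep records
# the delays instead of actually waiting.
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Could not resolve host: github.com")
    return "ok"


result = git_net_retry("ref-repo update", flaky_fetch, sleep=delays.append)
```

Retrying is only safe because the wrapped steps are idempotent, as the description notes; a non-idempotent step would need a cleanup hook before each retry.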

…m#8533) ## What The **Configure CI** job (`component-ci.yml`) over-triggers component jobs for PRs whose branch is behind `develop`. Example: PR ROCm#8498 touches only `shared/stinkytofu/**` yet ran the full **MIOpen** matrix (and rocISA). ## Root cause The step overrode `BASE_REF` with the PR base-branch **tip**: ```yaml env: BASE_REF: ${{ github.event.pull_request.base.sha || 'HEAD^' }} ``` `component_ci.py` → `get_modified_paths()` then does a **two-dot** diff: `git diff --name-only <base.sha>`. When the PR branch lags behind `develop`, `base.sha` is **not an ancestor** of the PR head, so `git diff base.sha HEAD` reports every file merged into `develop` since the fork point — not just the PR's changes. Verified on ROCm#8498's actual CI SHAs: | Diff | Files detected | |---|---| | `git diff base.sha HEAD` (two-dot, current) | 90 `projects/` incl. **9 `projects/miopen/`** + 4 `shared/` | | PR's real change set | 3 `shared/stinkytofu/` files | `git merge-base --is-ancestor base.sha HEAD` → exit 1, confirming base.sha is not an ancestor. ## Fix Drop the `BASE_REF` override so it defaults to `HEAD^`, matching the working pattern already used by `therock-ci.yml` (which uses the same `get_modified_paths` function with no override). On a `pull_request` event GitHub checks out the **merge commit**, whose `HEAD^` is the base-branch tip — so `git diff HEAD^` yields exactly the PR's own changes, regardless of how far behind the branch is. The existing `fetch-depth: 2` already provides `HEAD^`. No change to `ci_utils.py` — it stays identical to what `therock-ci.yml` relies on.
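The over-triggering can be reproduced with a toy model that treats each commit as a snapshot of path-to-content. This is an illustrative sketch under stated assumptions: the file names and `changed_paths` are made up, and the model assumes the stale diff compares the base-branch tip against a tree that lacks develop's newer commits.

```python
def changed_paths(old_tree, new_tree):
    """Paths whose content differs between two snapshots,
    analogous to `git diff --name-only OLD NEW`."""
    return sorted(
        path
        for path in old_tree.keys() | new_tree.keys()
        if old_tree.get(path) != new_tree.get(path)
    )


fork_point = {"projects/miopen/a.cpp": "v1", "shared/stinkytofu/x.cpp": "v1"}

# develop moved on after the PR branched off.
develop_tip = {**fork_point, "projects/miopen/a.cpp": "v2"}

# The PR only touches stinkytofu.
pr_head = {**fork_point, "shared/stinkytofu/x.cpp": "v2"}

# The merge commit is develop's tip plus the PR's change.
merge_commit = {**develop_tip, "shared/stinkytofu/x.cpp": "v2"}

# Comparing the base tip against a tree that lacks develop's newer commits
# reports develop's changes as if the PR had made them.
stale = changed_paths(develop_tip, pr_head)

# Comparing the merge commit against its first parent reports only the PR.
clean = changed_paths(develop_tip, merge_commit)
```

Here `stale` contains the MIOpen path even though the PR never touched it, while `clean` contains only the stinkytofu path.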

@spolifroni-amd

## Motivation

RPP sets `RPP_AUDIO_SUPPORT=ON` by default. However, we do not pull recursively in TheRock and FFTS is never installed, so this should be OFF by default and only turned on when FFTS is found.

## Technical Details

Changed the order of the checks in CMakeLists.txt. This is already documented by @spolifroni-amd in ROCm#8377.

## Test Plan

Build and run ctests.

## Test Result

Build and ctests should pass.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…ion (ROCm#8147) # Fix uninitialized read of a_type/b_type in `rocblaslt_matmul_valid_args` ## Summary `rocblaslt_matmul_valid_args` passes its output-reference parameters `a_type` and `b_type` *by value* to `validateMatmulSwizzleArgs` **before** they are assigned from `matA->type` / `matB->type`. The callee then evaluates `isValidOrderForDatatype(b_type, matB->order)` against uninitialized stack memory, causing spurious `rocblaslt_status_invalid_value` returns whenever the garbage happens to compare equal to one of the checked `hipDataType` constants. Fix: pass `matA->type` / `matB->type` directly. One-line change. ## Background The bug surfaced as 21 `pre_checkin_matmul_swizzleB_f8_fnuz_*` failures in [TheRock CI run 26923271597](https://github.com/ROCm/TheRock/actions/runs/26923271597), originally bisected to the rocm-systems bump `d34cbb6..315a69d`. Investigation showed: - The failure is **not** caused by any rocm-systems commit. None of the 7 CLR-touching commits in that range actually trigger the bug when individually reverted. - The bug reliably reproduces at tip-of-develop in `rocm-systems` (where the suspect PR ROCm#4755 has already been reverted upstream by `c528809566`) when HIP is built in **Debug** mode. - The bug also reproduces with rocm-systems pinned at `315a69d`, regardless of which CLR commits are present. - The rocm-systems bump merely shifted stack layout enough that the garbage at the `b_type` slot on the second hipblasLtMatmul call became exactly `2` (= `HIP_R_16F`), tripping the first arm of the `isValidOrderForDatatype` check. The bug is latent UB in hipBLASLt that has been present since the swizzle validation code was added. Release builds happened to leave benign values in those slots; Debug builds (and post-bump Release builds) do not. ## Root cause In `library/src/amd_detail/rocblaslt/src/include/rocblaslt_mat_utils.hpp:354-377`: ```cpp rocblaslt_status rocblaslt_matmul_valid_args(... hipDataType& a_type, // OUTPUT ... 
hipDataType& b_type, // OUTPUT ...) { hipblasOperation_t opA = matmul_descr->op_A; hipblasOperation_t opB = matmul_descr->op_B; auto matmul_swizzle_status = validateMatmulSwizzleArgs(matmul_descr, matA, matB, a_type, b_type, // ← uninitialized at this point swizzleA, swizzleB); if(matmul_swizzle_status != rocblaslt_status_continue) return matmul_swizzle_status; ... a_type = matA->type; // assignment happens after the validation call ... b_type = matB->type; ``` `validateMatmulSwizzleArgs` takes `a_type` / `b_type` by value (not by reference), so the read happens at the call site. The callee then runs: ```cpp if(swizzleB && !isValidOrderForDatatype(b_type, matB->order)) return rocblaslt_status_invalid_value; ``` `isValidOrderForDatatype` compares its `datatype` argument against `HIP_R_16F`, `HIP_R_16BF`, `HIP_R_8F_E4M3`, `HIP_R_8F_E4M3_FNUZ`, `HIP_R_4F_E2M1`. Any match against an uninitialized value that incidentally collides with one of these will cause a spurious `false` return whenever `matB->order` doesn't match the order required for that data type. Traced values from the failing run (second hipblasLtMatmul call): ``` matB->type=1000 (HIP_R_8F_E4M3_FNUZ, correct) matB->order=100 (HIPBLASLT_ORDER_COL16_4R16, correct for f8) b_type(arg)=2 (HIP_R_16F — uninitialized garbage) → check (HIP_R_16F && order != COL16_4R8 [=101]) → returns false → invalid_value ``` ## The fix ```diff + // FIX: a_type/b_type are output references and are uninitialized here; pass the + // actual matrix types so validateMatmulSwizzleArgs's isValidOrderForDatatype check + // operates on real values. 
auto matmul_swizzle_status - = validateMatmulSwizzleArgs(matmul_descr, matA, matB, a_type, b_type, swizzleA, swizzleB); + = validateMatmulSwizzleArgs(matmul_descr, matA, matB, matA->type, matB->type, swizzleA, swizzleB); ``` A cleaner long-term refactor would drop the `a_type` / `b_type` parameters from `validateMatmulSwizzleArgs` entirely and read `matA->type` / `matB->type` inside the function (it already receives both matrix layouts). ## Test plan - [x] Reproduce locally with a Debug HIP build at tip-of-develop in rocm-systems: 20/21 of the `matmul_swizzleB_f8_fnuz_*` tests fail on the original `--gtest_filter` from the CI failure. - [x] Minimum repro: `hipblaslt-test --gtest_filter="*pre_checkin_matmul_swizzleB_f8_fnuz_rf8_fnuz_rf16_rf16_rf32_r_TN_128_128_129_1_129_2440_0_128_128_1" --gtest_repeat=2` — first iteration PASS, second FAIL. - [x] Apply this fix → all 21 tests PASS, repeat=2 PASS/PASS. - [x] Re-verified with the rocm-systems bump (`315a69d`) restored on top — also passes. - [ ] Run the full hipblaslt-test suite to ensure no regressions. - [ ] Run TheRock multi-arch CI on the fix branch — should restore the previously-failing CI runs without needing the rocm-systems revert. ## Notes - This fix obsoletes any need to keep [rocm-systems revert PR ROCm#6861](ROCm/rocm-systems#6861) (`c528809566`). That PR worked around the symptom by re-shuffling stack layout; this PR fixes the actual UB. - Recommend running a sanitizer build (UBSan / MSan) over `library/src/amd_detail/rocblaslt/src/` to catch similar latent issues. MSan would have caught this immediately.
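The traced failure can be replayed with a small model of the order check. This is a Python sketch, not the hipBLASLt code: the numeric values are the ones quoted in the trace above, the `REQUIRED_ORDER` table covers only the two data types that appear in that trace, and `is_valid_order_for_datatype` is a hypothetical simplification of `isValidOrderForDatatype`.

```python
# Numeric values as quoted in the trace.
HIP_R_16F = 2
HIP_R_8F_E4M3_FNUZ = 1000
ORDER_COL16_4R16 = 100
ORDER_COL16_4R8 = 101

# Partial table: only the two entries the trace lets us infer.
REQUIRED_ORDER = {
    HIP_R_16F: ORDER_COL16_4R8,
    HIP_R_8F_E4M3_FNUZ: ORDER_COL16_4R16,
}


def is_valid_order_for_datatype(datatype, order):
    # A datatype that is not in the table is not constrained by this check.
    required = REQUIRED_ORDER.get(datatype)
    return required is None or order == required


mat_b_type = HIP_R_8F_E4M3_FNUZ
mat_b_order = ORDER_COL16_4R16
garbage_b_type = 2  # what the uninitialized slot happened to hold

# Before the fix: the check sees the garbage value, not matB->type.
before_fix = is_valid_order_for_datatype(garbage_b_type, mat_b_order)

# After the fix: the check sees the real matrix type.
after_fix = is_valid_order_for_datatype(mat_b_type, mat_b_order)
```

The same inputs give opposite answers depending only on what the stack slot held, which matches a failure that appeared after a stack-layout shift rather than after any logic change.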

… workspace buffer(s) required by hipfft (ROCm#8338) ## Motivation In their early stages, the accuracy tests leverage a [check](https://github.com/ROCm/rocm-libraries/blob/15ab693dea8194ea743db85ff0d29730eef8ad84/projects/rocfft/shared/accuracy_test.h#L47) to assess whether the attempted test configuration's device footprint is expected to fit within bounds (device limits if no bounds are specified explicitly). Precise footprint assessment cannot work around actually creating the corresponding plan. While `rocfft_params` objects do so with minimal device allocations, similar logic is missing in `hipfft_params` at the moment. This may result in plan creation failures due to attempting to allocate an excessively large workspace (either internally to `hipfft` or externally, by `hipfft_params::set_externally_managed_work_areas()`), reported thereafter as test failures (instead of a test being skipped on the system). ## Technical Details - `vram_footprint()` now creates a temporary `hipfft_params` copy with a `vram_footprint_workspace_probe_mode` flag set, generates a plan without internal workspace allocation, and queries the required workspace size via `hipfftGetSize` instead of reading already-allocated buffer sizes. - An explicit copy constructor was added to `hipfft_params` to support the temporary-copy probing pattern; it copies transform configuration but resets plan handles and multi-GPU state. - `is_preventing_auto_allocation_at_generation()` and `need_separate_create_make()` both return `true` when in probe mode, ensuring the separate `hipfftCreate`/`hipfftSetAutoAllocation`/`hipfftMakePlan*` flow is always used during probing. - `set_externally_managed_work_areas()` is skipped when in probe mode, keeping workspace probing side-effect free. - The AMD hipFFT backend's `handle_exception` now maps `DEVICEBUF_MEM_USAGE` to `HIPFFT_ALLOC_FAILED` for consistent error reporting. ## Test Plan Current tests suffice. 
Test robustness on specific testing platforms should be improved. ## Test Result Tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…els for K<512 with large free dim (ROCm#8604) ## Summary Adds subtile awareness to Origami's existing (CMS-style) heuristics database and an initial heuristic to skip subtile kernels in a regime where they are not competitive. Also includes the gfx950 BBS TN subtile library update ("updated lib") that introduces the subtile (`UseSubtileImpl`) solutions these heuristics act on. - Plumbs a `subtile` flag through the hipBLASLt prediction-library integration point: `UseSubtileImpl` (Contractions.py) -> `SizeMapping.useSubtileImpl` (C++ struct + serialization, `mapOptional` for back-compat) -> `origami config_t.subtile` in `Serialization/PredictionLibrary.hpp`. - Extends `heuristic_key_t` to match on `subtile`, and adds a `reject` param to `heuristic_params_t` that forces `compute_total_latency` to return max latency (`rank_configs` already drops max-latency configs). - Initial heuristic, scoped to **gfx950 / BF16 / TN** (a_transpose=T, b_transpose=N): reject subtile kernels when `K < 512` AND (`M > 1024` OR `N > 1024`). The OR over free dims is expressed as two entries that both set `reject`. ## Status I'm actively running experiments to refine this heuristic. This is the best version I've seen so far, but the exact thresholds/conditions may still change. ## Test plan - [ ] Build hipBLASLt from this branch (gfx950) so `UseSubtileImpl` propagates into `sizeMapping.useSubtileImpl`. - [ ] Confirm gfx950 BF16 TN subtile kernels with K<512 and a large free dim are not selected vs. baseline. - [ ] Spot-check that other arch/dtype/layouts, non-subtile kernels, and K>=512 are unaffected. Verified locally (origami unit-level, ANALYTICAL_GEMM_HEURISTICS=1): reject fires only for gfx950+BF16+TN+subtile+K<512+(M>1024 or N>1024); wrong arch/dtype/layout, non-subtile, K>=512, and small free dims are all kept. --------- Co-authored-by: smalekta <smalekta@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
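The initial reject rule reads as a single predicate. The sketch below restates it in Python for clarity; `Config` and `reject_subtile` are hypothetical names, and the real implementation expresses the OR over free dimensions as two database entries rather than one boolean expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    arch: str
    dtype: str
    a_transpose: bool  # True means "T"
    b_transpose: bool  # False means "N"
    subtile: bool


def reject_subtile(cfg, m, n, k):
    """True when the heuristic would force max latency for this config."""
    in_scope = (
        cfg.arch == "gfx950"
        and cfg.dtype == "BF16"
        and cfg.a_transpose
        and not cfg.b_transpose
        and cfg.subtile
    )
    return in_scope and k < 512 and (m > 1024 or n > 1024)


subtile_tn = Config("gfx950", "BF16", True, False, True)
non_subtile_tn = Config("gfx950", "BF16", True, False, False)
```

For example, `reject_subtile(subtile_tn, 4096, 256, 256)` is true, while the same shape with `K = 512`, with both free dimensions at or below 1024, or with a non-subtile kernel is kept.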

…ox tests and api_tests for export surface (ROCm#8583) ## Summary - **OBJECT library pattern**: `stinkytofu_objs` compiles all sources once. Both `stinkytofu` (shared/static per `BUILD_SHARED_LIBS`) and `stinkytofu_static` (test-only twin) consume `$<TARGET_OBJECTS:stinkytofu_objs>` without recompilation. - **`stinkytofu_static`**: Created only when `BUILD_SHARED_LIBS=ON AND STINKYTOFU_BUILD_TESTS`. `unit_tests` links this so all symbols (including non-exported ones) are accessible for white-box coverage testing — no `STINKYTOFU_EXPORT` annotations needed on internal symbols. - **`api_tests`**: New test binary linking `stinkytofu` (shared). Tests every symbol used by rocisa and stinkytofu-opt. A linker failure here means a symbol was accidentally un-exported — catching regressions before they reach downstream consumers. - **Plugin isolation fix**: Plugin `.dll` links `stinkytofu.dll`; mixing with `stinkytofu_static` creates two registry instances. Plugin integration tests now live in `api_tests` (shared build) where registry is shared. - **Python staleness check scoped**: Staleness scan now covers only `src/` and `include/` — the directories actually compiled into `_stinkytofu.so`. New files in `tests/`, `tools/`, `examples/` no longer trigger a false stale warning. 
## Architecture

```
stinkytofu_objs (OBJECT)              ← compiled once, stinkytofu_EXPORTS defined
    │
    ├─► stinkytofu (SHARED or STATIC) ← rocisa, stinkytofu-opt
    │
    └─► stinkytofu_static (STATIC)    ← unit_tests only, STINKYTOFU_STATIC defined
                                        (only created when BUILD_SHARED_LIBS=ON)

api_tests  ──► stinkytofu (shared)    ← verifies export surface at link time
unit_tests ──► stinkytofu_static      ← white-box access to all symbols
```

## Test plan

- [x] `invoke coverage` passes with all unit_tests and api_tests green
- [x] `PluginIntegrationTest` runs in api_tests (shared lib, shared registry)
- [x] Static build (`BUILD_SHARED_LIBS=OFF`): `unit_tests` links `stinkytofu` directly, no `stinkytofu_static` created
- [x] Python staleness check no longer fires on new test/tool/example files

…ding SR variant)

…t yaml

Use pk8 instructions to convert 8 F32 values in a single instruction during global write, replacing 8 individual conversion calls:

- v_cvt_scalef32_sr_pk8_fp8_f32 for FP8 with stochastic rounding
- v_cvt_scalef32_pk8_fp8_f32 for FP8 without SR
- v_cvt_scalef32_pk8_bf8_f32 for BF8 without SR

Note: pk8 scale operand extracts FP32 exponent (2^(127-exp)), not a direct multiplier. Alpha is applied via v_mul_f32 with scale=1.0.

Also includes:

- Fix skipRearrangement when bias/activation/alpha modifies data
- Add f8f8s_fuse_alpha_gfx1250.yaml with pk8 test coverage for FP8 SR, FP8 non-SR, BF8 non-SR, and bias+activation paths

Complex acc VGPRs interleave real/imag, so the relative offset elementSumIdx[i]-elementSumIdx[0] cannot locate the imag half. Disable skipRearrangement for complex to use the correct reorder path.

… SK5 (ROCm#8568) ## Motivation Switch gfx950 device library logic YAML files (excluding MX kernels) from SK3 to SK5 with default-OFF mode. SK5-OFF executes the SK3 code path at runtime but enables future hybrid-mode scheduling when turned ON. **Depends on:** PR ROCm#8162 (SK5 kernel infrastructure in TensileLite) ## Changes - gfx950 Equality, Origami, GridBased, and Range YAML files (519 files) updated: - `StreamK: 3` → `StreamK: 5` in solution parameters - `_SK3_` → `_SK5_` in kernel/solution name strings - No host-side code changes - No custom kernel assembly changes **MX kernels are NOT changed** — both F4/MXFP4 microscaling (`F4*_MXA32_MXB32`) and matrix-bias (`S_MX_B`) variants remain on SK3, as they do not yet support the SK4 path used by SK5 hybrid mode. ## Change Breakdown by Directory | Directory | Files | ~Lines Changed | Description | |-----------|-------|----------------|-------------| | Equality (gfx950) | 23 | ~46K | Core equality logic (excl. MX) | | Equality (gfx950_id75a3) | 23 | ~38K | id75a3 variant (excl. MX) | | Equality (gfx950_id75a8) | 3 | ~4K | id75a8 variant | | Origami (root) | 153 | ~113K | Origami base (excl. MX) | | Origami_nta4 | 153 | ~113K | Origami NTA4 (excl. MX) | | Origami_ntb4 | 153 | ~113K | Origami NTB4 (excl. MX) | | GridBased | 8 | ~2K | Grid-based solutions | | Range | 3 | 36 | Range-based solutions | | **Total** | **519** | **~430K** | | ## Split PRs for Review Since GitHub cannot render diffs for 519 files / 430K lines, the PR is split into 4 reviewable parts: | Part | Content | Files | PR | |------|---------|-------|----| | 1 | Equality + GridBased + Range | 60 | ROCm#8596 | | 2 | Origami base | 153 | ROCm#8597 | | 3 | Origami_nta4 | 153 | ROCm#8598 | | 4 | Origami_ntb4 | 153 | ROCm#8599 | All YAML changes are the same mechanical substitution (`StreamK: 3` → `5`, `_SK3_` → `_SK5_`). Reviewing Part 1 in detail is sufficient; Parts 2-4 are the same pattern applied to Origami files. 
## Branch Verification The 4 split branches are merged into a verification branch. If the compare below shows **no diff**, the split is complete and correct: **[Compare: main branch vs merge of all 4 parts](ROCm/rocm-libraries@users/jolabega/sk5-device-library-default-off...users/jolabega/sk5-default-off-merge-verify)** ```mermaid graph LR develop[origin/develop] --> part1[Part 1: Equality] develop --> part2[Part 2: Origami root] develop --> part3[Part 3: Origami_nta4] develop --> part4[Part 4: Origami_ntb4] part1 --> merged[merge-verify branch] part2 --> merged part3 --> merged part4 --> merged merged -->|"compare shows no diff"| main[Main PR branch] ``` ## Test Plan - CI device library validation (`TensileLogic --check-all`) - `hipblaslt-test` pre_checkin suite on gfx950 ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. ## Risk level Low — changes are limited to device library logic YAML files (kernel selection metadata). No host code, no algorithm changes, no assembly changes. MX kernels excluded. **Associated ticket:** AIHPBLAS-3921
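The mechanical substitution can be sketched as a small text transform. This is an illustration only, not the script used for the PR: `sk3_to_sk5`, `is_mx` and the idea of detecting MX files by name are assumptions; the two replacements and the `_MXA32_MXB32` / `S_MX_B` markers are taken from the description.

```python
import re

# Markers for the MX variants that must stay on SK3 (assumed to be
# recognizable from the file name).
MX_MARKERS = ("_MXA32_MXB32", "S_MX_B")


def is_mx(filename):
    return any(marker in filename for marker in MX_MARKERS)


def sk3_to_sk5(filename, text):
    """Apply the SK3 -> SK5 substitution unless the file is an MX variant."""
    if is_mx(filename):
        return text
    # \b keeps values such as "StreamK: 30" from being rewritten.
    text = re.sub(r"\bStreamK: 3\b", "StreamK: 5", text)
    return text.replace("_SK3_", "_SK5_")


sample = "StreamK: 3\nSolutionNameMin: Cijk_SK3_MT128\n"
converted = sk3_to_sk5("gfx950_Cijk_BBS.yaml", sample)
untouched = sk3_to_sk5("gfx950_F4_MXA32_MXB32.yaml", sample)
```

Because both replacements are idempotent, running the transform twice gives the same result, which makes the merge-verify comparison a meaningful completeness check.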

…n data (ROCm#8522) ## Summary The SDPA forward kernel supports an optional LSE (Log-Sum-Exp) statistics output tensor, but the golden data generator ignores the `--stats` flag and the C++ plan builder rejects any graph that requests it. This blocks backward golden data generation because the backward kernel requires LSE as an input tensor. This PR wires up LSE output end-to-end across the Python generator, provider C++, CPU reference executor, and integration tests. ## Risk Assessment Low risk. Changes are scoped to test infrastructure (golden data generator, CPU reference executor, test harness) and provider forward plan internals behind an existing `generate_stats` gate. No public API changes. Unit and integration tests pass with no regressions. ## ASIC Coverage Standard PR CI is sufficient. The changes affect test infrastructure (golden data generation, CPU reference execution) and provider-internal plan building behind an existing feature gate. No kernel selection, support surface, or default behavior changes. ## Testing Summary - Unit tests: 305/305 hip-kernel-provider tests pass with no regressions and backward compatibility verified. - Integration tests: 12/12 CPU golden reference tests pass across quick and standard tiers for BF16/FP16 with and without stats. - Cross-validation: AITER v3 forward cross-validation passed for all 6 BF16 bundles (max_abs < 0.001). - Backward compatibility: `--stats` omitted produces unchanged output (4 tensors, `generate_stats: null`). 
## Testing Checklist - [x] hip-kernel-provider unit tests - `./build/bin/hip_kernel_provider_tests` - Status: Passed - [x] CPU golden reference integration tests - `hipdnn_integration_tests --gtest_filter="*CpuSdpaFwd*"` - Status: Passed - [x] AITER v3 cross-validation - 6 BF16 bundles, max_abs < 0.001 - Status: Passed - [ ] PR CI - GitHub PR checks - Status: Pending ## Technical Changes - Add `compute_lse()` to Python generator to produce reference LSE tensor (uid=4, shape `[B, H_q, S_q, 1]`, FP32) via `torch.logsumexp`, wired to `--stats` flag. - Fix FP16 dtype string (`"half"` not `"float16"`) and AITER version detection in generator; bump version to 1.0.1. - Add `lseUid` and `lseStrideHead` to `SdpaFwdParams` and wire conditional LSE pointer/stride logic in `SdpaFwdPlanBuilder` and `SdpaFwdPlan`. - Update CPU reference executor to accept and compute rank-4 LSE `[B, H, Sq, 1]` instead of rejecting stats graphs. - Introduce `PlanNotApplicableException` typed exception for structured error handling when a CPU plan builder is not applicable. - Reorganize golden data bundles into `quick/` and `standard/` tiers with FP16 and stats variants (12 bundles total, DVC-pushed). --------- Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
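The reference LSE tensor is a log-sum-exp over the key dimension with a trailing singleton axis. The generator uses `torch.logsumexp`; the version below is a pure-Python sketch of the same reduction so it runs without PyTorch. The nested-list layout and the `logsumexp` helper are illustrative assumptions, and any scaling or masking the real generator applies to the scores beforehand is not modeled.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(x))) for one row of scores."""
    peak = max(row)
    return peak + math.log(sum(math.exp(x - peak) for x in row))


def compute_lse(scores):
    """scores: nested lists shaped [B][H_q][S_q][S_kv].
    Returns LSE shaped [B][H_q][S_q][1]."""
    return [
        [[[logsumexp(row)] for row in head] for head in batch]
        for batch in scores
    ]


# B=1, H_q=1, S_q=2, S_kv=2. The second row would overflow a naive exp().
scores = [[[[0.0, 0.0], [1000.0, 1000.0]]]]
lse = compute_lse(scores)
```

Subtracting the row maximum before exponentiating is what keeps the second row finite; the result there is `1000 + log(2)`.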

…ucket; fix Linux bindings preload (ROCm#8581) ## Summary Makes the dnn-benchmarking ROCm PyTorch setup work on gfx90a (MI200/MI210/MI250) and makes the bare-image build self-contained from a single input. Three changes: 1. **Nightly bucket** — `setup.sh` selected the `gfx90X-dcgpu` bucket, frozen at torch 2.12 / ROCm 7.12 (predates several ROCm SDK libraries, notably hipDNN). Point gfx90a at the current `gfx90a` bucket (torch 2.13 / ROCm 7.14, which ships hipDNN). 2. **`--gpu-arch` as single source of truth** — `setup.sh` now derives the HIP offload target from `--gpu-arch` and passes `-DGPU_TARGETS` to the hipDNN/provider cmake builds, and its torch-mode probe tolerates the GPU-less `import torch` SDK warning. A bare/GPU-less build (e.g. `docker build`) no longer needs external `GPU_TARGETS`/`AMDGPU_TARGETS`/`ROCM_SDK_TARGET_FAMILY` env or GPU-discovery shims. 3. **Linux bindings preload** — preload the base HIP runtime via `rocm_sdk` instead of the `hipdnn` shortname. ## Risk Assessment Low risk. Changes are confined to the dnn-benchmarking `setup.sh` and the hipDNN Python bindings' import-time preload list. gfx942/gfx950 bucket selection is unchanged. The new arch handling uses hipDNN's documented `-DGPU_TARGETS` interface and only affects the device-code builds (the nanobind bindings are host-only). All paths are verified end-to-end on MI210 in a clean Ubuntu 24.04 + Python 3.12 environment built with no ROCm-related env beyond `--gpu-arch`. The Windows preload list is unchanged (still names `hipdnn`) but was not exercised. ## Testing Summary - Full dnn-benchmarking suite from a clean Ubuntu 24.04 + Python 3.12 install following the documented manual steps, passing only `--gpu-arch` (no `GPU_TARGETS`/`ROCM_SDK_TARGET_FAMILY` env, no offload shims). Validates that setup.sh's `-DGPU_TARGETS` device codegen builds and runs on the MI210. - GPU/ROCm-marked subset, exercising hipDNN-backed execution through the engine plugins. 
- Import smoke with venv activation only (no `LD_LIBRARY_PATH` workarounds). - `get_torch_mode` final-line parsing unit-checked against warning-polluted stdout (incl. a warning with no trailing newline) and against a plain mode-only interpreter. ## Testing Checklist - [x] dnn-benchmarking suite (bare Ubuntu 24.04, `--gpu-arch` only) - `pytest -m "not cuda"` - ASICs: gfx90a - Status: Passed (815 passed, 8 skipped, 8 xfailed) - [x] GPU/ROCm subset - `pytest -m "gpu or rocm"` - ASICs: gfx90a - Status: Passed (55 passed) - [x] setup.sh unit tests (incl. existing-CUDA-venv rejection) - `pytest tests/unit/cli/test_setup_script.py` - Status: Passed (4 passed) - [x] Bindings import, venv-only - `python -c "import hipdnn_frontend"` - ASICs: gfx90a - Status: Passed - [ ] Windows ROCm-wheel import - Status: Not run - [ ] PR CI - GitHub PR checks - Status: Pending ## Technical Changes - `setup.sh` (bucket): map `gfx90a` to the `gfx90a` ROCm nightly bucket (current; ships hipDNN) instead of the frozen `gfx90X-dcgpu` family bucket. `gfx942`/`gfx950` keep their `-dcgpu` buckets. - `setup.sh` (arch ownership): resolve the GPU arch once (`--gpu-arch` or detection) and pass `-DGPU_TARGETS`/`-DAMDGPU_TARGETS` to the `build_hipdnn`/`build_provider` cmake invocations; export `PYTORCH_ROCM_ARCH` as belt-and-suspenders. The wheel SDK ships no `rocm_agent_enumerator`/`offload-arch` on PATH and the build may run with no GPU, so HIP cannot autodetect the offload arch — supplying it explicitly removes the need for caller-set env and shims. - `setup.sh` (torch-mode): `get_torch_mode()` prints the mode on its own final line and the caller reads only that last line, so the GPU-less `import torch` SDK warning no longer corrupts torch-mode detection (a plain mode-only interpreter still works). Removes the `ROCM_SDK_TARGET_FAMILY` workaround. 
- `hipdnn_frontend/__init__.py`: on Linux, preload the base HIP runtime (`amd_comgr`, `amdhip64`, `hiprtc`) instead of `hipdnn` (`rocm_sdk` resolves base ROCm SDK packages; the libraries prefix is already on `LD_LIBRARY_PATH`). Windows keeps the full list including `hipdnn`.
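The torch-mode fix relies on a simple contract: the probe prints the mode on its own final line, and the caller reads only that line. A Python sketch of that contract follows. `emit_mode` and `parse_torch_mode` are hypothetical names, the `"rocm"` and `"cpu"` strings are sample values, and the leading newline in `emit_mode` is an assumption about how the mode ends up on its own line after a warning that lacks a trailing newline.

```python
def emit_mode(mode):
    """Producer side: start a fresh line so the mode is never glued
    to earlier output."""
    return "\n" + mode + "\n"


def parse_torch_mode(stdout):
    """Caller side: return the last non-empty line, or None if there is none."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    return lines[-1] if lines else None


polluted = "UserWarning: no GPU detected\n" + emit_mode("rocm")
no_newline = "UserWarning: no GPU detected" + emit_mode("rocm")
plain = "cpu\n"
```

All three inputs parse to the bare mode, so a warning on stdout no longer changes the detected mode.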

## Motivation

Optimize pipeline V3 for gfx950 by enabling buffer load to LDS (async pipeline).

## Technical Details

- Add an `Async` bool to the `Problem` struct to enable the async pipeline within the existing one.
- Add `static_move_ys` to load transpose. This generates the offset in the assembly instructions, saving registers.
- Add `is_valid` to `async_get_vectorized_elements`. It was previously hard-coded to true; the parameter makes padding support possible.
- Remove unnecessary restrictions on `is_a_load_tr` and `is_b_load_tr` (wider use of LDS load transpose on gfx950).
- Integrate async support into the existing V3 pipeline (avoids duplicating pipelines).
- Create a policy that supports both the async and default cases. Any async pipeline could use it (next steps).
- Define `wg_attr_num_access` separately for A and B. This allows optimizing the `ds_read` instruction width when one matrix is transposed and the other is not; previously `ds_read_b64` was used instead of `ds_read_b128` in such cases.
- Add a test for V3 async. It currently supports only cases where A and B have the same type.

## Test Plan

New test `test_ck_tile_gemm_pipeline_compv3_async`

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

…CPU probe + setup.ps1) (ROCm#8255)

## Summary

Make the dnn-benchmark tool and its non-GPU test suite run on Windows. This replaces the Unix-only CPU-time probe with a cross-platform one, adds a Windows PowerShell setup script (`setup.ps1`), and updates platform-specific unit-test assumptions so the non-GPU suite passes on Windows.

## Risk Assessment

Low risk. Changes are confined to the dnn-benchmark tool: a cross-platform rework of the CPU-time probe, an additive Windows setup script, and platform-aware test assertions. No public API, schema, or build-behavior changes; the non-GPU suite passes locally on Windows.

## Testing Summary

- dnn-benchmark non-GPU unit/integration suite on Windows (CPython 3.12, CPU torch).
- CLI entry-point smoke check.
- `setup.ps1` run end-to-end into the Windows ROCm-wheel env.

## Testing Checklist

- [x] dnn-benchmark non-GPU tests
  - `pytest -m "not gpu"`
  - Status: Passed (684 passed, 10 skipped)
- [x] CLI smoke
  - `python -m dnn_benchmarking --help`
  - Status: Passed
- [x] Windows setup script
  - `setup.ps1` against the ROCm-wheel env
  - Status: Passed
- [ ] GPU tests
  - require the MIOpen provider engine plugins (Linux-only)
  - Status: Not run
- [ ] PR CI
  - GitHub PR checks
  - Status: Pending

## Technical Changes

- `metrics/host.py`: sample process CPU time via `os.times`, the cross-platform stdlib accessor (backed by `GetProcessTimes` on Windows), so `CpuTimeProbe` works on Windows and the module no longer depends on the Unix-only `resource` module. `_process_cpu_times()` returns a plain `(user, kernel)` tuple that the probe consumes positionally. The probe wraps the whole benchmark loop and reports a per-iteration average, so `os.times`'s clock-tick resolution is immaterial; it degrades to `None` if `os.times` is unavailable.
- `setup.ps1` (new): a Windows PowerShell analogue of `setup.sh`. Installs dnn-benchmark into a selected Python env, optionally builds hipDNN + Python bindings + the MIOpen provider from source (`-ForceBuild`) inside an MSVC vcvars64 + Windows SDK environment, wires the compiled bindings via a `.pth`, installs PyTorch per `-TorchMode` (`cpu`/`existing`/`none`), and verifies the result. The venv management (`--reuse-venv`/`--workspace`) and ROCm/CUDA torch modes of `setup.sh` are intentionally omitted (not applicable on Windows).
- Tests: make platform-specific unit tests Windows-aware: compare paths via `Path` / `as_posix()` / `str(Path(...))` rather than hard-coded POSIX separators, embed a TOML fixture path as POSIX, update `test_host.py` for the tuple-based CPU-time sample, and skip two inherently Unix-only tests (a `?`-in-filename sqlite URI case and a `+x` executable-bit case that `os.access(X_OK)` cannot model on Windows).

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
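The probe design described under Technical Changes can be sketched roughly as follows. This is a minimal illustration and not the code in `metrics/host.py`: the function and class bodies, the `iterations` parameter, and the `average` attribute are assumptions made for the example. Only `os.times()` and its `user` / `system` fields are standard-library facts.

```python
import os
from typing import Optional, Tuple


def process_cpu_times() -> Optional[Tuple[float, float]]:
    """Return (user, kernel) CPU seconds for this process, or None if unavailable."""
    try:
        t = os.times()
    except (AttributeError, OSError):
        # Assumed fallback: degrade to None rather than fail the benchmark.
        return None
    return (t.user, t.system)


class CpuTimeProbe:
    """Hypothetical probe: wraps a whole loop and reports per-iteration averages."""

    def __init__(self, iterations: int) -> None:
        self.iterations = iterations
        self.average: Optional[Tuple[float, float]] = None
        self._start: Optional[Tuple[float, float]] = None

    def __enter__(self) -> "CpuTimeProbe":
        self._start = process_cpu_times()
        return self

    def __exit__(self, *exc_info) -> bool:
        end = process_cpu_times()
        if self._start is None or end is None or self.iterations <= 0:
            self.average = None
        else:
            # The samples are plain tuples, consumed positionally.
            (user0, kernel0), (user1, kernel1) = self._start, end
            self.average = (
                (user1 - user0) / self.iterations,
                (kernel1 - kernel0) / self.iterations,
            )
        return False  # never swallow exceptions from the wrapped loop
```

Because the two samples bracket the entire loop, the coarse tick resolution of `os.times()` is spread over all iterations instead of affecting each measurement individually.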

## Motivation

Switch to dispatcher profiler for ck tile conv.

## Technical Details

- Switch to dispatcher profiler for ck tile conv.
- Drop profiler for experimental codegen
- Minor fixes for bwd data printing
- Minor fixes for 3d conv in dispatcher codegen

## Test Plan

test_grouped_conv*tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>

)

## Motivation

The Batchnorm backward activation integration test for the mlops engine in the hip-kernel-provider only tested the NCHW layout. However, the engine also supports the NHWC, NCDHW, and NDHWC layouts for this operation, and these are covered by the engine's other batchnorm integration tests.

## Technical Details

This patch expands the testing to cover this gap.

## Test Plan

New tests are added to `hip_kernel_provider_integration_tests`.

## Test Result

Tests pass on MI210.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…OCm#8626) ## Motivation Update subtile heuristic to only restrict based on K<512.

[CI] Run hipSPARSELt when hipBLASLt subtree changes (ROCm#7514) Map projects/hipblaslt to also activate the sparselt optional matrix project so hipSPARSELt builds and tests in the blas TheRock job. Stop mutating module-level state in collect_projects_to_run (per-call deep copies) and update therock_matrix tests accordingly. Tracking: ROCM-25320 Fixes ROCm#7519 Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Yanyao Wang <yanywang@amd.com>
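The "per-call deep copies" fix mentioned above addresses a common pitfall: a function that edits a module-level table leaks its changes into later calls. A minimal sketch of the pattern, assuming a hypothetical `PROJECT_MAP` table and a simplified function signature (neither is taken from the repository's CI scripts):

```python
import copy

# Hypothetical module-level table; the real mapping in the CI scripts may differ.
PROJECT_MAP = {"blas": {"projects_to_test": ["hipblaslt", "rocblas"]}}


def collect_projects_to_run(changed_subtrees):
    """Return the projects to run without mutating the module-level table."""
    # Work on a per-call deep copy so nested lists are not shared between calls.
    project_map = copy.deepcopy(PROJECT_MAP)
    if "projects/hipblaslt" in changed_subtrees:
        # Assumed rule for illustration: a hipBLASLt change also activates sparselt.
        project_map["blas"]["projects_to_test"].append("hipsparselt")
    return project_map
```

A shallow `dict(PROJECT_MAP)` would not be enough here, because the nested `projects_to_test` list would still be shared with the module-level table.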

## Summary

Enable the first multi-threaded / multi-stream nightly test for `hipblaslt-test`. Until now every test ran single-threaded (`threads:0, streams:0`), which left the concurrent-usage paths that real customers hit (JAX/XLA, MaxText, etc.) uncovered in CI. This PR turns on `{ threads: 3, streams: 3 }` in the shared `common_threads_streams` anchor and adds one nightly matmul entry (`matmul_multithread_stream`) that consumes it on gfx942 / gfx950.

This is purely a test-data change. The test harness already exposes `threads` / `streams` arguments via `RUN_TEST_ON_THREADS_STREAMS` in `clients/common/include/hipblaslt_test.hpp`; the underlying concurrency correctness fixes landed in ROCm#5597 (`hipStreamSynchronize`, `hipMemcpyAsync`, mutex on the `memory_pool` singleton).

## Work tracking

- Task (this PR): https://amd-hub.atlassian.net/browse/AIHPBLAS-1023
- Parent Epic: https://amd-hub.atlassian.net/browse/AIHPBLAS-1003 ("hipBLASLt Testing & Quality Improvements - Prevent Production Crashes")
- Companion harness fix (merged): ROCm#5597 (AIHPBLAS-1431)
- Plan / failure log: https://amd.atlassian.net/wiki/spaces/MLSE/pages/1524793988 (AMD-internal)
- Customer-visible motivators (AMD-internal): SWDEV-576540 (JAX/XLA matmul segfault), SWDEV-565755 (JAX training docker blocker), AIXLA-208 (JAX free() segfault in ROCm 7.2)

## Tests

**Paths**

- `projects/hipblaslt/clients/tests/data/hipblaslt_common.yaml`: uncomments `{ threads: 3, streams: 3 }` in the `&common_threads_streams` anchor. No existing test on `develop` references this anchor, so the only behavioral effect is on the new entry below.
- `projects/hipblaslt/clients/tests/data/matmul_gtest.yaml`: adds `matmul_multithread_stream` (nightly, gfx942/gfx950), mirroring `matmul_medium` with `threads_streams: *common_threads_streams` added.

**Run**

```
./hipblaslt-test --gtest_filter='*matmul_multithread_stream*'
```

**Results (post-ROCm#5597, threads:3 / streams:3)**

- Targeted suite: 384 tests, ~32s, all PASSED.
- Full nightly (`*matmul_test*`) on gfx942: 22,435 tests, ~51 min, all PASSED.

**Why `threads:3, streams:3`**

Matches the rocBLAS coverage convention. The `{ threads: 4, streams: 4 }` and `{ threads: 5, streams: 5 }` rows are intentionally left commented because they currently surface unresolved failures (`matmul_bad_arg`, several `matmul_heuristic_all_solutions` variants). Those failures need separate tracking before they can be enabled; see PR comment.

## Flags / guardrails

- [x] N/A: no product code changed; new coverage is nightly only and limited to `gpu_arch: '9(42|50)'`.

## Adjacent tests considered

- Other operations (groupedgemm, ext-API, smoke, rocroller): deliberately out of scope for this first enable; can land incrementally under AIHPBLAS-1023.
- Multi-device (`devices > 1`): out of scope here; depends on multi-GPU CI capacity called out in AIHPBLAS-1003.
- Higher concurrency (`threads/streams: 4` and `5`): deferred behind known failures, see Tests section.

## Risk acceptance

- [x] STANDARD: test-only, nightly category, gfx942/gfx950 only, no flag needed.

## Submission Checklist

- [x] Looked over contributing guidelines (https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests).

---------

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
Co-authored-by: Tony Davis <tony.davis@amd.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

## Motivation

Selecting the tile size for attention algorithms.

## Technical Details

Origami model for FA2.

## Test Plan

Added additional tests for the attention model.

## Test Result

All tests are passing

## Selection accuracy: 66.6% <= 8K problem sizes and 100% > 8K problem sizes

```
q_seq_len , kv_seq_len , head_dim , q_heads , kv_heads , batch , dtype , selected_block_m , selected_block_n
32        , 32         , 128      , 32      , 32       , 1     , bf16  , 32               , 32
64        , 64         , 128      , 32      , 32       , 1     , bf16  , 32               , 64
128       , 128        , 128      , 32      , 32       , 1     , bf16  , 64               , 64
256       , 256        , 128      , 32      , 32       , 1     , bf16  , 128              , 64
512       , 512        , 128      , 32      , 32       , 1     , bf16  , 128              , 64
1024      , 1024       , 128      , 32      , 32       , 1     , bf16  , 128              , 64
2048      , 2048       , 128      , 32      , 32       , 1     , bf16  , 128              , 64
4096      , 4096       , 128      , 32      , 32       , 1     , bf16  , 128              , 64
8192      , 8192       , 128      , 32      , 32       , 1     , bf16  , 128              , 64
```

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Ryan Swann <109695074+ryanswann-amd@users.noreply.github.com>

rocRoller was tested on every BLAS-group change (hipBLASLt, rocBLAS, etc.) because shared/rocroller mapped to the blas group and rocroller was in its projects_to_test. hipBLASLt no longer relies on rocRoller kernels, so this only added flaky/timeout-prone CI with no signal. Route shared/rocroller to its own rocroller group via additional_options with project_to_add="blas": rocRoller is now tested only when its own subtree changes, still builds under the BLAS umbrella, and merges into the blas job when a PR touches both. Also drop shared/rocroller from the "project: hipblaslt" labeler entry to match the new separation. Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Yanyao Wang <yanywang@amd.com>

ROCm#8205)

## Motivation

The goal of this PR is to standardize the precision support reference page format across all components, while also reducing the maintenance burden of manually updating the YAML data file behind https://rocm.docs.amd.com/en/latest/reference/precision-support.html

## Technical Details

- Each component maintains its own YAML file, which will eventually be used in https://rocm.docs.amd.com/en/latest/reference/precision-support.html
- A new precision support reference page is introduced. It does not override the existing data type/precision support content; it serves as the overview/summary that will be linked from the ROCm reference page.

## Test Plan

- Built locally, viewed each component manually

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation

The existing integration test harness hard-coded a single verification path (CPU reference to golden comparison) with the load/compare logic scattered across `SetUp()`. This made it difficult to add new verification modes (GPU reference, engine-under-test) or onboard non-golden bundle formats.

ALMIOPEN-1968 restructures the harness into a generic runner that separates discovery, loading, and verification so all three executor modes (CPU ref, GPU ref, engine) can share the same bundle infrastructure.

## Key changes

- **Bundle discovery** (`BundleDiscovery.hpp`): walks an arbitrary directory tree, supports both the tiered golden layout and flat customer drops under one root
- **`IntegrationTestBundle`** struct + `loadIntegrationTestBundle()`: consolidates the three-step pattern (`loadBundleMetadata` + `loadGraphAndTensors` + `extractAndClearOutputTensorData`) into a single call
- **Pre-load checks** (`graphJsonParses`, `tensorDataPresent`): disambiguate malformed bundles (FAIL) from missing DVC data (SKIP), since `loadGraphAndTensors()` throws for both
- **Rich failure reports**: tensor UID, shape, dtype, max abs/rel error vs tolerance, worst-element index with expected/actual values, and mismatch count
- **Golden to generic rename**: `GoldenBundleDiscovery.hpp` to `BundleDiscovery.hpp`, `GoldenBundleRegistration.hpp` to `BundleRegistration.hpp`, `GoldenBundleLoadCheck.hpp` to `BundleLoadCheck.hpp`, and all corresponding symbols; the infrastructure is bundle-generic, not golden-specific
- **Metadata guards** (`applyMetadataGuards`): VRAM requirement and arch compatibility checks gate expensive tensor loads with early SKIP

## Testing

- 189 unit tests run with no failures (169 green, 20 skipped for no GPU device)
- `loadIntegrationTestBundle()` has direct unit tests: `LoadIntegrationTestBundlePopulatesAllFields` and `LoadIntegrationTestBundleThrowsOnMissingBin`
- Discovery, load-check, and verification path tests all updated and passing

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
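To illustrate the kind of information a rich failure report carries, here is a small sketch that computes the max abs/rel error, the worst-element index with its expected/actual values, and the mismatch count. It is not the harness's C++ implementation; the `MismatchReport` type, the `compare` function, and the tolerance rule (`abs_err > atol + rtol * |expected|`) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MismatchReport:
    max_abs_error: float
    max_rel_error: float
    worst_index: int      # index of the element with the largest absolute error
    expected: float       # expected value at worst_index
    actual: float         # actual value at worst_index
    mismatch_count: int   # number of elements outside tolerance


def compare(expected: Sequence[float], actual: Sequence[float],
            atol: float, rtol: float) -> MismatchReport:
    """Compare two flat tensors element-wise and summarize the differences."""
    if len(expected) != len(actual) or not expected:
        raise ValueError("tensors must be non-empty and of equal length")
    max_abs = max_rel = 0.0
    worst = 0
    mismatches = 0
    for i, (e, a) in enumerate(zip(expected, actual)):
        abs_err = abs(a - e)
        # Relative error is undefined for e == 0; treat it as 0 here (assumption).
        rel_err = abs_err / abs(e) if e != 0 else 0.0
        if abs_err > max_abs:
            max_abs, worst = abs_err, i
        max_rel = max(max_rel, rel_err)
        if abs_err > atol + rtol * abs(e):  # assumed tolerance rule
            mismatches += 1
    return MismatchReport(max_abs, max_rel, worst,
                          expected[worst], actual[worst], mismatches)
```

Reporting the worst element alongside the aggregate errors makes it possible to tell a single outlier from a systematic offset without re-running the test.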

## Motivation

Fix int8 GEMMs crashing with INTERNAL_ERROR when alpha is 1065353216 (the int32 bit pattern of float 1.0), and remove duplicated alpha/beta classification logic that let two code paths disagree.

## Technical Details

Promote get_scalar_value_from_void_ptr to a shared helper so ConstructTensileProblem and updateTensileProblem both classify alpha/beta from the true storage type (alphaBetaType) instead of the matrix type / a magnitude double; also delete the dead get_alpha_beta_target_type.

## Test Plan

Added regression test matmul_gemm_i8_dst_i32_alpha_float_bits (alpha = int32 bits of float 1.0 = 1065353216) on gfx942/gfx950; ran the complex and i8_dst_i32 suites.

## Test Result

New test passes (4 cases); complex (2314) and i8_dst_i32 (188) suites pass. Kernel selection succeeds for the int32 alpha that previously crashed.

---------

Co-authored-by: Naveen Kumar Elumalai <nelumala@ctr2-alola-ctrl-01.amd.com>
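The value 1065353216 is special because it is `0x3F800000`, the IEEE 754 single-precision encoding of 1.0. The same four bytes therefore read as either an ordinary int32 or as float 1.0, depending on which storage type the reader assumes. The sketch below demonstrates that ambiguity with Python's `struct` module; `scalar_from_bytes` and its `storage_type` strings are hypothetical stand-ins and do not mirror the signature of `get_scalar_value_from_void_ptr`.

```python
import struct


def float_bits_as_int32(value: float) -> int:
    """Reinterpret the bytes of a 32-bit float as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


def scalar_from_bytes(raw: bytes, storage_type: str):
    """Decode a 4-byte scalar according to its declared storage type.

    Hypothetical helper: the storage type, not the magnitude of the value,
    decides how the bytes are interpreted.
    """
    formats = {"f32": "<f", "i32": "<i"}
    return struct.unpack(formats[storage_type], raw)[0]


raw_alpha = struct.pack("<i", 1065353216)
print(float_bits_as_int32(1.0))             # 1065353216
print(scalar_from_bytes(raw_alpha, "i32"))  # 1065353216
print(scalar_from_bytes(raw_alpha, "f32"))  # 1.0
```

Classifying the scalar from its declared storage type removes the guesswork: the bytes are decoded one way only, so two code paths cannot reach different conclusions about the same alpha.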

…k-sparsity-merge

# Conflicts:
#   projects/composablekernel/example/ck_tile/01_fmha/example_fmha_fwd.cpp
#   projects/composablekernel/example/ck_tile/01_fmha/fmha_fwd_runner.hpp
#   projects/composablekernel/test/ck_tile/fmha/test_fmha_fwd.cpp

jaopaulolc and others added 30 commits June 12, 2026 19:39

[hipBLASLt] Fix occupancy calculation in TensileLite (ROCm#8197)

855144f

[rocsparse] Reduce pre_checkin test times (ROCm#8321)

53d28e2

## Motivation

Some rocSPARSE pre_checkin tests significantly exceeded our time constraints. This PR reduces the pre_checkin test times for the routines bsrsm, bsrgeam, bsrilu0, spsm_csr, spsm_coo, spmm_csr, and gtsv.

[hipDNN][dnn-providers] Allow current tool version when different fro…

6c1e626

…m expected (ROCm#8161)

[hipBLASLt] Revert "Reduce CachingLibrary map lookup/write overhead (R…

990fefd

…OCm#7754)" (ROCm#8356)

This reverts commit 4807f77.

## Motivation

The original changes require a refactor before they can be reintroduced, along with matching unit tests.

ROCM-25647

brockhargreaves-amd and others added 27 commits June 18, 2026 15:14

[hipblaslt] Add v_cvt_scalef32 pk8 FP8/BF8 instruction support (inclu…

e8b7b1e

…ding SR variant)

[hipblaslt] Fix skipRearrangement for complex types

148d936

Complex acc VGPRs interleave real/imag, so the relative offset elementSumIdx[i]-elementSumIdx[0] cannot locate the imag half. Disable skipRearrangement for complex to use the correct reorder path.

[origami] Update subtile heuristic to only restrict based on K<512. (R…

0751e6f

…OCm#8626) ## Motivation Update subtile heuristic to only restrict based on K<512.

github-actions Bot added project: none github actions labels Jun 19, 2026

poyenc wants to merge 5456 commits into goldcoderZ:meta/fmha-fwd-block-sparsity from poyenc:meta/fmha-fwd-block-sparsity-merge

poyenc commented Jun 19, 2026


20 participants
