Update guided matmul example #2762

Open
chuanyuf wants to merge 2 commits into main from update_guided_matmul_example

@chuanyuf

Contributor

Existing Sample Changes

Description

Guided debugging sample updates for oneAPI 2026.0 tools and drivers.
Raise the minimum CMake version to 3.5, update the README documents, and free USM allocations after use.

No new functions are added in this PR.

Fixes Issue#

External Dependencies

List any external dependencies created as a result of this change.

Type of change

Please delete options that are not relevant. Add a 'X' to the one that is applicable.

  • [x] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Implement fixes for ONSAM Jiras

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • [x] Command Line
  • oneapi-cli
  • Visual Studio
  • Eclipse IDE
  • VSCode
  • When compiling, the compiler flags "-Wall -Wformat-security -Werror=format-security" were used

chuanyuf added 2 commits June 25, 2026 07:47
2025.3 Guided Debugging Sample updates
Updates for oneAPI 2026.0 tools and drivers

Copilot AI left a comment

Pull request overview

This PR updates the guided matrix-multiplication debugging samples for Intel® oneAPI 2026.0 by refreshing documentation/tooling requirements, bumping the CMake minimum version, and adding USM deallocation to reduce leaks in the sample code.

Changes:

  • Add sycl::free(...) cleanup for USM allocations across multiple guided samples.
  • Update sample READMEs for oneAPI 2026.0 tool/driver versions and revise guided-debug instructions/output snippets.
  • Bump cmake_minimum_required from 3.4 to 3.5 for the affected samples.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/2_matrix_mul.cpp | Adds USM frees after q.wait() for the “working” variant. |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/1_matrix_mul_SLM_size.cpp | Adds USM frees after q.wait() for the SLM-size failure variant. |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/README.md | Updates prerequisites, error messages, and guided-debug narrative for newer runtimes/tools. |
| Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/README.md | Updates prerequisites and guided-debug instructions/output. |
| Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/src/2_matrix_mul.cpp | Adds device enumeration output and USM frees. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/src/1_matrix_mul_invalid_contexts.cpp | Adds device enumeration/output, device-specific queue selection, and conditional free behavior for the tutorial. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/README.md | Significant refresh of tutorial steps, additional scenarios (ASAN/bonus), and updated prerequisites. |
| Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/3_matrix_mul.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/2_matrix_mul_multi_offload.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/1_matrix_mul_null_pointer.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/README.md | Updates prerequisites and guided-debug narrative for newer runtime behavior. |
| Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/CMakeLists.txt | Raises CMake minimum version to 3.5. |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/b2_matrix_mul_usm.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/b1_matrix_mul_null_usm.cpp | Adds USM frees after q.wait(). |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/README.md | Expands tutorial with device-side AddressSanitizer guidance and updates prerequisites. |
| Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/CMakeLists.txt | Raises CMake minimum version to 3.5. |

Comment on lines +95 to +99
#ifdef BAD_FREE
device selected_device = devices[0];
#else
device selected_device = devices[1];
#endif
Comment on lines +81 to 83
// Be very specific about the device to use.
queue q(devices[0]);

Comment on lines 66 to 81
property_list propList = property_list{property::queue::enable_profiling()};

std::vector<sycl::device> devices = sycl::device::get_devices();
cout << "Devices:" << std::endl;

for (size_t index = 0; index < devices.size(); index++) {
    std::string device_name = devices[index].get_info<sycl::info::device::name>();
    std::string device_driver = devices[index].get_info<sycl::info::device::driver_version>();
    std::string sycl_version = devices[index].get_info<sycl::info::device::version>();
    std::string vendor = devices[index].get_info<sycl::info::device::vendor>();
    std::string backend = devices[index].get_info<sycl::info::device::backend_version>();
    std::cout << " [" << index << "] " << device_name << ", " << sycl_version << " [" << device_driver
              << "] " << backend << ", " << vendor << std::endl;
}

queue q(default_selector_v);

q.wait();

sycl::free(dev_a, q);
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [25.18.33578]
[opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
```
> **Note:** If you have only one `[level_zero:gpu]` device listed, or the order is different from the above, the main example below may not work. Try to follow through anyway, and then try the bonus sample at the end of this document, which should work regardless of system configuration.
### Identify the Problem without Code Inspection

You must have already built the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool). Once you have built the utility, you can start it before your program (similar to using GDB).
You need to build the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool) before completing this section. Once you have built the utility, you can start it before your program (similar to using GDB).
101 queue q2(devicecontext, selected_device);
102 float * dev_c = sycl::malloc_device<float>(M*P, q2);
```
As is hopefully obvious from the previous example, the problem is that we are trying to free memory allocated in SYCL queue `q2`, which has a different device context from SYCL queue `q`, even though under the covers they point to the same hardware device.
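
The ownership rule described above can be illustrated with a toy model in plain C++. This is only a sketch and deliberately does not use SYCL: `ToyContext`, `allocate`, `release`, and `live_allocations` are hypothetical names invented for illustration, standing in for a context that tracks its own allocations. A real SYCL runtime is not guaranteed to report a mismatched context as cleanly as this model does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-in for a device context: it records the pointers it handed out.
class ToyContext {
public:
    // Allocate memory and remember that this context owns it.
    void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc();
        owned_[p] = bytes;
        return p;
    }

    // Release memory, but only if this context is the one that allocated it.
    void release(void* p) {
        auto it = owned_.find(p);
        if (it == owned_.end()) {
            throw std::invalid_argument("pointer was not allocated in this context");
        }
        std::free(p);
        owned_.erase(it);
    }

    // Number of allocations this context still owns.
    std::size_t live_allocations() const { return owned_.size(); }

private:
    std::unordered_map<void*, std::size_t> owned_;
};
```

In this model, two `ToyContext` objects play the roles of the contexts behind `q` and `q2`: memory obtained from one and released through the other is rejected, and the allocation stays live until it is released through its owner.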
```

Similarly, we specify targeting the CPU, which sometimes can avoid problems in your code that are specific to offloading to the GPU.
Similarly, we can force the program to run on the CPU, which sometimes can avoid problems in your code that are specific to offloading to the GPU.
#### Debugging the Problem

Why did we try with multiple backends? If one had shown correct or incorrect results, and one had crashed, we might be facing a race condition that only occasionally manifests as something that goes terribly wrong. Or one of the backends might have a bug. But here all three crash, so it's likely the program is doing something illegal to memory. The host CPU is a particularly good place to test for illegal memory accesses, because the CPU never allows pointers with an address within a few kilobytes of address 0x0, while this may be legally allocated memory on the GPU.
Why did we try with multiple backends? If one had shown correct or incorrect results, and one had crashed, we might be facing a race condition that only occasionally manifests when something goes terribly wrong. Or one of the backends might have a bug while the others do not. But here all three crash, so it's likely the program is doing something illegal to memory. The host CPU is a particularly good place to test for illegal memory accesses, because the CPU never allows pointers with an address within a few kilobytes of address `0x0`, while this may be legally allocated memory on the GPU.
```

We used the form of `parallel_for` that takes the `nd_range`, which specifies the global iteration range (163850) and the local work-group size (10) like so: `nd_range<1>{{163850}, {10}}`. The first line above shows the workgroup size (`groupSizeX = 10 groupSizeY = 1 groupSizeZ = 1`), and the second shows how many total workgroups will be needed to process the global iteration range (`{16385, 1, 1}`).
At line 106 we used the form of `parallel_for` that takes the `nd_range`, which specifies the global iteration range (163850) and the local work-group size (10) like so: `nd_range<1>{{163850}, {10}}`. The first line above shows the workgroup size (`groupSizeX = 0xa groupSizeY = 0x1 groupSizeZ = 0x1`), and the second shows how many total workgroups will be needed to process the global iteration range (`{16385, 1, 1}`).
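
The arithmetic behind those two trace lines can be checked with a small plain-C++ sketch. The helper name `work_group_count` is hypothetical (it is not part of SYCL or of the sample); it assumes, as `nd_range` requires, that the global size is an exact multiple of the local size.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: number of work-groups along one dimension of an nd_range.
// The global size must be a multiple of a non-zero local size.
constexpr std::size_t work_group_count(std::size_t global_size, std::size_t local_size) {
    if (local_size == 0 || global_size % local_size != 0) {
        throw std::invalid_argument("local size must be non-zero and divide the global size");
    }
    return global_size / local_size;
}

// 163850 work-items in groups of 10 give the 16385 groups seen in the trace.
static_assert(work_group_count(163850, 10) == 16385, "group count matches the trace");

// The newer trace prints the group size in hexadecimal: 0xa is decimal 10.
static_assert(0xa == 10, "0xa equals 10");
```

Both checks are evaluated at compile time, so the file compiling at all confirms that `{16385, 1, 1}` and `groupSizeX = 0xa` are consistent with `nd_range<1>{{163850}, {10}}`.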