Skip to content

Conversation

@SanityRemnants
Copy link
Contributor

Added -fno-reciprocal-math to SYCL_KERNEL_OPTIONS to eliminate issues with division of the number by itself returning non 1 output which caused: #1895.

Copilot AI review requested due to automatic review settings November 26, 2025 08:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a division accuracy issue in SYCL kernels where dividing a number by itself could return non-1 results. The fix adds the -fno-reciprocal-math compiler flag to disable reciprocal math optimizations that can introduce numerical inaccuracies.

Key Changes:

  • Added -fno-reciprocal-math to SYCL_KERNEL_OPTIONS to prevent compiler optimizations that replace division operations with multiplication by reciprocals

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@riverliuintel
Copy link
Contributor

@CuiYifeng

@SanityRemnants
Copy link
Contributor Author

Currently evaluating the use of a more localized approach using #pragma clang fp reciprocal(off) to minimize the impact on performance. Do not merge for now.

# gcc -shared host.o kernel.o device-code.o -o libxxx.so
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -fno-sycl-unnamed-lambda)
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -sycl-std=2020)
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -fno-reciprocal-math)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's study the NVCC behaviors first. I think reciprocal-math is a general optimization, and most the compilers enable it by default.

Copy link
Contributor Author

@SanityRemnants SanityRemnants Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EikanWang I believe --prec-div https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=reciprocal#prec-div-true-false-prec-div is the corresponding NVCC flag by default set to True unless --use_fast_math (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=reciprocal#use-fast-math-use-fast-math) is set to True. I do not believe pytorch is built with --use_fast_math correct me if I'm wrong.

@SanityRemnants
Copy link
Contributor Author

SanityRemnants commented Dec 1, 2025

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags) with the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth it to sacrifice some performance for accuracy in this oddly specific use case (calling div first and than trunc instead of div_trunc) for which we have no known business need outside the failing UT. Both the div and div_trunc results are within needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants