Preheat host memory for OpenCL read_host_from_device #21134

Merged
jenshannoschwalm merged 1 commit into darktable-org:master from jenshannoschwalm:host_preheating on May 31, 2026

Conversation

@jenshannoschwalm

Collaborator

As some systems are very slow in clEnqueueReadImage() if the host memory is still cold, we use the Linux-specific madvise() on the host memory beforehand.

  1. There are logs reporting errors in madvise(), and additional ones when in -d verbose -d opencl debugging mode.
  2. The preheating feature is currently enabled via the hidden opencl_preheated conf.
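
A minimal sketch of what preheating host memory with madvise() could look like, assuming Linux and glibc. The helper name preheat_host_memory and the choice of advice values (MADV_POPULATE_WRITE with a MADV_WILLNEED fallback) are assumptions for illustration; this thread does not show which hint the PR actually passes.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes madvise() in glibc; must precede all includes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not the PR's code: ask the kernel to make the pages
 * backing [host, host + size) resident before a blocking device read.
 * Returns 0 on success, -1 if every hint was rejected. */
static int preheat_host_memory(void *host, size_t size)
{
  if(host == NULL || size == 0) return 0;

  const long psz = sysconf(_SC_PAGESIZE);
  if(psz <= 0) return -1;
  const uintptr_t page = (uintptr_t)psz;

  /* madvise() needs a page-aligned start address: round the start down and
   * extend the length so the range still covers the whole buffer. */
  const uintptr_t start = (uintptr_t)host & ~(page - 1);
  const size_t length = (size_t)(((uintptr_t)host + size) - start);
  assert(length >= size);

  int rc = -1;
#ifdef MADV_POPULATE_WRITE
  /* Linux >= 5.14: fault the pages in as if each one had been written to. */
  rc = madvise((void *)start, length, MADV_POPULATE_WRITE);
#endif
  /* Older kernels or headers: fall back to the weaker "will need" hint. */
  if(rc != 0) rc = madvise((void *)start, length, MADV_WILLNEED);

  if(rc != 0)
  {
    fprintf(stderr, "preheat: madvise(%p, %zu) failed: %s\n",
            (void *)start, length, strerror(errno));
    return -1;
  }
  return 0;
}
```

A caller would invoke such a helper on the destination buffer right before the blocking read, and treat a failure as non-fatal, since the read itself still works on cold memory.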

@dllu as you did the initial work on #21069 would you be able to test this?

@karolherbst would you mind reviewing? I am a bit worried about the #define __USE_MISC I had to add here; otherwise only posix_madvise() was available (which has less precise hints).
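
For context on the macro concern: in glibc, __USE_MISC is an internal macro that features.h derives from the public feature test macro _DEFAULT_SOURCE (glibc 2.19 and later), so defining _DEFAULT_SOURCE before the first include is the documented way to get madvise(). The sketch below assumes glibc on Linux; the wrapper name advise_will_need is hypothetical and is not taken from the PR.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* glibc >= 2.19: exposes madvise() and MADV_* */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: returns 0 on success, or an errno-style code.
 * The address must be page-aligned, as madvise() requires. */
static int advise_will_need(void *page_aligned_addr, size_t len)
{
  assert(page_aligned_addr != NULL);
#if defined(MADV_WILLNEED)
  /* madvise() reports failure as -1 with the reason in errno. */
  return madvise(page_aligned_addr, len, MADV_WILLNEED) == 0 ? 0 : errno;
#else
  /* posix_madvise() returns the error number directly. */
  return posix_madvise(page_aligned_addr, len, POSIX_MADV_WILLNEED);
#endif
}
```

The #ifndef guard avoids a redefinition warning when the build system already passes -D_DEFAULT_SOURCE.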

I tested here on Strix Halo / Fedora 44, so lots of unified mem. No problems or instability; I could not see performance changes for either rusticl or ROCm.

@dllu

dllu commented May 25, 2026

Seems good on my GB10. This also seems faster than my PR #21069 although they aren't rebased on the same parent commit so the benchmark runs aren't directly comparable.

Using the 61MP benchmark from https://darktable.info/performance/benchmarks-beispiele/benchmark/:

NVIDIA driver version 580.159.03

Builds:

| Label | Revision | Version | Extra runtime config |
| --- | --- | --- | --- |
| pr21134-base-gcc13-fast10-notebook | e6d95e3808f31ef4277f11af92a197587847ebca | 5.5.0+1404~ge6d95e3808-dirty | none |
| pr21134-head-preheated-gcc13-fast10-notebook | a589317b9168073f0960e0f878934195216fe220 | 5.5.0+1405~ga589317b91-dirty | --conf opencl_preheated=true |

Run controls:

  • BENCH_RESOURCELEVEL=default, BENCH_MODES=opencl.
  • OMP_NUM_THREADS=10, OMP_PROC_BIND=close,
    OMP_PLACES={5},{6},{7},{8},{9},{15},{16},{17},{18},{19}.
  • BENCH_CPUSET=5-9,15-19.
  • PR HEAD explicitly passes --conf opencl_preheated=true.
  • Logs confirm opencl_scheduling_profile: 'default' and OpenCL compiler
    option -cl-fast-relaxed-math.

Measured pixelpipe times:

| Mode | Label | Trial 1 | Trial 2 | Trial 3 | Mean |
| --- | --- | --- | --- | --- | --- |
| OpenCL fast, default resource | pr21134-base-gcc13-fast10-default | 23.387 s | 22.732 s | 20.457 s | 22.192000 s |
| OpenCL fast, default resource | pr21134-head-preheated-gcc13-fast10-default | 6.107 s | 6.105 s | 6.097 s | 6.103000 s |

OpenCL image readback profiling:

| Label | Trial 1 | Trial 2 | Trial 3 | Mean |
| --- | --- | --- | --- | --- |
| pr21134-base-gcc13-fast10-default | 17.4780 s | 16.7695 s | 14.3167 s | 16.188067 s |
| pr21134-head-preheated-gcc13-fast10-default | 0.0637 s | 0.0599 s | 0.0620 s | 0.061867 s |

OpenCL command queue totals:

| Label | Trial 1 | Trial 2 | Trial 3 | Mean |
| --- | --- | --- | --- | --- |
| pr21134-base-gcc13-fast10-default | 19.1565 s | 18.5736 s | 16.0636 s | 17.931233 s |
| pr21134-head-preheated-gcc13-fast10-default | 1.7648 s | 1.7260 s | 1.7480 s | 1.746267 s |
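
The Mean columns and the implied speedups can be reproduced with a few lines of arithmetic. This is only an illustrative sketch; the function names mean_seconds and speedup are made up here and are not part of any benchmark script mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples, in seconds. */
static double mean_seconds(const double *samples, size_t n)
{
  assert(n > 0);
  double sum = 0.0;
  for(size_t i = 0; i < n; i++) sum += samples[i];
  return sum / (double)n;
}

/* How many times faster the candidate is than the baseline. */
static double speedup(double baseline_s, double candidate_s)
{
  assert(candidate_s > 0.0);
  return baseline_s / candidate_s;
}
```

Fed with the default-resource pixelpipe trials (23.387, 22.732, 20.457 s against 6.107, 6.105, 6.097 s), this gives means of 22.192 s and 6.103 s, a speedup of roughly 3.6x. As noted above, the two builds are not rebased on the same parent commit, so the ratio is indicative rather than exact.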

notebook tiling

By the way, in #21069 (comment) you also mentioned using the "notebook" resource level. But it's a LOT slower.

Run controls:

  • GCC 13.3.0 builds with darktable's git-checkout -Werror injection removed
    in both benchmark worktrees.
  • BENCH_RESOURCELEVEL=notebook, BENCH_MODES=opencl.
  • OMP_NUM_THREADS=10, OMP_PROC_BIND=close,
    OMP_PLACES={5},{6},{7},{8},{9},{15},{16},{17},{18},{19}.
  • BENCH_CPUSET=5-9,15-19.
  • The local release tag available to version generation is release-5.5.0.

Measured pixelpipe times:

| Mode | Label | Trial 1 | Trial 2 | Trial 3 | Mean |
| --- | --- | --- | --- | --- | --- |
| OpenCL fast, notebook | pr21134-base-gcc13-fast10-notebook | 42.257 s | 46.157 s | 45.923 s | 44.779000 s |
| OpenCL fast, notebook | pr21134-head-preheated-gcc13-fast10-notebook | 15.920 s | 15.839 s | 15.560 s | 15.773000 s |

OpenCL image readback profiling:

| Label | Trial 1 | Trial 2 | Trial 3 | Mean |
| --- | --- | --- | --- | --- |
| pr21134-base-gcc13-fast10-notebook | 28.1682 s | 32.0609 s | 31.1227 s | 30.450600 s |
| pr21134-head-preheated-gcc13-fast10-notebook | 0.3928 s | 0.3744 s | 0.3582 s | 0.375133 s |

OpenCL command queue totals:

| Label | Trial 1 | Trial 2 | Trial 3 | Mean |
| --- | --- | --- | --- | --- |
| pr21134-base-gcc13-fast10-notebook | 30.2408 s | 34.1470 s | 33.2703 s | 32.552700 s |
| pr21134-head-preheated-gcc13-fast10-notebook | 2.5552 s | 2.5130 s | 2.4258 s | 2.498000 s |

Anyway, seems good. Ship it!!!!!!!

@jenshannoschwalm jenshannoschwalm added the OpenCL Related to darktable OpenCL code label May 26, 2026
@jenshannoschwalm jenshannoschwalm marked this pull request as draft May 26, 2026 04:52
@jenshannoschwalm

Collaborator Author

Ok, so

  1. We understood the original issue and currently think it's a problem that should be fixed by the vendor
  2. If we really want host mem preheating that should be refactored and made available not locally in OpenCL code but via darktable.h

@jenshannoschwalm jenshannoschwalm added bug: upstream The bug needs a fix outside of the scope of darktable, in an external lib or in a driver Nvidia OpenCL labels May 27, 2026
@dllu

dllu commented May 30, 2026

Even with this branch, I find that using darktable is extremely sluggish on NVIDIA GB10. Even basic operations like changing the exposure take about 15 seconds of "working...".

I have the following settings.

[Two screenshots of the settings were attached here.]

Any ideas what else I need?

@jenshannoschwalm

Collaborator Author

In fact, no. If you provide a "-d opencl -d pipe" log from when you observe this, maybe I can spot something "picky".
BTW, there is also cl_mem buffer reading to non-heated host mem.

@jenshannoschwalm

Collaborator Author

> By the way, in #21069 (comment) you also mentioned using the "notebook" resource level. But it's a LOT slower.

That would be pinned/mapped mode and lots of mem transfers.

@dllu

dllu commented May 31, 2026

Never mind, I was dumb, I didn't turn on the opencl_preheated=true option in my ~/.config/darktable/darktablerc. I didn't realize that you had gated the behavior behind an option. It's snappy now.

Before:

   75.0833 pipe starting             CL0 [full]                                        (2213/491)  2754x3710 sc=1.000; '2026-05-30-11-38-10_DSCF9401_580438e9506597123706c69b20336e7e6476088c.raf' ID=7, nvidiacudanvidiagb10 using 26852MB
    75.0833 modified roi IN               [full]           crop                   2900  (2213/491)  2754x3710 sc=1.000 --> (4665/2290)  2754x3710 sc=1.000; ID=7
    75.0833 modified roi IN               [full]           highlights              500 (4665/2290)  2754x3710 sc=1.000 -->       (0/0) 11648x8736 sc=1.000; ID=7
    75.0833 modified roi IN               [full]           rawprepare              100       (0/0) 11648x8736 sc=1.000 -->       (0/0) 11808x8754 sc=1.000; ID=7
    75.0833 pipe data: full               [full]                                             (0/0) 11808x8754 sc=1.000;
    75.1384 process                   CL0 [full]           rawprepare              100       (0/0) 11808x8754 sc=1.000 -->       (0/0) 11648x8736 sc=1.000; IOP_CS_RAW 820.5MB
    75.2011 process                   CL0 [full]           temperature             300       (0/0) 11648x8736 sc=1.000; IOP_CS_RAW 814.1MB
    75.2339 process                   CL0 [full]           highlights              500       (0/0) 11648x8736 sc=1.000 --> (4665/2290)  2754x3710 sc=1.000; IOP_CS_RAW 1017.6MB
    75.6777 process                   CPU [full]           hotpixels               700 (4665/2290)  2754x3710 sc=1.000; IOP_CS_RAW 82MB
    75.6959 process                   CL0 [full]           demosaic                900 (4665/2290)  2754x3710 sc=1.000; IOP_CS_RAW -> IOP_CS_RGB 327.0MB
    75.8129 process                   CL0 [full]           denoiseprofile         1000 (4665/2290)  2754x3710 sc=1.000; IOP_CS_RGB 1716.5MB
    75.9823 process                   CL0 [full]           exposure               2500 (4665/2290)  2754x3710 sc=1.000; IOP_CS_RGB 327.0MB
    78.6464 importance hints          CL0 [full]           exposure               2500 (4665/2290)  2754x3710 sc=1.000; focus important_in
    78.6464 process                   CL0 [full]           crop                   2900 (4665/2290)  2754x3710 sc=1.000 -->  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB 327.0MB
    78.6582 process                   CL0 [full]           colorin                3300  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB -> IOP_CS_LAB 327.0MB
    78.6582 coeff correction          CL0 [full]           colorin                3300  (2213/491)  2754x3710 sc=1.000; `standard color matrix' 2.009(*1.178) 1.000(*1.000) 1.450(*0.849)
    78.6736 transform colorspace      CL0 [full]           channelmixerrgb        3400  (2213/491)  2754x3710 sc=1.000; IOP_CS_LAB -> IOP_CS_RGB `linear Rec2020 RGB'
    78.6830 process                   CL0 [full]           channelmixerrgb        3400  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB 327.0MB
    78.7004 transform colorspace      CL0 [full]           sharpen                4600  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB -> IOP_CS_LAB `linear Rec2020 RGB'
    78.7052 process                   CL0 [full]           sharpen                4600  (2213/491)  2754x3710 sc=1.000; IOP_CS_LAB 490.4MB
    78.7384 transform colorspace      CL0 [full]           sigmoid                5900  (2213/491)  2754x3710 sc=1.000; IOP_CS_LAB -> IOP_CS_RGB `linear Rec2020 RGB'
    78.7490 process                   CL0 [full]           sigmoid                5900  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB 327.0MB
    78.7663 transform colorspace      CL0 [full]           colorcontrast          7200  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB -> IOP_CS_LAB `linear Rec2020 RGB'
    78.7713 process                   CL0 [full]           colorcontrast          7200  (2213/491)  2754x3710 sc=1.000; IOP_CS_LAB 327.0MB
    78.7859 process                   CL0 [full]           colorout               8600  (2213/491)  2754x3710 sc=1.000; IOP_CS_LAB -> IOP_CS_RGB 327.0MB
    81.5115 importance hints          CL0 [full]           colorout               8600  (2213/491)  2754x3710 sc=1.000; important_in
    84.1841 process                   CPU [full]           gamma                  9300  (2213/491)  2754x3710 sc=1.000; IOP_CS_RGB 327MB
    84.2026 cache report                  [full]                                        64 lines (important=21, used=40, invalid=9). Using 4533MB, limit=3894MB. Hits/run=0.50. Hits/test=0.043
    84.2026 pipe finished             CL0 [full]                                        (2213/491)  2754x3710 sc=1.000; '2026-05-30-11-38-10_DSCF9401_580438e9506597123706c69b20336e7e6476088c.raf' ID=7

After:

    14.0441 pipe starting             CL0 [full]                                         (1538/69)  2754x3710 sc=0.768; '2026-05-30-11-38-10_DSCF9401_580438e9506597123706c69b20336e7e6476088c.raf' ID=7, nvidiacudanvidiagb10 using 26852MB
    14.0441 modified roi IN               [full]           crop                   2900   (1538/69)  2754x3710 sc=0.768 --> (3422/1451)  2754x3710 sc=0.768; ID=7
    14.0441 modified roi IN               [full]           demosaic                900 (3422/1451)  2754x3710 sc=0.768 --> (4453/1888)  3584x4828 sc=1.000; ID=7
    14.0441 modified roi IN               [full]           highlights              500 (4453/1888)  3584x4828 sc=1.000 -->       (0/0) 11648x8736 sc=1.000; ID=7
    14.0441 modified roi IN               [full]           rawprepare              100       (0/0) 11648x8736 sc=1.000 -->       (0/0) 11808x8754 sc=1.000; ID=7
    14.0441 pipe data: full               [full]                                             (0/0) 11808x8754 sc=1.000;
    14.0999 process                   CL0 [full]           rawprepare              100       (0/0) 11808x8754 sc=1.000 -->       (0/0) 11648x8736 sc=1.000; IOP_CS_RAW 820.5MB
    14.1643 process                   CL0 [full]           temperature             300       (0/0) 11648x8736 sc=1.000; IOP_CS_RAW 814.1MB
    14.1965 process                   CL0 [full]           highlights              500       (0/0) 11648x8736 sc=1.000 --> (4453/1888)  3584x4828 sc=1.000; IOP_CS_RAW 1017.6MB
    14.2268 process                   CPU [full]           hotpixels               700 (4453/1888)  3584x4828 sc=1.000; IOP_CS_RAW 138MB
    14.2530 process                   CL0 [full]           demosaic                900 (4453/1888)  3584x4828 sc=1.000 --> (3422/1451)  2754x3710 sc=0.768; IOP_CS_RAW -> IOP_CS_RGB 440.3MB
    14.4501 process                   CL0 [full]           denoiseprofile         1000 (3422/1451)  2754x3710 sc=0.768; IOP_CS_RGB 1716.5MB
    14.6115 process                   CL0 [full]           exposure               2500 (3422/1451)  2754x3710 sc=0.768; IOP_CS_RGB 327.0MB
    14.6584 importance hints          CL0 [full]           exposure               2500 (3422/1451)  2754x3710 sc=0.768; focus important_in
    14.6585 process                   CL0 [full]           crop                   2900 (3422/1451)  2754x3710 sc=0.768 -->   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB 327.0MB
    14.6703 process                   CL0 [full]           colorin                3300   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB -> IOP_CS_LAB 327.0MB
    14.6704 coeff correction          CL0 [full]           colorin                3300   (1538/69)  2754x3710 sc=0.768; `standard color matrix' 2.009(*1.178) 1.000(*1.000) 1.450(*0.849)
    14.6842 transform colorspace      CL0 [full]           channelmixerrgb        3400   (1538/69)  2754x3710 sc=0.768; IOP_CS_LAB -> IOP_CS_RGB `linear Rec2020 RGB'
    14.6887 process                   CL0 [full]           channelmixerrgb        3400   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB 327.0MB
    14.7021 transform colorspace      CL0 [full]           sharpen                4600   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB -> IOP_CS_LAB `linear Rec2020 RGB'
    14.7111 process                   CL0 [full]           sharpen                4600   (1538/69)  2754x3710 sc=0.768; IOP_CS_LAB 490.4MB
    14.7361 transform colorspace      CL0 [full]           sigmoid                5900   (1538/69)  2754x3710 sc=0.768; IOP_CS_LAB -> IOP_CS_RGB `linear Rec2020 RGB'
    14.7452 process                   CL0 [full]           sigmoid                5900   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB 327.0MB
    14.7554 transform colorspace      CL0 [full]           colorcontrast          7200   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB -> IOP_CS_LAB `linear Rec2020 RGB'
    14.7632 process                   CL0 [full]           colorcontrast          7200   (1538/69)  2754x3710 sc=0.768; IOP_CS_LAB 327.0MB
    14.7752 process                   CL0 [full]           colorout               8600   (1538/69)  2754x3710 sc=0.768; IOP_CS_LAB -> IOP_CS_RGB 327.0MB
    14.8274 importance hints          CL0 [full]           colorout               8600   (1538/69)  2754x3710 sc=0.768; important_in
    14.8660 process                   CPU [full]           gamma                  9300   (1538/69)  2754x3710 sc=0.768; IOP_CS_RGB 327MB
    14.8832 cache report                  [full]                                        64 lines (important=12, used=27, invalid=9). Using 3722MB, limit=3894MB. Hits/run=0.00. Hits/test=0.000
    14.8832 pipe finished             CL0 [full]                                         (1538/69)  2754x3710 sc=0.768; '2026-05-30-11-38-10_DSCF9401_580438e9506597123706c69b20336e7e6476088c.raf' ID=7

About 10 seconds --> 0.8 second. This PR does in fact fix the problem and I'll be using it on my DGX Spark from now on.

@jenshannoschwalm

Collaborator Author

Fine. You might still want to check whether buffer reading should use the same preheating. That would be dt_opencl_read_buffer_from_device().

It would be good to know, so we can possibly report back to the NVIDIA folks.
Good candidates to test would be denoise profile or masks using the "details" stuff.

@jenshannoschwalm jenshannoschwalm marked this pull request as ready for review May 31, 2026 18:03
@jenshannoschwalm jenshannoschwalm merged commit 475a9f7 into darktable-org:master May 31, 2026
3 of 5 checks passed
@jenshannoschwalm

Collaborator Author

@TurboGit would you revert this right now please! Nothing for master in this form... I didn't notice I had pushed a merge button and wasn't yet aware of my rights to do so.

@jenshannoschwalm

Collaborator Author

@victoryforce TIA

@victoryforce

Collaborator

@jenshannoschwalm Just above this, in the same line where the information that you "merged commit 475a9f7 into darktable-org:master" is, there is a "Revert" button. You definitely have the rights to do this. Maybe it would be better if you did the revert yourself, so that we don't doubt whether we understood you correctly?

@jenshannoschwalm

Collaborator Author

Did so.
