Where
`xrspatial/geotiff/_gpu_decode.py:1192-1196`, inside `_try_nvcomp_batch_decompress`.
What
The batched host->device upload path computes per-tile offsets via a Python `for` loop:

```python
comp_sizes_list = [len(t) for t in raw_tiles]
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
for i in range(1, n_tiles):
    comp_offsets_h[i] = comp_offsets_h[i - 1] + comp_sizes_list[i - 1]
total_comp = sum(comp_sizes_list)
```
The sibling batched D2H helper `_batched_d2h_to_bytes` at line 924 uses the vectorised form:

```python
offsets = np.empty(len(d_tiles) + 1, dtype=np.int64)
offsets[0] = 0
np.cumsum(sizes, out=offsets[1:])
```
Both helpers compute the same prefix sum; aligning the decompress side keeps the codebase consistent and trims interpreter overhead.
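As a quick illustration (with synthetic sizes standing in for the real `raw_tiles` lengths), the loop and the `cumsum`-into-a-slice form produce identical offsets:

```python
import numpy as np

# Synthetic stand-in for the per-tile compressed sizes.
sizes = np.array([10, 25, 7, 42], dtype=np.int64)
n = len(sizes)

# Loop form, as in _try_nvcomp_batch_decompress today.
offsets_loop = np.zeros(n, dtype=np.int64)
for i in range(1, n):
    offsets_loop[i] = offsets_loop[i - 1] + sizes[i - 1]

# Vectorised form, shaped like the suggested fix below.
offsets_vec = np.zeros(n, dtype=np.int64)
np.cumsum(sizes[:-1], out=offsets_vec[1:])

assert np.array_equal(offsets_loop, offsets_vec)  # both are [0, 10, 35, 42]
```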
Why it matters
Microbench on 1024 tiles with random sizes:
| Method | Time (us) |
| --- | --- |
| Python for loop | 84 |
| `np.cumsum` with `out=` | 21 |
That is a 3.9x speedup, but the absolute saving is only ~60us per nvCOMP decompress call -- low enough that it does not show up as a perf bottleneck. The motivation is therefore consistency: the sibling helper at line 924 and the existing `np.cumsum(comp_sizes, out=offsets[1:])` pattern in `_nvcomp_batch_compress` at line 2572 already use the vectorised form.
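For reference, a minimal harness along these lines reproduces the comparison; the size distribution and `timeit` setup are assumptions, not the original benchmark script:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n_tiles = 1024
sizes = rng.integers(1, 1 << 16, size=n_tiles, dtype=np.int64)  # assumed size range

def loop_offsets():
    offsets = np.zeros(n_tiles, dtype=np.int64)
    for i in range(1, n_tiles):
        offsets[i] = offsets[i - 1] + sizes[i - 1]
    return offsets

def cumsum_offsets():
    offsets = np.zeros(n_tiles, dtype=np.int64)
    np.cumsum(sizes[:-1], out=offsets[1:])
    return offsets

for fn in (loop_offsets, cumsum_offsets):
    per_call = timeit.timeit(fn, number=1000) / 1000
    print(f"{fn.__name__}: {per_call * 1e6:.1f} us")
```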
Suggested fix
```python
comp_sizes_arr = np.fromiter((len(t) for t in raw_tiles), dtype=np.int64, count=n_tiles)
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])
total_comp = int(comp_sizes_arr.sum())
```
This drops the Python loop and the second pass over `comp_sizes_list` for `sum()`.
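One design note on the suggested fix: `np.fromiter` with an explicit `count=` preallocates the output array in one shot, which is why the intermediate Python list can be dropped entirely rather than just wrapped in `np.array(...)`. The `if n_tiles > 1` guard simply skips the cumsum when there is nothing to accumulate.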
Severity
LOW. The fix is a few lines and aligns with the codebase's other batched-transfer helpers; the wall-time delta on a typical nvCOMP decode is below 0.1% of the kernel runtime.