leet_gpu_solution is a growing collection of optimized CUDA implementations for challenges on https://leetgpu.com/challenges.
The project demonstrates step-by-step optimization from basic to optimized solutions, making it a practical learning resource for anyone studying GPU programming, CUDA performance, or Nsight profiling.
| Module | Description | Optimization Techniques | Notable Speedup vs. Baseline | Status |
|---|---|---|---|---|
| `convolution_1d` | Convolution of a 1D vector with a 1D kernel | Shared memory | 1.40× on input_size = 1,500,000, kernel = 2047 | ✅ Implemented |
| `convolution_2d` | Convolution of a 2D matrix with a 2D kernel | Shared memory, tiling, im2col | 52.0× on 3072×3072 image (31×31 kernel) | ✅ Implemented |
| `matrix_multiplication` | 2D matrix × 2D matrix multiplication | Shared-memory tiling, `float4` vectorized I/O, WMMA/Tensor Cores, loop unrolling | 4.08× on M = N = 2048 | ✅ Implemented |
| `matrix_transpose` | Transpose of a 2D matrix | Shared-memory tiling, coalesced memory access, bank-conflict avoidance | 1.23× on M = N = 10,000 | ✅ Implemented |
| `quantized_matrix_multiplication` | INT8 quantized matrix × matrix multiplication | Shared-memory tiling, INT4 packing, QuantizeMultiplier requantization, Tensor Core acceleration | 5.62× on M = N = 2048 | ✅ Implemented |
| `softmax` | Exponential normalization across vectors or matrix rows | Warp-level reduction, online softmax, shared memory | 2.06× on N = 1,048,576 | ✅ Implemented |
| `sparse_matrix_vector_multiplication` | Sparse matrix × dense vector multiplication | CSR / ELL / Block-ELL / Merge-Path formats, merge-path SpMV | 1.2× on total (setup + runtime) latency, 316× on SpMV runtime latency, for M = N = 4096 (60% sparsity) | ✅ Implemented |
| `vector_addition` | Sum of two vectors | `float4` access | 1.03× on 100,000,000 elements | ✅ Implemented |
| `vector_reduction` | Sum reduction over large arrays | Shared memory, warp shuffle, sequential access pattern, loop unrolling | 4.29× on N = 1,073,741,824 | ✅ Implemented |
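As an illustration of two of the techniques listed above for `vector_reduction` (warp shuffle and a sequential grid-stride access pattern), here is a minimal sketch; the kernel and helper names are illustrative, not the repository's actual code:

```cuda
// Hypothetical sketch of a warp-shuffle sum reduction (not the repo's exact kernel).
__inline__ __device__ float warpReduceSum(float val) {
    // Each step adds the value held by the lane `offset` positions above;
    // after log2(32) = 5 steps, lane 0 holds the warp's total.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceSum(const float* in, float* out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: each iteration's loads are coalesced across the warp.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    __shared__ float warpSums[32];          // one slot per warp in the block
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    sum = warpReduceSum(sum);               // reduce within each warp
    if (lane == 0) warpSums[warp] = sum;    // lane 0 publishes the warp total
    __syncthreads();

    // The first warp reduces the per-warp partials, then accumulates globally.
    if (warp == 0) {
        sum = (lane < (blockDim.x + 31) / 32) ? warpSums[lane] : 0.0f;
        sum = warpReduceSum(sum);
        if (lane == 0) atomicAdd(out, sum);
    }
}
```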
NOTE: A more detailed README is provided in each problem's folder.
All latency values in this repository are collected with Nsight Compute (`gpu__time_duration.sum`), which reports pure GPU kernel execution time.
These measurements exclude host-side setup, CUDA context initialization, and driver launch overhead.
Therefore, the reported numbers represent kernel latency (device-only performance), not full end-to-end or cold-start inference latency.
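For reference, a kernel's device-only duration can be collected with the Nsight Compute CLI as shown below; `./app` is a placeholder for one of the problem binaries, not a file in this repository:

```shell
# Report device time (gpu__time_duration.sum) for each kernel launch in ./app.
ncu --metrics gpu__time_duration.sum ./app
```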
- Benchmark `matrix_multiplication` and `quantized_matrix_multiplication` against NVIDIA cuBLAS and CUTLASS kernels.
- Evaluate both FP32 and INT8 performance on identical input sizes.
- Measure kernel-only time and end-to-end latency.
- Benchmark all `spmv` variants (CSR, Vector CSR, ELLPACK, Block-ELL, Merge-Path) against cuSPARSE reference implementations.
- Separate setup cost (memory allocation, temp buffer query) from runtime cost (actual SpMV kernel).
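One way to separate setup cost from runtime cost is to bracket each phase with CUDA events; this is a sketch, and `spmv_setup()` / `spmv_run()` are hypothetical stand-ins for the repository's routines:

```cuda
// Sketch: timing setup and SpMV runtime separately with CUDA events.
cudaEvent_t t0, t1, t2;
cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

cudaEventRecord(t0);
spmv_setup();                  // allocations, format conversion, temp-buffer query
cudaEventRecord(t1);
spmv_run();                    // the actual SpMV kernel launch
cudaEventRecord(t2);
cudaEventSynchronize(t2);

float setup_ms, run_ms;
cudaEventElapsedTime(&setup_ms, t0, t1);   // setup cost
cudaEventElapsedTime(&run_ms,  t1, t2);    // runtime cost
```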
- Current architecture: NVIDIA Ada (SM 89).