A simple and easy-to-use library for GPU computing in Fortran, providing transparent access to GPU acceleration through a clean Fortran interface.
Online API Documentation - Complete API reference generated with Doxygen
Simple GPU is a library designed to simplify GPU computing in Fortran applications. It provides:
- Dual implementation: CPU-only version using standard BLAS, and GPU-accelerated version using NVIDIA cuBLAS or AMD hipBLAS
- Transparent interface: Same Fortran API for both CPU and GPU versions
- Memory management: Easy GPU memory allocation and data transfer
- BLAS operations: Common BLAS operations (GEMM, GEMV, DOT, GEAM) for both single and double precision
- Stream support: Asynchronous operations through CUDA or HIP streams
- gpu_allocate: Allocate memory on the GPU (or on the CPU for the CPU version)
- gpu_deallocate: Free allocated memory
- gpu_upload: Transfer data from CPU to GPU
- gpu_download: Transfer data from GPU to CPU
- gpu_copy: Copy data between GPU memory regions
- gpu_ndevices: Query the number of available GPU devices
- gpu_set_device: Select the active GPU device
- gpu_get_memory: Query GPU memory status
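The device-management calls above can be combined into a short startup check. This is a sketch only: the routine names come from this README, but their argument lists are assumptions here, so consult include/simple_gpu.F90 for the actual signatures.

```fortran
#include <simple_gpu.F90>

program device_check
  use gpu
  implicit none
  integer :: ndev
  ! Assumed signatures -- verify against include/simple_gpu.F90.
  call gpu_ndevices(ndev)       ! number of visible devices
  print *, 'GPU devices found:', ndev
  if (ndev > 0) then
    call gpu_set_device(0)      ! make device 0 the active device
  end if
end program device_check
```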
All BLAS operations have variants that accept 64-bit integers for dimensions. These variants have a _64 suffix (e.g., gpu_ddot_64, gpu_dgemm_64).
- gpu_sdot, gpu_ddot: Dot product (single/double precision)
- gpu_sgemv, gpu_dgemv: Matrix-vector multiplication
- gpu_sgemm, gpu_dgemm: Matrix-matrix multiplication
- gpu_sgeam, gpu_dgeam: Matrix addition/transposition
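As a sketch of the 64-bit variants, the fragment below mirrors the gpu_ddot call shown later in this README, assuming the _64 routines take 64-bit integer sizes and increments (check include/simple_gpu.F90 for the exact kinds):

```fortran
use iso_fortran_env, only: int64
integer(int64) :: n64
double precision :: result
! ... x and y allocated with gpu_allocate and uploaded as usual ...
n64 = 3000000000_int64   ! a length that overflows default 32-bit integers
call gpu_ddot_64(handle, n64, x%f(1), 1_int64, y%f(1), 1_int64, result)
```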
- Streams: Create and manage CUDA streams for asynchronous execution
- BLAS handles: Manage cuBLAS library handles
- Stream synchronization: Control execution flow
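A possible usage pattern for streams is sketched below. The routine and type names (gpu_stream, gpu_stream_create, gpu_blas_set_stream, gpu_stream_synchronize, gpu_stream_destroy) are hypothetical placeholders for illustration; the actual names are defined in include/simple_gpu.F90.

```fortran
type(gpu_blas)   :: handle
type(gpu_stream) :: stream                 ! hypothetical stream type
call gpu_stream_create(stream)             ! hypothetical name
call gpu_blas_set_stream(handle, stream)   ! attach the stream to the BLAS handle (hypothetical)
! ... enqueue uploads and BLAS calls; they execute asynchronously on the stream ...
call gpu_stream_synchronize(stream)        ! wait for queued work to finish (hypothetical)
call gpu_stream_destroy(stream)
```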
- C compiler (gcc, clang, etc.)
- Fortran compiler (gfortran, ifort, etc.)
- BLAS library (OpenBLAS, Intel MKL, or reference BLAS)
- Autotools (autoconf, automake, libtool)
- NVIDIA CUDA Toolkit (with nvcc compiler)
- NVIDIA cuBLAS library
- CUDA-capable GPU
- AMD ROCm platform
- hipBLAS library
- ROCm-capable GPU
- Generate the configure script (if building from git):
./autogen.sh
- Configure the build:
./configure
The configure script will automatically detect if CUDA is available and enable the NVIDIA GPU library if possible.
Configuration options:
- --disable-nvidia: Disable the NVIDIA GPU library even if CUDA is available
- --with-cuda=DIR: Specify the CUDA installation directory (default: /usr/local/cuda)
- --disable-amd: Disable the AMD GPU library even if ROCm is available
- --with-rocm=DIR: Specify the ROCm installation directory (default: auto-detect)
- --with-blas=LIB: Specify the BLAS library to use
Example configurations:
# CPU version only
./configure --disable-nvidia --disable-amd

# Specify CUDA location
./configure --with-cuda=/opt/cuda

# Specify ROCm location
./configure --with-rocm=/opt/rocm

# Use a specific BLAS library
./configure --with-blas="-lmkl_rt"
- Build the libraries:
make
- Run the tests (optional):
make check
- Install:
sudo make install
Simple GPU provides three shared libraries:
- libsimple_gpu-cpu.so: CPU-only version
- Uses standard BLAS library
- No GPU required
- Useful for development and testing on systems without GPUs
- libsimple_gpu-nvidia.so: NVIDIA GPU version (if CUDA is available)
- Uses NVIDIA cuBLAS library
- Requires CUDA-capable GPU
- Provides GPU acceleration for supported operations
- libsimple_gpu-amd.so: AMD GPU version (if ROCm is available)
- Uses AMD hipBLAS library
- Requires ROCm-capable GPU
- Provides GPU acceleration for supported operations
All libraries provide the same Fortran interface, allowing seamless switching between CPU and GPU implementations.
Important: To use the Simple GPU library in your Fortran project, you must create a file with the .F90 extension (uppercase) that includes the library header using the C preprocessor:
#include <simple_gpu.F90>

The .F90 extension (uppercase) is critical because:
- Preprocessor support: Fortran compilers only run the C preprocessor on files with uppercase extensions (.F90, .F, .FOR). The lowercase .f90 extension bypasses the preprocessor, so the #include directive won't work.
- Cross-compiler compatibility: The #include <simple_gpu.F90> directive lets the C preprocessor find the header file in a default location (via the CPATH environment variable). This means:
  - When you update the library, no changes are needed in your code
  - The library can be compiled once with one compiler (e.g., gcc/gfortran) and used with any other Fortran compiler (ifort, nvfortran, etc.)
  - No Fortran .mod files are distributed, which are notoriously incompatible between different compilers
- Simplified distribution: By avoiding compiler-specific .mod files, simple_gpu maintains maximum portability across different Fortran compiler ecosystems.
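A compile sketch under the assumption that the header was installed to /usr/local/include (adjust the path to your configure prefix):

```shell
# Make the header visible to the C preprocessor via CPATH
export CPATH=/usr/local/include:$CPATH
# The uppercase .F90 extension makes gfortran run the preprocessor
gfortran -c myapp.F90
```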
Example of a minimal program:
#include <simple_gpu.F90>
program my_program
use gpu
implicit none
! Your code here
end program my_program

The library provides multidimensional array types for both single and double precision:
- gpu_double1: 1-dimensional array of double precision values
- gpu_double2: 2-dimensional array of double precision values
- gpu_double3: 3-dimensional array of double precision values
- gpu_double4, gpu_double5, gpu_double6: 4-, 5-, and 6-dimensional arrays
Similarly for single precision:
gpu_real1 through gpu_real6
Each type contains:
- c: C pointer to GPU memory
- f: Fortran pointer for accessing data (e.g., f(:) for 1D, f(:,:) for 2D)
The gpu_allocate function is overloaded and automatically accepts the appropriate number of dimensions:
call gpu_allocate(x, n) ! 1D array: x is gpu_double1 or gpu_real1
call gpu_allocate(a, m, n) ! 2D array: a is gpu_double2 or gpu_real2
call gpu_allocate(b, l, m, n)  ! 3D array: b is gpu_double3 or gpu_real3

#include <simple_gpu.F90>
program example
use gpu
implicit none
type(gpu_blas) :: handle
type(gpu_double1) :: x, y
double precision, allocatable :: x_h(:), y_h(:)
double precision :: result
integer :: n, i
n = 1000
! Initialize BLAS handle
call gpu_blas_create(handle)
! Allocate vectors (1D arrays)
call gpu_allocate(x, n)
call gpu_allocate(y, n)
! Create and initialize host data
allocate(x_h(n), y_h(n))
do i = 1, n
x_h(i) = dble(i)
y_h(i) = dble(i) * 2.0d0
end do
! Upload to GPU
call gpu_upload(x_h, x)
call gpu_upload(y_h, y)
! Compute dot product on GPU
! Note: Use address of first element (x%f(1), not x)
call gpu_ddot(handle, n, x%f(1), 1, y%f(1), 1, result)
print *, 'Dot product result:', result
! Clean up
deallocate(x_h, y_h)
call gpu_deallocate(x)
call gpu_deallocate(y)
call gpu_blas_destroy(handle)
end program example

#include <simple_gpu.F90>
program example_2d
use gpu
implicit none
type(gpu_blas) :: handle
type(gpu_double2) :: a, b, c
double precision, allocatable :: a_h(:,:), b_h(:,:), c_h(:,:)
double precision :: alpha, beta
integer :: m, n, i, j
m = 100
n = 200
! Initialize BLAS handle
call gpu_blas_create(handle)
! Allocate matrices (2D arrays)
call gpu_allocate(a, m, n)
call gpu_allocate(b, m, n)
call gpu_allocate(c, m, n)
! Create and initialize host data
allocate(a_h(m,n), b_h(m,n), c_h(m,n))
do j = 1, n
do i = 1, m
a_h(i,j) = dble(i + j)
b_h(i,j) = dble(i * j)
end do
end do
! Upload to GPU
call gpu_upload(a_h, a)
call gpu_upload(b_h, b)
! Matrix addition: C = alpha*A + beta*B
alpha = 1.5d0
beta = 0.5d0
! Note: Use address of first element (a%f(1,1), not a)
call gpu_dgeam(handle, 'N', 'N', m, n, &
alpha, a%f(1,1), m, beta, b%f(1,1), m, &
c%f(1,1), m)
! Download result from GPU to host
call gpu_download(c, c_h)
! Access result in c_h(:,:)
print *, 'Result at (1,1):', c_h(1,1)
! Clean up
deallocate(a_h, b_h, c_h)
call gpu_deallocate(a)
call gpu_deallocate(b)
call gpu_deallocate(c)
call gpu_blas_destroy(handle)
end program example_2d

Important Note about BLAS Function Arguments:
When calling BLAS functions, always use the address of the first element of the array (e.g., x%f(1) for 1D arrays or a%f(1,1) for 2D arrays), otherwise you may encounter type errors:
! Correct:
call gpu_ddot(handle, n, x%f(1), 1, y%f(1), 1, result)
! Incorrect (may cause type error):
call gpu_ddot(handle, n, x, 1, y, 1, result)

Technical Note: The library wrappers use Fortran's c_loc() intrinsic to obtain the memory address of the array element. By passing x%f(1), you provide the first element as a scalar with the target attribute, which c_loc() then converts to the appropriate C pointer for the underlying BLAS routines.
To use a specific library version, link your application against the desired library:
# CPU version
gfortran -o myapp myapp.F90 -lsimple_gpu-cpu
# GPU version (NVIDIA)
gfortran -o myapp myapp.F90 -lsimple_gpu-nvidia
# GPU version (AMD)
gfortran -o myapp myapp.F90 -lsimple_gpu-amd

The library includes comprehensive unit tests that compare CPU and GPU implementations to ensure correctness.
Run tests with:
make check

For verbose test output:
make check-verbose

For detailed API documentation, see the comments in:
- include/simple_gpu.h: C interface declarations
- include/simple_gpu.F90: Fortran module and type definitions
- For small problem sizes, CPU version may be faster due to GPU overhead
- GPU version shows significant speedup for larger matrices/vectors
- Use streams for asynchronous operations to overlap computation and data transfer
- Keep data on GPU between operations to minimize transfer overhead
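The last two tips can be illustrated in one sketch: upload once, chain operations on the device, download only the final result. The gpu_dgemm argument order below is an assumption modeled on the standard DGEMM interface; verify it against include/simple_gpu.F90.

```fortran
! Upload inputs once
call gpu_upload(a_h, a)
call gpu_upload(b_h, b)
! Chain two products entirely on the GPU: t = A*B, then c = t*B
! (argument order assumed to follow standard DGEMM; verify in the header)
call gpu_dgemm(handle, 'N', 'N', n, n, n, 1.0d0, a%f(1,1), n, &
               b%f(1,1), n, 0.0d0, t%f(1,1), n)
call gpu_dgemm(handle, 'N', 'N', n, n, n, 1.0d0, t%f(1,1), n, &
               b%f(1,1), n, 0.0d0, c%f(1,1), n)
! Download only the final result
call gpu_download(c, c_h)
```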
If CUDA is installed but not detected:
./configure --with-cuda=/path/to/cuda

Specify the BLAS library explicitly:
./configure --with-blas="-lopenblas"

- Ensure NVIDIA drivers are properly installed (for NVIDIA GPUs)
- Ensure ROCm drivers are properly installed (for AMD GPUs)
- Check that CUDA libraries are in your library path:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Check that ROCm libraries are in your library path:
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
- Verify the GPU is accessible with nvidia-smi (NVIDIA) or rocm-smi (AMD)
Contributions are welcome! Please ensure:
- Code follows existing style and conventions
- All tests pass before submitting
- New features include appropriate tests
The library currently implements a core set of BLAS operations (DOT, GEMV, GEMM, GEAM). If you need additional BLAS functions (such as AXPY, SCAL, TRSM, SYRK, etc.), contributions are highly encouraged! Adding new BLAS routines follows the established pattern in the codebase and helps make the library more complete and useful for the community.
The project uses Doxygen to generate API documentation from source code comments.
- Doxygen (version 1.9 or later)
- Graphviz (for generating diagrams)
# Install dependencies (Ubuntu/Debian)
sudo apt-get install doxygen graphviz
# Generate documentation
doxygen Doxyfile
# View documentation
# Open docs/html/index.html in your web browser

The documentation is automatically built and published to GitHub Pages when changes are pushed to the main branch.
See LICENSE file for details.