Template Numerical Library version main:b4a9fa1
The following sections document design decisions of the TNL library. They are relevant to TNL developers rather than users.
TNL targets both NVIDIA (CUDA) and AMD (HIP/ROCm) GPUs through shared GPU source files. The following conventions ensure code compiles and runs correctly on both platforms.
Prefer Devices::GPU over Devices::Cuda || Devices::Hip in if constexpr checks:
There should be no case where the code path is different for each backend.
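As an illustration of the preferred check, here is a minimal host-only sketch. The device tags, the `Path` enum and `selectPath()` are stand-ins invented for this example, and it assumes `Devices::GPU` names whichever GPU backend the build enables. It is not a copy of the TNL sources.

```cpp
#include <type_traits>

namespace sketch {

// Stand-in device tags; the real ones live in TNL's Devices namespace.
namespace Devices {
struct Host {};
struct Cuda {};
struct Hip {};

// Assumption: GPU names whichever GPU backend the build enables.
#if defined(__HIP__)
using GPU = Hip;
#else
using GPU = Cuda;
#endif
}  // namespace Devices

// Hypothetical result type, used only to make the chosen branch observable.
enum class Path { Host, Gpu };

// Preferred: a single check against the unified GPU tag.
template <typename Device>
constexpr Path selectPath()
{
    if constexpr (std::is_same_v<Device, Devices::GPU>)
        return Path::Gpu;   // one shared code path for CUDA and HIP
    else
        return Path::Host;
}

// Discouraged: naming both backends separately, e.g.
//   if constexpr (std::is_same_v<Device, Devices::Cuda>
//                 || std::is_same_v<Device, Devices::Hip>)

}  // namespace sketch
```

With a single tag, a backend-specific branch cannot be introduced by accident, which matches the rule that both backends share one code path.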
NVIDIA GPUs always use warp size 32. AMD GPUs have a variable wavefront size: 64 on gfx8/gfx9 (GCN, CDNA) and 32 on gfx10+ (RDNA). Furthermore, AMD does not provide a way to get the wavefront size at compile time: the __AMDGCN_WAVEFRONT_SIZE__ macro that previously provided it was deprecated in ROCm 7.0 and removed in ROCm 7.2 (see release notes). The result is a host/device compilation split on HIP: architecture macros like __GFX8__ and __GFX9__ are defined only during device compilation, so host code never sees them.
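The host/device split can be made concrete with a small anti-pattern sketch. `naiveWarpSize()` is a hypothetical helper written for this illustration, not a TNL function. It derives the warp size from the architecture macros named above.

```cpp
// Anti-pattern sketch: deriving the warp size from architecture macros.
constexpr int naiveWarpSize()
{
#if defined(__GFX8__) || defined(__GFX9__)
    return 64;  // only reachable in a device compilation pass for gfx8/gfx9
#else
    return 32;  // every host pass lands here, because the macros are undefined
#endif
}
```

In a host compilation pass this function yields 32 regardless of the GPU that will run the kernel. A launch configuration computed on the host from such a helper would therefore disagree with device code compiled for a wave64 target.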
Use the warp-size API from <TNL/Backend/LaunchHelpers.h>:
| Function | Scope | Return value |
|---|---|---|
| getWarpSize() | Device only | Actual warp size for the target architecture |
| getMaxWarpSize() | Host + device | Maximum across all architectures in the build |
| getMinWarpSize() | Host + device | Always 32 |
| getWarpSize(deviceId) | Host only | Runtime query via hipDeviceGetAttribute |
Rules:
The typical pattern for guard-plus-runtime-dispatch is:
This ensures TPS=64 kernel instantiations exist in the HIP fat binary for wave64 devices, while CUDA builds pay zero binary bloat.
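A host-only sketch of how a guard-plus-runtime-dispatch could be structured is given below. It is an assumption-laden illustration, not the TNL code: `launchKernel` and `dispatch` are hypothetical names, the template parameter `MaxWarpSize` stands in for the compile-time value of `getMaxWarpSize()`, and `runtimeWarpSize` stands in for the result of `getWarpSize(deviceId)`.

```cpp
namespace sketch {

// Stand-in for a kernel launch; returns the TPS it was instantiated with.
template <int TPS>
constexpr int launchKernel()
{
    return TPS;
}

// MaxWarpSize plays the role of getMaxWarpSize() (compile-time constant).
// runtimeWarpSize plays the role of getWarpSize(deviceId) (runtime query).
template <int MaxWarpSize>
constexpr int dispatch(int runtimeWarpSize)
{
    if constexpr (MaxWarpSize >= 64) {
        // Guard: the TPS=64 instantiation is compiled only when the build
        // can contain a wave64 architecture.
        if (runtimeWarpSize == 64)
            return launchKernel<64>();
    }
    // Default path, shared by CUDA and wave32 HIP devices.
    return launchKernel<32>();
}

}  // namespace sketch
```

Because the guard is an `if constexpr` on a template-dependent value, `launchKernel<64>` is never instantiated when `MaxWarpSize` is 32. In this sketch that is the mechanism by which a 32-only build avoids the extra instantiation.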
Use cooperative groups instead of __syncwarp(), which is not available on HIP. TNL provides a namespace alias in <TNL/Backend/Functions.h>:
Inside GPU kernels:
HIP shuffle intrinsics (__shfl_down_sync etc.) differ from CUDA in two ways:
Wrap all CUDA/HIP runtime API calls with TNL_BACKEND_SAFE_CALL(call). This maps to cudaGetErrorString/hipGetErrorString and throws BackendRuntimeError on failure.
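The general shape of such a checking macro can be sketched without a GPU toolchain. Everything below is a stand-in written for illustration: `Status` replaces `cudaError_t`/`hipError_t`, `getErrorString` replaces the backend's error-string function, `RuntimeErrorSketch` replaces `BackendRuntimeError`, and `SKETCH_SAFE_CALL` is a hypothetical macro name. The actual TNL macro may differ in its details.

```cpp
#include <stdexcept>
#include <string>

namespace sketch {

// Stand-in for cudaError_t / hipError_t.
enum class Status { Success, InvalidValue };

// Stand-in for cudaGetErrorString / hipGetErrorString.
inline const char* getErrorString(Status s)
{
    return s == Status::Success ? "no error" : "invalid argument";
}

// Stand-in for the exception type the text calls BackendRuntimeError.
struct RuntimeErrorSketch : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Turns a failing status into an exception that records the call site.
inline void checkCall(Status s, const char* file, int line)
{
    if (s != Status::Success)
        throw RuntimeErrorSketch(std::string(getErrorString(s)) + " at " + file
                                 + ":" + std::to_string(line));
}

// Stand-in for a runtime API call that can succeed or fail.
inline Status fakeApiCall(bool ok)
{
    return ok ? Status::Success : Status::InvalidValue;
}

}  // namespace sketch

// Hypothetical macro: evaluates the call once and checks its status.
#define SKETCH_SAFE_CALL(call) ::sketch::checkCall((call), __FILE__, __LINE__)
```

Wrapping every call this way keeps error handling uniform: a failing call surfaces as an exception carrying the backend's message and the call site, instead of a status code that is silently ignored.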