build with cuda
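-Xcompiler -fPIC: pass -fPIC through to the host compiler
-Xptxas -v: pass -v to ptxas, which prints per-kernel register and memory usage
-keep: keep all intermediate files, by default in the current directory (use -keep-dir to choose another)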
cmake legacy
# find nvcc and libs
find_package(CUDA)
CUDA_NVCC_FLAGS
# root dir
CUDA_TOOLKIT_ROOT_DIR
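As a rough illustration of how the legacy variables above fit together, here is a minimal sketch of a CMakeLists.txt. The project name, `demo`, `main.cu`, and the toolkit path are hypothetical. FindCUDA is deprecated, so recent CMake versions may warn about it or refuse it depending on policy settings.

```cmake
cmake_minimum_required(VERSION 3.5)
project(legacy_cuda_demo CXX)

# Optional hint when the toolkit is not in a default location (assumed path).
# set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda)

# Legacy FindCUDA module: locates nvcc and the toolkit libraries.
find_package(CUDA REQUIRED)

# Extra nvcc arguments are kept as a CMake list (semicolon separated).
list(APPEND CUDA_NVCC_FLAGS -Xptxas -v)

# cuda_add_executable routes .cu files through nvcc.
cuda_add_executable(demo main.cu)

# FindCUDA uses the plain target_link_libraries signature internally,
# so no PRIVATE/PUBLIC keyword is used here.
target_link_libraries(demo ${CUDA_CUBLAS_LIBRARIES})
```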
cmake modern
# find nvcc
enable_language(CUDA)
CMAKE_CUDA_FLAGS
# find libs, starting from cmake 3.17
find_package(CUDAToolkit)  # provided by the FindCUDAToolkit module
CUDAToolkit_ROOT
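The modern pieces above can be combined roughly as follows. This is a minimal sketch, not a canonical setup: the project name, `demo`, `main.cu`, the toolkit path, and the choice of cuBLAS as the example library are all assumptions.

```cmake
cmake_minimum_required(VERSION 3.17)
project(cuda_demo LANGUAGES CXX)

# Optional hint for find_package(CUDAToolkit) (assumed path).
# set(CUDAToolkit_ROOT /usr/local/cuda)

# First-class CUDA language support: locates nvcc.
enable_language(CUDA)

# Global nvcc flags are a plain string, unlike the legacy CUDA_NVCC_FLAGS list.
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xptxas -v")

# Locates the toolkit libraries and defines imported targets such as CUDA::cublas.
find_package(CUDAToolkit REQUIRED)

add_executable(demo main.cu)
target_link_libraries(demo PRIVATE CUDA::cublas)
```

Imported targets carry their include directories and link flags with them, so no separate include or library path variables are needed.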
misc
nvcc basics
ptx
: intermediate assembly file
cubin
: device code binary for a single architecture
fatbin
: fat binary that may contain multiple ptx and cubin
-link
: the default nvcc mode. Compile and link all input files.
-c (--compile)
: compile each input file to an object file, without linking.
separate compilation
By default (and always prior to CUDA 5.0), nvcc uses whole program compilation: all device code that a kernel uses must be within one translation unit. Since CUDA 5.0, we can set up separate compilation as described below.
CUDA works by embedding device code into host objects. In whole program compilation, it embeds executable device code into the host object. In separate compilation, we embed relocatable device code into the host object, and run nvlink, the device linker, to link all the device code together. The output of nvlink is then linked together with all the host objects by the host linker to form the final executable.
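-rdc
: enable or disable generation of relocatable device code (-rdc=true or -rdc=false, default false). -dc is shorthand for -rdc=true -c.
-dlink
: invoke only the device linker. Links objects containing relocatable device code, plus any ptx, cubin, and fatbin inputs, into a host object with executable device code.
Note: when nvcc performs the final link, it invokes the device linker automatically. The device link step does nothing if no relocatable device code is found.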
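In CMake, the -rdc and device link steps above are driven by a target property rather than by hand-written flags. A minimal sketch, assuming hypothetical sources where `main.cu` calls a `__device__` function defined in `helpers.cu`:

```cmake
cmake_minimum_required(VERSION 3.18)
project(rdc_demo LANGUAGES CXX CUDA)

# Hypothetical sources: a kernel in main.cu calls a __device__
# function that is defined in helpers.cu.
add_executable(app main.cu helpers.cu)

# Compile CUDA sources as relocatable device code (-rdc=true).
# For an executable, CMake then adds the device link step before
# the final host link.
set_target_properties(app PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```

For static libraries the device link is not performed by default. The `CUDA_RESOLVE_DEVICE_SYMBOLS` target property controls whether it happens when the library itself is built.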
nvcc numeric stability
more on this in ncg post
--use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true
--ftz
--prec-div
--prec-sqrt
--fmad
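When the full fast-math bundle is too aggressive, the individual flags can be set one by one. The sketch below picks one possible combination (denormal flushing on, everything else kept precise) purely as an illustration. The target `demo` and `main.cu` are hypothetical, and the comments paraphrase the nvcc documentation.

```cmake
cmake_minimum_required(VERSION 3.17)
project(fast_math_demo LANGUAGES CXX CUDA)

add_executable(demo main.cu)

# Apply the flags only when compiling CUDA sources.
target_compile_options(demo PRIVATE
  # Flush single-precision denormals to zero (nvcc default: false).
  $<$<COMPILE_LANGUAGE:CUDA>:--ftz=true>
  # Keep IEEE round-to-nearest single-precision division (nvcc default: true).
  $<$<COMPILE_LANGUAGE:CUDA>:--prec-div=true>
  # Keep IEEE round-to-nearest single-precision square root (nvcc default: true).
  $<$<COMPILE_LANGUAGE:CUDA>:--prec-sqrt=true>
  # Do not contract multiply and add into a fused multiply-add (nvcc default: true).
  $<$<COMPILE_LANGUAGE:CUDA>:--fmad=false>)
```

Disabling --fmad tends to make device results easier to compare with host results, at some cost in speed.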
dynamic parallelism
https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/
https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/
nvidia-smi order
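By default, the CUDA runtime enumerates devices fastest first (CUDA_DEVICE_ORDER=FASTEST_FIRST), so its device indices may not match those shown by nvidia-smi. To make the two agree, set the environment as follows.
# nvidia-smi lists GPUs in PCI bus order; CUDA does not by default.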
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="0,2"
sanitizer
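ASAN mmaps a huge range of virtual memory (the shadow memory) in order to record the status of the application's actual memory.
ASAN poisons the shadow memory for allocation boundaries (redzones) and for unallocated and freed memory.
ASAN replaces malloc and free so that it can track and poison allocations, and instrumented memory accesses check for poison.
CUDA unified memory creates the unified address space by mapping host and device memory into it. Unfortunately, the virtual address range that CUDA reserves can conflict with ASAN's shadow memory.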
We can:
- move ASAN's shadow memory away from its static address by making the shadow base dynamic (clang/LLVM flag):
-mllvm -asan-force-dynamic-shadow=true
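One way the flag above could be wired into a build is sketched below. It assumes Clang is the host C++ compiler and that only host code is instrumented. The target `app` and `main.cpp` are hypothetical. Whether this actually resolves the conflict depends on the CUDA and LLVM versions in use.

```cmake
cmake_minimum_required(VERSION 3.17)
project(asan_cuda_demo LANGUAGES CXX)

find_package(CUDAToolkit REQUIRED)

# Hypothetical host-only program that calls the CUDA runtime API.
add_executable(app main.cpp)
target_link_libraries(app PRIVATE CUDA::cudart)

# Instrument host code with ASAN and ask LLVM for a dynamic shadow base.
# The SHELL: prefix keeps "-mllvm <arg>" together as one option group.
target_compile_options(app PRIVATE
  -fsanitize=address
  "SHELL:-mllvm -asan-force-dynamic-shadow=true")

# The ASAN runtime must also be linked in.
target_link_options(app PRIVATE -fsanitize=address)
```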