CUDA Learning

build with cuda

nvcc

-Xcompiler -fPIC: pass -fPIC through to the host compiler
-Xptxas -v: pass -v to ptxas, which prints per-kernel register and shared memory usage
-keep: keep all intermediate files (in the current directory, or the one given by --keep-dir)
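A hypothetical invocation combining these flags (file and library names are made up):

# build a shared library, print ptxas resource usage, keep intermediate files
nvcc -Xcompiler -fPIC -Xptxas -v -keep -shared -o libkernels.so kernels.cu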

cmake legacy

# find nvcc and libs
find_package(CUDA)
CUDA_NVCC_FLAGS
# root dir
CUDA_TOOLKIT_ROOT_DIR

find cuda
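A minimal sketch of the legacy flow (project and file names are made up):

cmake_minimum_required(VERSION 3.8)
project(demo CXX)

# FindCUDA locates nvcc and the toolkit libraries;
# set CUDA_TOOLKIT_ROOT_DIR if the toolkit is in a non-default location
find_package(CUDA REQUIRED)
list(APPEND CUDA_NVCC_FLAGS -O3 -lineinfo)

# cuda_add_executable comes from the FindCUDA module
cuda_add_executable(demo main.cu)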

cmake modern

# find nvcc
enable_language(CUDA)
CMAKE_CUDA_FLAGS
# find libs, starting from cmake 3.17
FindCUDAToolkit
CUDAToolkit_ROOT
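A minimal sketch of the modern flow (target and file names are made up; FindCUDAToolkit needs CMake 3.17+):

cmake_minimum_required(VERSION 3.17)
# listing CUDA here is equivalent to calling enable_language(CUDA) later
project(demo LANGUAGES CXX CUDA)

add_executable(demo main.cu host.cpp)
set_target_properties(demo PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

# FindCUDAToolkit exposes the toolkit libraries as imported CUDA:: targets;
# set CUDAToolkit_ROOT to hint a non-default install
find_package(CUDAToolkit REQUIRED)
target_link_libraries(demo PRIVATE CUDA::cublas)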

cmake property, cmake variable

misc

nvcc basics

ptx: intermediate assembly
cubin: device code binary for a single architecture
fatbin: fat binary that may contain multiple ptx and cubin files

-link: the default nvcc mode; compile and link everything into an executable.
-compile (-c): only compile to object files.
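A quick way to produce each artifact, assuming a hypothetical kernels.cu:

nvcc --ptx    kernels.cu -o kernels.ptx     # virtual-ISA assembly
nvcc --cubin  kernels.cu -o kernels.cubin   # binary for one real architecture
nvcc --fatbin kernels.cu -o kernels.fatbin  # container for multiple ptx/cubin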

separate compilation

By default (and prior to CUDA 5.0), all CUDA device code has to live in a single translation unit. Since CUDA 5.0 we can set up separate compilation as described here.

CUDA works by embedding device code into host objects. In whole program compilation, it embeds executable device code into the host object. In separate compilation, we embed relocatable device code into the host object, and run nvlink, the device linker, to link all the device code together. The output of nvlink is then linked together with all the host objects by the host linker to form the final executable.

-rdc: control the generation of relocatable device code.
-dlink: invoke the device linker only; it links relocatable device code, ptx, cubin, and fatbin into a host object.
Note: in the default nvcc flow the device linker is still invoked, but it does nothing if no relocatable device code is present.
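A minimal sketch of the two-step flow (file names and the g++ host link step are assumptions):

# compile each unit with relocatable device code
nvcc -rdc=true -c a.cu -o a.o
nvcc -rdc=true -c b.cu -o b.o
# device-link all relocatable device code into one host object
nvcc -dlink a.o b.o -o dlink.o
# host link; add -L<cuda>/lib64 if cudadevrt/cudart are not on the default path
g++ a.o b.o dlink.o -lcudadevrt -lcudart -o app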

nvcc numeric stability

more on this in ncg post

--use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true (plus the fast intrinsic math functions)
--ftz: flush denormal single-precision values to zero
--prec-div: use IEEE-compliant single-precision division when true
--prec-sqrt: use IEEE-compliant single-precision square root when true
--fmad: allow contraction of multiplies and adds into fused multiply-adds
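These flags can also be flipped the other way to trade speed for IEEE-compliant, more reproducible results; a hypothetical invocation:

nvcc --fmad=false --prec-div=true --prec-sqrt=true --ftz=false -o app app.cu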

dynamic parallelism

https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/
https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/

nvidia-smi order

By default, the CUDA runtime enumerates devices in a different order (fastest first) than nvidia-smi (PCI bus order). So always set the environment in the following way.

# nvidia-smi lists devices in PCI bus order, but the cuda runtime does not by default.
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="0,2"

sanitizer

ASAN mmaps a huge region of virtual memory (the shadow) to record the status of the actual memory. It poisons the shadow at allocation boundaries and for unallocated and freed memory, and it replaces malloc and free so that these red zones can be set up and accesses to poisoned memory detected.

CUDA unified memory creates the unified address space by mapping host and device memory into it. Unfortunately, CUDA's virtual address reservations can conflict with ASAN's shadow memory.

We can:

  • move ASAN's shadow memory from its fixed static location to a dynamically chosen one (see the sketch below).
    • -mllvm -asan-force-dynamic-shadow=true
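A hypothetical build under these assumptions (clang as the host compiler, made-up file names; -asan-force-dynamic-shadow is an internal LLVM option, hence -mllvm):

# host code: ASAN with a dynamically placed shadow region
clang++ -fsanitize=address -mllvm -asan-force-dynamic-shadow=true -c host.cpp -o host.o
# device code compiled separately with nvcc
nvcc -c kernels.cu -o kernels.o
# link; add -L<cuda>/lib64 if cudart is not on the default search path
clang++ -fsanitize=address host.o kernels.o -lcudart -o app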