Work and Note

CUDA Learning

Follow me on GitHub

build with cuda

nvcc

-Xcompiler -fPIC
-Xptxas -v
-keep: store all intermediate file in user folder

cmake legacy

# find nvcc and libs
find_package(CUDA)
CUDA_NVCC_FLAGS
# root dir
CUDA_TOOLKIT_ROOT_DIR

find cuda

cmake modern

# find nvcc
enable_language(CUDA)
CMAKE_CUDA_FLAGS
# find libs, starting from cmake 3.17
FindCUDAToolkit
CUDAToolkit_ROOT

cmake property cmake variable

misc

nvcc basics

ptx: intermediate assembly file cubin: device code binary for a single architecture fatbin: fat binary that may contain multiple ptx and cubin

-link: the default nvcc option. Compile and link all codes. -compile: only compile to object file.

fatbin

typically, a fatbin will contain one ptx because ptx can be JIT compiled for future GPUs. and several cubin for AOT compiled code for current architectures.

In order to view what ptx / sass is in the binary, we can use cuobjdump.

architecture-accelerated features

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availabilityßß

CUDA 12.0 introduced the concept of “architecture-accelerated features” whose PTX does not have forward compatibility guarantees

Architecture-Specific Feature Set: compute_100a Family-Specific Feature Set: compute_100f

-gencode arch=sm_100 is usually used as short hand for -gencode arch=sm_100,code=sm_100 -gencode arch=compute_30,code=sm_52. This means to generate ptx for compute 30 but generate cubin for sm_52.

separate compilation

By default (and prior to cuda 5.0), device cuda code needs to all be within one file. Now, we can setup separate compilation as described here.

CUDA works by embedding device code into host objects. In whole program compilation, it embeds executable device code into the host object. In separate compilation, we embed relocatable device code into the host object, and run nvlink, the device linker, to link all the device code together. The output of nvlink is then linked together with all the host objects by the host linker to form the final executable.

-rdc: control the generation of relocatable device code. -dlink: invoke device linker only. link rdc, ptx, cubin, fatbin into a host object Note: in default nvcc, device linker will be invoked. But it will do nothing if rdc is not found.

nvcc numeric stability

more on this in ncg post

--use_fast_math = --ftz=true --prec-div=false --prec-sqrt=false --fmad=true
--ftz
--prec-div
--prec-sqrt
--fmad

dynamic parallel

https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/ https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/

nvidia-smi order

By default, cuda runtime sees a different order vs nvidia-smi. So, always set the environment in the following way.

# nvidia-smi list in pci bus order. But cuda does not.
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="0,2"

sanitizer

ASAN mmap a huge virtual memory in order to record the status of actual memory. ASAN will poison virtual memory at boundary, unallocated, freed position. ASAN will replace malloc and free to check for poison.

Cuda unified memory create the unified address space by mapping host and device memory into it. Unfortunately, cuda virtual memory conflict with asan memory.

We can:

  • move the static virtual memory of asan.
    • -mllvm -asan-force-dynamic-shadow=true