execution
- the profiling guide has a good general introduction
- compute capability
- organization: grid > block > warp > thread
kernel<<<blocks, threads, smem, stream>>>();
grid
- one kernel call corresponds to one grid
- a grid is split into many blocks
block
- also known as Cooperative Thread Array (CTA)
- block can only be assigned to one SM
- one SM can hold more than one block
- one block can have up to 1024 threads
warp
- group of 32 threads
- the threads in one warp are executed at the same time
If the number of threads in a CTA is not divisible by 32, the last warp will contain the remaining number of threads
thread
- threads are identified by their global index: blockIdx * blockDim + threadIdx (see the sketch after this list)
- blocks per grid: up to 3 dimensions
- threads per block: up to 3 dimensions
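A minimal sketch of the usual 1D index computation; the kernel name `vec_add`, the array names, and the launch shape are illustrative:

```cuda
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // global thread id: blockIdx * blockDim + threadIdx
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];       // guard: the last block may be partially used
}

// blocks per grid and threads per block can each be up to 3D (dim3); here both are 1D
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```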
memory
Type | Specifier | Position | Cache |
---|---|---|---|
register | N/A | on chip | N/A |
shared | `__shared__` | on chip | L1 |
global | N/A | off chip | L2 |
constant | `__constant__` | off chip | constant cache |
- L1 cache can be split between register spill (local memory) and shared memory, configured with `cudaDeviceSetCacheConfig`
- local memory (register spills and per-thread arrays) can only be accessed by the current thread
- register count can be controlled when compiling with `-maxrregcount`
  - or deduced from `__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)`
- local memory is physically stored in global memory
  - usually served through the cache
  - the access is controlled by the compiler
- shared memory can be shared across a block
- sync is most likely required
- read-only memory: instruction cache, constant memory, texture memory and RO cache
- constant memory is filled from the host with `cudaMemcpyToSymbol` (see the sketch below)
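A minimal sketch of constant memory usage, assuming a small coefficient table; the symbol `coeffs` and the helper `upload_coeffs` are made up for illustration:

```cuda
__constant__ float coeffs[16];                     // lives in device memory, read through the constant cache

__global__ void apply(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * coeffs[i % 16];    // uniform reads across a warp are broadcast
}

void upload_coeffs(const float* host_coeffs) {
    // the host writes the __constant__ symbol before the kernel launch
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}
```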
host memory
- host vs device
- the PCIe link is usually the bottleneck
- `cudaHostAlloc` (pinned host memory) vs `cudaMalloc` (device memory)
- `cudaHostRegister` pins an existing host allocation
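A minimal sketch contrasting pinned and pageable host buffers; the sizes are arbitrary and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 24;
    float *pinned, *pageable, *dev;

    cudaMalloc(&dev, bytes);                               // device memory
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);   // pinned host memory: DMA-friendly, faster over PCIe
    pageable = static_cast<float*>(malloc(bytes));         // pageable host memory: staged by the driver

    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

    // cudaHostRegister pins an existing allocation in place
    cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaHostUnregister(pageable);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```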
shared memory
- static: `__shared__ int s[64];`
- dynamic: `extern __shared__ int s[];`
  - requires the smem size argument in the kernel launch (see the sketch after this list)
- shared memory is accessed via banks
  - for the same address:
    - reads are broadcast: OK
    - writes are collapsed, only one thread's write lands (which one is undefined)
  - for the same bank but different addresses, reads/writes are serialized: bad
  - for different banks, reads/writes proceed in parallel at full bandwidth
  - how is the configurable bank width implemented in hardware?
Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks that can be accessed simultaneously
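A minimal sketch of a block reduction using dynamically sized shared memory and the smem launch argument; the kernel name is illustrative and blockDim.x is assumed to be a power of two:

```cuda
__global__ void block_sum(const int* in, int* out, int n) {
    extern __shared__ int s[];                         // size is supplied at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();                                   // every thread in the block must reach this

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];   // consecutive threads hit different banks
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// the third <<<>>> argument is the dynamic shared memory size in bytes:
// block_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
```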
vectorized memory access
- we can generate a wider copy instruction by using int2, int4 or float2
- requires alignment (e.g. 16 bytes for int4 / float4)
- fill multiple cache lines in a single fetch
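A minimal sketch of a vectorized copy; it assumes `n` is a multiple of 4 and that the pointers are 16-byte aligned (which `cudaMalloc` guarantees):

```cuda
__global__ void copy_vec4(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 4) {
        // one 128-bit load and one 128-bit store per thread instead of four 32-bit ones
        reinterpret_cast<float4*>(out)[i] = reinterpret_cast<const float4*>(in)[i];
    }
}

// launch with a quarter of the threads of the scalar version:
// copy_vec4<<<(n / 4 + 255) / 256, 256>>>(d_in, d_out, n);
```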
unified memory
- `cudaMallocManaged()`
- `cudaMemAdvise()`
- `cudaMemPrefetchAsync()`
- `cudaStreamAttachMemAsync()`
Global memory is a 49-bit virtual address space that is mapped to physical memory on the device, pinned system memory, or peer memory.
- supposedly based on page faults
- cannot be accessed concurrently from host and device (pre-Pascal)
- pages are only populated when first touched, so an allocation can first appear on the gpu
- how are the cpu and gpu MMUs kept in sync?
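A minimal sketch of managed memory with prefetching, assuming a single GPU (device 0); the kernel and sizes are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));       // one pointer, usable on host and device

    for (int i = 0; i < n; ++i) data[i] = 1.0f;        // pages are populated on first touch (here: host)

    cudaMemPrefetchAsync(data, n * sizeof(float), 0);  // migrate pages to device 0 before the kernel
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                           // sync before touching the data on the host again

    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);  // optionally move pages back
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```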
(not) uniform memory access
- for UMA, cpu should be able to operate directly on the memory
- in cuda, the driver takes care of memory transfer?
memory best practice
- multi-thread + multi-word + multi-iteration
- coalesced access: let consecutive threads access consecutive data
- otherwise, multiple fetches might be needed
- scatter & gather exist, but avoid if you can
- scatter: read seq, write rand
- gather: read rand, write seq
- array of structs (aos) vs struct of arrays (soa)
  - aos is preferred in single-threaded code
  - aos becomes strided access under simt
  - soa is preferred under simt (see the sketch after this list)
- trove: a library that converts aos access into soa-style access for execution on cuda
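A minimal sketch of why soa coalesces while aos strides; the struct layouts and kernel names are illustrative:

```cuda
struct ParticleAoS  { float x, y, z, w; };     // one 16-byte record per particle
struct ParticlesSoA { float *x, *y, *z, *w; }; // one array per field

__global__ void scale_aos(ParticleAoS* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;      // neighbouring threads touch addresses 16 bytes apart: strided
}

__global__ void scale_soa(ParticlesSoA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;      // neighbouring threads touch consecutive floats: coalesced
}
```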
host sync
- kernel calls are async wrt host
- use `CUDA_LAUNCH_BLOCKING=1` to debug kernel launches
- multiple kernels can execute at the same time
  - which requires the kernels to be launched in different streams (see the sketch below)
- the default stream is synchronized with all other streams
  - behavior can be changed when the stream is set up: `cudaStreamCreateWithFlags(..., cudaStreamNonBlocking)`
  - or via a compile option: `nvcc --default-stream per-thread`
- `cudaStreamAddCallback`
- `cudaStreamCreateWithPriority`
- `cudaDeviceSynchronize`
- `cudaStreamSynchronize`
- `cudaStreamWaitEvent`
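A minimal sketch of overlapping transfers and kernels with two non-blocking streams; the kernel, sizes, and use of pinned host buffers (needed for copies to actually run asynchronously) are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void twice(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, bytes = n * sizeof(float);
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&h[i], bytes, cudaHostAllocDefault);        // pinned, so async copies can overlap
        cudaMalloc(&d[i], bytes);
        cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);  // does not sync with the default stream
    }
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        twice<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);        // 4th launch argument selects the stream
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);      // wait for each stream on the host
    return 0;
}
```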
device sync
- `__syncthreads()`: waits for all threads in the same block
  - deadlocks if some threads do not make the call
cooperative group
To achieve sub-block sync, we use cooperative groups.
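A minimal sketch of sub-block sync with a 32-thread tile; the kernel is illustrative and assumes the input length is a multiple of the block size:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tile_sum(const float* in, float* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);   // sub-block group of 32 threads

    float v = in[block.group_index().x * block.size() + block.thread_rank()];

    // only the 32 threads of this tile need to agree; no block-wide __syncthreads() required
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);
}
```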
nvprof metric
- issue efficiency
- branch_efficiency
- gld_throughput
- inst_per_warp
  - divergence can cause a high instruction count per warp
- stall_sync
occupancy
- achieved vs theoretical
- bound by hardware
  - Warps per SM
  - Blocks per SM
- bound by resource sharing
  - Registers per SM
  - Shared Memory per SM
A warp is considered active from the time its threads begin executing to the time when all threads in the warp have exited from the kernel.
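A minimal sketch of querying theoretical occupancy from the runtime; the kernel and block size are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    int threads = 256, max_blocks = 0;
    // how many blocks of this kernel can be resident per SM, given its register/smem usage
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, threads, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (max_blocks * threads) / float(prop.maxThreadsPerMultiProcessor);
    printf("theoretical occupancy: %.2f\n", occupancy);
    return 0;
}
```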
Hyper-Q
Starting from Kepler, CUDA kernels can be processed concurrently on the same GPU.
Multi-Process Service (MPS)
Enables multiple processes to cooperate.
Starting from Volta, MPS clients talk to the GPU without passing through the MPS server.
Multi-Instance GPU (MIG)
Supported on A100. Lets the GPU be partitioned into separate GPU instances. I suppose it is the best fit for multiple users.
tensor core
- tensor cores have a fixed calculation pipeline: `out = A * B + C`
- accessed via `nvcuda::wmma` (see the sketch below)
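A minimal sketch of a single 16x16x16 tile multiply via `nvcuda::wmma`; it assumes compute capability 7.0+, a launch with one warp (32 threads), and the matrix layouts are illustrative:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// computes one 16x16 tile: C = A * B + C, with fp16 inputs and fp32 accumulation
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                      // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             // the fixed out = A * B + C pipeline
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```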
todo
access
shuffle