execution
- the profiling guide has a good general introduction
- compute capability
- organization: grid > block > warp > thread
kernel<<<blocks, threads, smem, stream>>>();
grid
- one kernel call corresponds to one grid
- a grid is split into many blocks
block
- also known as Cooperative Thread Array (CTA)
- block can only be assigned to one SM
- one SM can hold more than one block
- one block can have up to 1024 threads
warp
- group of 32 threads
- the threads in one warp are executed at the same time
If the number of threads in a CTA is not divisible by 32, the last warp will contain the remaining number of threads
thread
- threads are identified by their global index: blockIdx * blockDim + threadIdx (see the sketch after this list)
- blocks per grid: up to 3 dimensions
- threads per block: up to 3 dimensions
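A minimal sketch of the usual 1D index computation; the kernel name `vec_add`, the array names, and the launch shape are illustrative:

```cuda
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // global thread id: blockIdx * blockDim + threadIdx
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];       // guard: the last block may be partially used
}

// blocks per grid and threads per block can each be up to 3D (dim3); here both are 1D
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```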
memory
Type | Specifier | Position | Cache |
---|---|---|---|
register | N/A | on chip | N/A |
shared | `__shared__` | on chip | L1 |
global | N/A | off chip | L2 |
constant | `__constant__` | off chip | constant cache |
- L1 cache can be split between register spill (local memory) and shared memory, configured with `cudaDeviceSetCacheConfig`
- local memory (register spills and per-thread arrays) can only be accessed by the current thread
- register count can be controlled when compiling with `-maxrregcount`
  - or deduced from `__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)`
- local memory is physically stored in global memory
  - usually served through the cache
  - the access is controlled by the compiler
- shared memory can be shared across a block
- sync is most likely required
- read-only memory: instruction cache, constant memory, texture memory and RO cache
- constant memory is filled from the host with `cudaMemcpyToSymbol` (see the sketch below)
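A minimal sketch of constant memory usage, assuming a small coefficient table; the symbol `coeffs` and the helper `upload_coeffs` are made up for illustration:

```cuda
__constant__ float coeffs[16];                     // lives in device memory, read through the constant cache

__global__ void apply(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * coeffs[i % 16];    // uniform reads across a warp are broadcast
}

void upload_coeffs(const float* host_coeffs) {
    // the host writes the __constant__ symbol before the kernel launch
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}
```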
host memory
- host vs device
- the PCIe link is usually the bottleneck
- `cudaHostAlloc` (pinned host memory) vs `cudaMalloc` (device memory)
- `cudaHostRegister` pins an existing host allocation
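A minimal sketch contrasting pinned and pageable host buffers; the sizes are arbitrary and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 24;
    float *pinned, *pageable, *dev;

    cudaMalloc(&dev, bytes);                               // device memory
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);   // pinned host memory: DMA-friendly, faster over PCIe
    pageable = static_cast<float*>(malloc(bytes));         // pageable host memory: staged by the driver

    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

    // cudaHostRegister pins an existing allocation in place
    cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaHostUnregister(pageable);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```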
shared memory
- static: `__shared__ int s[64];`
- dynamic: `extern __shared__ int s[];`
  - requires the smem size argument in the kernel launch (see the sketch after this list)
- shared memory is accessed via banks
  - for the same address:
    - reads are broadcast: OK
    - writes are collapsed, only one thread's write lands (which one is undefined)
  - for the same bank but different addresses, reads/writes are serialized: bad
  - for different banks, reads/writes proceed in parallel at full bandwidth
  - how is the configurable bank width implemented in hardware?
Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks that can be accessed simultaneously
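A minimal sketch of a block reduction using dynamically sized shared memory and the smem launch argument; the kernel name is illustrative and blockDim.x is assumed to be a power of two:

```cuda
__global__ void block_sum(const int* in, int* out, int n) {
    extern __shared__ int s[];                         // size is supplied at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();                                   // every thread in the block must reach this

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];   // consecutive threads hit different banks
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// the third <<<>>> argument is the dynamic shared memory size in bytes:
// block_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
```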
vectorized memory access
- we can generate a wider copy instruction by using int2, int4 or float2
- requires alignment (e.g. 16 bytes for int4 / float4)
- fill multiple cache lines in a single fetch
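A minimal sketch of a vectorized copy; it assumes `n` is a multiple of 4 and that the pointers are 16-byte aligned (which `cudaMalloc` guarantees):

```cuda
__global__ void copy_vec4(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 4) {
        // one 128-bit load and one 128-bit store per thread instead of four 32-bit ones
        reinterpret_cast<float4*>(out)[i] = reinterpret_cast<const float4*>(in)[i];
    }
}

// launch with a quarter of the threads of the scalar version:
// copy_vec4<<<(n / 4 + 255) / 256, 256>>>(d_in, d_out, n);
```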
unified memory
- `cudaMallocManaged()`
- `cudaMemAdvise()`
- `cudaMemPrefetchAsync()`
- `cudaStreamAttachMemAsync()`
Global memory is a 49-bit virtual address space that is mapped to physical memory on the device, pinned system memory, or peer memory.
- supposedly based on page faults
- cannot be accessed concurrently from host and device (pre-Pascal)
- pages are only populated when first touched, so an allocation can first appear on the gpu
- how are the cpu and gpu MMUs kept in sync?
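A minimal sketch of managed memory with prefetching, assuming a single GPU (device 0); the kernel and sizes are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));       // one pointer, usable on host and device

    for (int i = 0; i < n; ++i) data[i] = 1.0f;        // pages are populated on first touch (here: host)

    cudaMemPrefetchAsync(data, n * sizeof(float), 0);  // migrate pages to device 0 before the kernel
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                           // sync before touching the data on the host again

    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);  // optionally move pages back
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```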
(not) uniform memory access
- for UMA, cpu should be able to operate directly on the memory
- in cuda, the driver takes care of memory transfer?
memory best practice
- multi-thread + multi-word + multi-iteration
- coalesced access: let consecutive threads access consecutive data
- otherwise, multiple fetches might be needed
- scatter & gather exist, but avoid if you can
- scatter: read seq, write rand
- gather: read rand, write seq
- array of structs (aos) vs struct of arrays (soa)
  - aos is preferred in single-threaded code
  - aos becomes strided access under simt
  - soa is preferred under simt (see the sketch after this list)
- trove: a library that converts aos access into soa-style access for execution on cuda
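A minimal sketch of why soa coalesces while aos strides; the struct layouts and kernel names are illustrative:

```cuda
struct ParticleAoS  { float x, y, z, w; };     // one 16-byte record per particle
struct ParticlesSoA { float *x, *y, *z, *w; }; // one array per field

__global__ void scale_aos(ParticleAoS* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;      // neighbouring threads touch addresses 16 bytes apart: strided
}

__global__ void scale_soa(ParticlesSoA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;      // neighbouring threads touch consecutive floats: coalesced
}
```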
host sync
- kernel calls are async wrt host
- use `CUDA_LAUNCH_BLOCKING=1` to debug kernel launches
- multiple kernels can execute at the same time
  - which requires the kernels to be launched in different streams (see the sketch below)
- the default stream is synchronized with all other streams
  - behavior can be changed when the stream is set up: `cudaStreamCreateWithFlags(..., cudaStreamNonBlocking)`
  - or via a compile option: `nvcc --default-stream per-thread`
- `cudaStreamAddCallback`
- `cudaStreamCreateWithPriority`
- `cudaDeviceSynchronize`
- `cudaStreamSynchronize`
- `cudaStreamWaitEvent`
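A minimal sketch of overlapping transfers and kernels with two non-blocking streams; the kernel, sizes, and use of pinned host buffers (needed for copies to actually run asynchronously) are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void twice(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, bytes = n * sizeof(float);
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&h[i], bytes, cudaHostAllocDefault);        // pinned, so async copies can overlap
        cudaMalloc(&d[i], bytes);
        cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);  // does not sync with the default stream
    }
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        twice<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);        // 4th launch argument selects the stream
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);      // wait for each stream on the host
    return 0;
}
```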
device sync
- `__syncthreads()`: waits for all threads in the same block
  - deadlocks if some threads do not make the call
cooperative group
To achieve sub-block sync, we use cooperative groups.
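A minimal sketch of sub-block sync with a 32-thread tile; the kernel is illustrative and assumes the input length is a multiple of the block size:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tile_sum(const float* in, float* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);   // sub-block group of 32 threads

    float v = in[block.group_index().x * block.size() + block.thread_rank()];

    // only the 32 threads of this tile need to agree; no block-wide __syncthreads() required
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);
}
```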
nvprof metric
- issue efficiency
- branch_efficiency
- gld_throughput
- inst_per_warp
  - divergence can cause a high instruction count per warp
- stall_sync
occupancy
- achieved vs theoretical
- bound by hardware
  - Warps per SM
  - Blocks per SM
- bound by resource sharing
  - Registers per SM
  - Shared Memory per SM
A warp is considered active from the time its threads begin executing to the time when all threads in the warp have exited from the kernel.
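A minimal sketch of querying theoretical occupancy from the runtime; the kernel and block size are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    int threads = 256, max_blocks = 0;
    // how many blocks of this kernel can be resident per SM, given its register/smem usage
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, threads, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (max_blocks * threads) / float(prop.maxThreadsPerMultiProcessor);
    printf("theoretical occupancy: %.2f\n", occupancy);
    return 0;
}
```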
Hyper-Q
Starting from Kepler, CUDA kernels can be processed concurrently on the same GPU.
Multi-Process Service (MPS)
Enables multiple processes to cooperate.
Starting from Volta, MPS clients talk to the GPU without passing through the MPS server.
Multi-Instance GPU (MIG)
Supported on A100. Lets the GPU be partitioned into separate GPU instances. I suppose it is the best fit for multiple users.
tensor core
- tensor cores have a fixed calculation pipeline: `out = A * B + C`
- accessed via `nvcuda::wmma` (see the sketch below)
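A minimal sketch of a single 16x16x16 tile multiply via `nvcuda::wmma`; it assumes compute capability 7.0+, a launch with one warp (32 threads), and the matrix layouts are illustrative:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// computes one 16x16 tile: C = A * B + C, with fp16 inputs and fp32 accumulation
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                      // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             // the fixed out = A * B + C pipeline
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```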
todo
access
shuffle