What happens when you run a CUDA kernel? (fergusfinn.com)
281 points by mezark 1 day ago | 31 comments
tl;dr: A simple CUDA vector-add kernel exercises a surprisingly deep stack. nvcc chains cicc and ptxas to produce PTX and SASS, bundled into a fatbin that is embedded in an ELF, while the host stub registers the kernel and packs arguments into constant bank 0. At launch, libcuda builds a QMD, writes it as methods into a pushbuffer, advances GP_PUT, and rings an MMIO doorbell. The GPU's work distributor then spreads 4096 blocks across 128 SMs, with ptxas-encoded stall counts and scoreboard barriers governing warp eligibility. The kernel is ultimately DRAM-bandwidth bound, reaching ~80% of peak.
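The arithmetic behind the block distribution and the bandwidth figure can be sketched roughly as follows. This is a back-of-the-envelope model, not code from the article: the function names are hypothetical, it assumes 4-byte elements with two reads and one write per element, and the timing and peak-bandwidth values in the usage lines are made-up placeholders.

```python
# Rough model of a memory-bound vector add (c[i] = a[i] + b[i]).
# Function names and the sample hardware numbers are illustrative assumptions.

def blocks_per_sm(num_blocks: int, num_sms: int) -> float:
    """Average number of thread blocks each SM processes over the launch."""
    return num_blocks / num_sms


def bytes_moved(num_elements: int, bytes_per_element: int = 4) -> int:
    """DRAM traffic for vector add: read a[i], read b[i], write c[i]."""
    return 3 * num_elements * bytes_per_element


def achieved_fraction(num_elements: int,
                      kernel_seconds: float,
                      peak_bytes_per_second: float) -> float:
    """Fraction of peak DRAM bandwidth the kernel sustained."""
    achieved = bytes_moved(num_elements) / kernel_seconds
    return achieved / peak_bytes_per_second


# Figures from the summary: 4096 blocks spread over 128 SMs.
print(blocks_per_sm(4096, 128))  # → 32.0

# Placeholder measurement: 1,000,000 elements in 10 microseconds
# on a device with an assumed 1.5 TB/s peak.
print(achieved_fraction(1_000_000, 1e-5, 1.5e12))
```

Because each element costs 12 bytes of traffic and only one addition, a kernel like this is limited by memory bandwidth rather than compute, which is why the achieved fraction of peak bandwidth is the number worth measuring.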
HN Discussion:
  • Article is a valuable educational resource that would have helped during CUDA coursework
  • The CPU-to-driver-to-GPU path explanation (QMD, doorbell) fills a gap most tutorials miss
  • ~Technical corrections or additions, such as control codes being table lookups rather than simple bits
  • ~Much of the runtime API 'voodoo' can be avoided by using the driver API directly
  • Open hardware documentation exists, so kernel source isn't strictly needed for QMD/method formats