What happens when you run a CUDA kernel?

	What happens when you run a CUDA kernel? (fergusfinn.com)
	281 points by mezark 1 day ago | 31 comments
	tl;dr: A simple CUDA vector-add kernel exercises an enormous stack. At build time, nvcc chains cicc and ptxas to produce PTX and SASS, which are bundled into a fatbin and embedded in an ELF; the host stub registers the kernel and packs the arguments that end up in constant bank 0. At launch, libcuda builds a QMD, writes it as methods into a pushbuffer, advances GP_PUT, and rings an MMIO doorbell. The GPU's work distributor then spreads 4096 blocks across 128 SMs, with ptxas-encoded stall counts and scoreboard barriers governing warp eligibility. The kernel is ultimately limited by DRAM bandwidth, running at ~80% of peak.
	HN Discussion:
	↑ The article is a valuable educational resource that would have helped during CUDA coursework
	↑ The explanation of the CPU-to-driver-to-GPU path (QMD, doorbell) fills a gap most tutorials miss
	~ Technical corrections and additions, such as control codes being table lookups rather than simple bits
	~ Much of the runtime API 'voodoo' can be avoided by using the driver API directly
	• Open hardware documentation exists, so kernel source isn't strictly needed for QMD/method formats
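
As a rough check on the figures in the tl;dr, the launch geometry and memory traffic of a vector add can be worked out by hand. The sketch below is only illustrative: the element count (2**20 floats), the block size (256 threads), and the peak bandwidth (1e12 bytes/s) are assumptions chosen so that the grid comes out to the 4096 blocks mentioned above; none of them are taken from the article.

```python
def launch_geometry(n_elements, threads_per_block, num_sms):
    """Return (blocks in a 1D grid, average blocks per SM)."""
    # Usual ceiling division used to size a CUDA grid.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, blocks / num_sms


def vector_add_bytes(n_elements, bytes_per_element=4):
    """DRAM traffic for c[i] = a[i] + b[i]: two loads and one store per element."""
    return 3 * n_elements * bytes_per_element


def runtime_at_fraction_of_peak(total_bytes, peak_bytes_per_s, fraction):
    """Time to move total_bytes when achieving `fraction` of peak bandwidth."""
    return total_bytes / (peak_bytes_per_s * fraction)


# Assumed problem size and block size (not from the article).
N = 1 << 20
THREADS_PER_BLOCK = 256
NUM_SMS = 128            # SM count quoted in the tl;dr
PEAK_BW = 1e12           # hypothetical peak DRAM bandwidth in bytes/s

blocks, blocks_per_sm = launch_geometry(N, THREADS_PER_BLOCK, NUM_SMS)
traffic = vector_add_bytes(N)
seconds = runtime_at_fraction_of_peak(traffic, PEAK_BW, 0.8)

print(f"{blocks} blocks, {blocks_per_sm:.0f} per SM on average")
print(f"{traffic} bytes of DRAM traffic")
print(f"{seconds * 1e6:.2f} microseconds at 80% of the assumed peak")
```

Under these assumptions each SM sees 32 blocks on average, and the kernel does almost no arithmetic per byte moved, which is why a bandwidth bound rather than a compute bound is the expected limit for this kind of kernel.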