| What happens when you run a CUDA kernel? (fergusfinn.com) | |
| 281 points by mezark 1 day ago | 31 comments | |
tl;dr: Even a simple CUDA vector-add kernel exercises an enormous stack. nvcc chains cicc and ptxas to produce PTX and SASS, which are bundled into a fatbin embedded in an ELF, while the host stub registers the kernel and packs its arguments into constant bank 0. At launch, libcuda builds a QMD, writes it as methods into a pushbuffer, advances GP_PUT, and rings an MMIO doorbell. The GPU's work distributor then spreads 4096 blocks across 128 SMs, where ptxas-encoded stall counts and scoreboard barriers govern warp eligibility. The kernel is ultimately limited by DRAM bandwidth, at ~80% of peak. | |
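For readers who want something concrete to map onto the summary, here is a minimal sketch of a vector-add kernel of the kind described. It is not the article's code. The kernel name `vector_add`, the 256 threads per block, the input values, and the file name are all assumptions made for illustration; only the 4096-block grid size comes from the summary.

```cuda
// vector_add.cu (hypothetical file name); build with: nvcc -o vector_add vector_add.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element: c[i] = a[i] + b[i].
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int num_blocks = 4096;         // grid size quoted in the summary
    const int threads_per_block = 256;   // assumption, not stated in the summary
    const int n = num_blocks * threads_per_block;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Host buffers with simple, checkable contents (assumed test data).
    float* h_a = static_cast<float*>(malloc(bytes));
    float* h_b = static_cast<float*>(malloc(bytes));
    float* h_c = static_cast<float*>(malloc(bytes));
    if (!h_a || !h_b || !h_c) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Device buffers and host-to-device copies.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // The launch itself: this single line is what sets off the
    // driver-side work the summary describes.
    vector_add<<<num_blocks, threads_per_block>>>(d_a, d_b, d_c, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the result back and verify every element.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) {
            ++mismatches;
        }
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return mismatches == 0 ? 0 : 1;
}
```

To see the embedded PTX and SASS mentioned above, running `cuobjdump -ptx vector_add` and `cuobjdump -sass vector_add` on the resulting binary should dump both, though what is embedded depends on the architecture flags passed to nvcc.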
HN Discussion: