RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

	RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8 (imil.net)
	281 points by iMil 47 days ago | 103 comments
	tl;dr: Author combined an RTX 5080 (16GB) and a refurbished RTX 3090 (24GB) on an Asus Prime X570-Pro to run Qwen 3.6 27B at Q8 with a 230k context across 39GB of VRAM. Key setup details include enabling Above 4G Decoding/ReBAR, disabling CSM, building llama.cpp with `CMAKE_CUDA_ARCHITECTURES="86;120"` and NCCL off, and using tensor split mode with MTP+ngram speculative decoding. The result: 80-90 tokens/sec generation, with PCIe running at x8/x8 Gen4.
	HN Discussion: ↑ Confirms similar setup performance and shares experience preferring local Qwen over Claude; ↓ Critiques author's parameter choices and suggests recommended Qwen settings instead; ↑ Reports comparable or better tok/s on alternative hardware setups, validating MTP approach; ~ Questions cost-effectiveness vs cloud given electricity prices; ↓ Wishes article had more theory and explanation rather than just a recipe
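
	The VRAM figures in the tl;dr can be sanity-checked with some rough arithmetic. The Python sketch below is not from the article: it assumes exactly 27e9 parameters, assumes every tensor is stored as Q8_0 (32 int8 weights plus one fp16 scale per block, so 34 bytes per 32 weights), and uses a hypothetical helper, `proportional_split`, to derive a split ratio proportional to each card's nominal VRAM. The author's actual tensor split values are not given in the summary, so the ratio here is only one plausible starting point.

```python
# Rough VRAM budget for a 16 GB + 24 GB two-GPU setup (illustrative only).

def q8_0_weight_bytes(n_params: float) -> float:
    """Approximate size of a model stored entirely as Q8_0.

    Q8_0 packs 32 int8 weights and one fp16 scale into a 34-byte block,
    i.e. 8.5 bits per weight. Real GGUF files keep some tensors in other
    types, so treat this as an estimate, not an exact file size.
    """
    return n_params * 34 / 32


def proportional_split(vram_gb: list[float]) -> list[float]:
    """Hypothetical helper: split ratio proportional to each GPU's VRAM."""
    total = sum(vram_gb)
    return [v / total for v in vram_gb]


# Nominal capacities from the tl;dr (assumption: decimal GB, no overhead).
gpus = {"RTX 5080": 16.0, "RTX 3090": 24.0}

split = proportional_split(list(gpus.values()))
weights_gb = q8_0_weight_bytes(27e9) / 1e9      # assumed parameter count
headroom_gb = sum(gpus.values()) - weights_gb   # left for KV cache, buffers

if __name__ == "__main__":
    for name, ratio in zip(gpus, split):
        print(f"{name}: {ratio:.2f} of the tensors")
    print(f"Q8_0 weights: ~{weights_gb:.1f} GB")
    print(f"Nominal headroom: ~{headroom_gb:.1f} GB")
```

	Under these assumptions the weights take up most of the combined VRAM, and what remains has to hold the KV cache for the long context, the compute buffers, and whatever the speculative decoding setup needs, which is why the split and context settings matter on a mixed pair of cards like this.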