RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8 (imil.net)
281 points by iMil 2 days ago | 103 comments
tl;dr: Author combined an RTX 5080 (16GB) and a refurbished RTX 3090 (24GB) on an Asus Prime X570-Pro to run Qwen 3.6 27B at Q8 with a 230k context across 39GB of VRAM. Key setup details include enabling Above 4G Decoding/ReBAR, disabling CSM, building llama.cpp with `CMAKE_CUDA_ARCHITECTURES="86;120"` and NCCL off, and using tensor split mode with MTP+ngram speculative decoding. The result: 80-90 tokens/sec generation, with PCIe running at x8/x8 Gen4.
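As a rough sanity check on why the model needs both cards, the weight footprint can be estimated from the quantization format. This is a minimal sketch, assuming every tensor is stored as GGUF Q8_0 (blocks of 32 int8 weights plus a 2-byte fp16 scale, so 8.5 bits per weight), that "27B" means exactly 27e9 parameters, and that GB means 1e9 bytes. Real GGUF files mix tensor types, so the actual file size will differ somewhat. The 39 GB total is the figure quoted in the summary above; the function and variable names are illustrative.

```python
# Rough VRAM budget for a Q8_0 model split across two GPUs.
# Assumptions (not from the article): all tensors are Q8_0,
# the parameter count is exactly 27e9, and 1 GB = 1e9 bytes.

Q8_0_BLOCK_WEIGHTS = 32  # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 34    # 32 int8 values + one 2-byte fp16 scale


def q8_0_weight_gb(n_params: float) -> float:
    """Approximate size of the weights in GB at 8.5 bits per weight."""
    return n_params * Q8_0_BLOCK_BYTES / Q8_0_BLOCK_WEIGHTS / 1e9


weights_gb = q8_0_weight_gb(27e9)
total_vram_gb = 39.0  # combined VRAM figure quoted in the summary
headroom_gb = total_vram_gb - weights_gb  # left for KV cache and buffers

print(f"weights:  {weights_gb:.1f} GB")
print(f"headroom: {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone exceed the 24GB of the RTX 3090, which is why the second card is needed, and whatever remains has to hold the KV cache for the long context plus compute buffers.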
HN Discussion:
  • Confirms similar setup performance and shares experience preferring local Qwen over Claude
  • Critiques author's parameter choices and suggests recommended Qwen settings instead
  • Reports comparable or better tok/s on alternative hardware setups, validating MTP approach
  • Questions cost-effectiveness vs cloud given electricity prices
  • Wishes article had more theory and explanation rather than just a recipe