RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8 (imil.net)
281 points by iMil 2 days ago | 103 comments
tl;dr: The author combined an RTX 5080 (16GB) and a refurbished RTX 3090 (24GB) on an Asus Prime X570-Pro to run Qwen 3.6 27B at Q8 with a 230k-token context across 39GB of VRAM. Key setup details include enabling Above 4G Decoding and ReBAR and disabling CSM in the BIOS, building llama.cpp with `CMAKE_CUDA_ARCHITECTURES="86;120"` and NCCL off, and using tensor split mode with MTP+ngram speculative decoding. The result is 80-90 tokens/sec generation, with both cards running on PCIe Gen4 at x8/x8.
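
To make the memory budget in the summary concrete, here is a rough back-of-the-envelope sketch. It assumes exactly 27e9 parameters, every tensor stored as GGML Q8_0 (8.5 bits per weight, since each 34-byte block holds 32 int8 weights plus one fp16 scale), and decimal gigabytes. Real GGUF files usually keep some tensors at other precisions, so treat the numbers as an estimate rather than the author's measurement. The function names are hypothetical, and the 39GB total is taken from the summary as given.

```python
# Rough VRAM budget sketch. All inputs are assumptions for illustration,
# not measurements from the linked article.

# GGML Q8_0 stores 32 int8 weights plus one fp16 scale per block:
# 34 bytes per 32 weights = 8.5 bits per weight.
Q8_0_BITS_PER_WEIGHT = 8.5


def weight_footprint_gb(n_params: float,
                        bits_per_weight: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """Approximate size of the quantized weights in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


def headroom_gb(total_vram_gb: float, n_params: float,
                bits_per_weight: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """VRAM left for KV cache, compute buffers, and the draft path."""
    return total_vram_gb - weight_footprint_gb(n_params, bits_per_weight)


if __name__ == "__main__":
    n_params = 27e9        # assumed parameter count for a "27B" model
    total_vram_gb = 39.0   # combined VRAM figure quoted in the summary
    print(f"weights:  {weight_footprint_gb(n_params):.2f} GB")
    print(f"headroom: {headroom_gb(total_vram_gb, n_params):.2f} GB")
```

Under these assumptions, the weights take most of the combined memory, and whatever remains has to hold the KV cache for the long context plus compute buffers. That is presumably why the model has to be split across both cards rather than placed on either one alone.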
HN Discussion: