Performance per dollar is getting faster and cheaper

	Performance per dollar is getting faster and cheaper (wafer.ai)
	281 points by latchkey 15 hours ago | 100 comments
	tl;dr: Wafer benchmarked GLM-5.2 on AMD's MI355X (roughly 2.75x cheaper than NVIDIA's B300) and hit 213 tok/s single-stream and 2626 tok/s/node aggregate throughput, about 80% of B200 performance at less than half the cost. Getting there required MXFP4 quantization via AMD Quark, a switch to sglang, fixes for two ROCm bugs that blocked speculative decoding, and tuning of the MoE kernel selection, but notably no custom kernels. The takeaway: NVIDIA's CUDA moat is increasingly about day-0 model support rather than fundamental software superiority.
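As a rough worked example of the performance-per-dollar claim, the two ratios quoted in the summary (about 80% of B200 performance, at less than half the cost) can be combined into a single figure. This is only a sketch: the function name `relative_perf_per_dollar` is hypothetical, "cost" is taken as whatever price basis the article used, and 0.5 is used as an upper bound on the cost ratio, so the result is a lower bound.

```python
def relative_perf_per_dollar(perf_ratio: float, cost_ratio: float) -> float:
    """Performance per dollar of system A relative to system B.

    perf_ratio: A's throughput divided by B's throughput.
    cost_ratio: A's cost divided by B's cost.
    """
    if perf_ratio <= 0 or cost_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / cost_ratio


# Figures from the summary: about 80% of B200 performance at under half the cost.
# 0.5 is an assumed upper bound on the cost ratio, so this is a lower bound.
lower_bound = relative_perf_per_dollar(perf_ratio=0.80, cost_ratio=0.5)
print(f"MI355X vs B200 performance per dollar: at least {lower_bound:.2f}x")
```

Under these assumptions the MI355X delivers at least 1.6x the performance per dollar of the B200. The figure does not account for the quantization and power concerns raised in the discussion.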
	HN Discussion:
	↓ FP4 quantization degrades model quality, undermining the performance claims
	↓ Headlines should disclose quantization since benchmarks aren't comparing full-fat models
	• Requesting performance-per-watt metrics to better evaluate AMD's competitiveness
	↓ The aggregate throughput number is misleading versus real single-stream throughput
	~ NVIDIA's next-gen Rubin will leapfrog Blackwell on inference, limiting AMD's window