How to setup a local coding agent on macOS

	How to setup a local coding agent on macOS(ikyle.me)
	475 points by kkm 47 days ago \| 117 comments
	tl;dr: Running Gemma 4 26B-A4B locally on an M1 Max via llama.cpp with Metal hits 58 tok/s, but adding a Q8 MTP draft model for speculative decoding (with `--spec-draft-n-max 3`) boosts it to 72 tok/s — faster than equivalent MLX setups. Pairing llama-server's OpenAI-compatible endpoint with the Pi terminal agent (configured for both text and image input via the multimodal projector) yields a usable local coding agent with screenshot support. Qwen3.6 35B-A3B is a stronger coder but runs slower at ~55 tok/s.
	HN Discussion: ~MoE models like DeepSeek-V4-Flash work better on unified RAM Macs than the article's choice ↓The benchmark methodology is flawed because 128 tokens is too short to measure MTP speedup accurately ↓Simpler tools like LM Studio, omlx.ai, or ollama+opencode achieve the same setup with less effort ~The huggingface-cli step is unnecessary since llama.cpp can download models directly ~Personal experience confirms MTP speedup is marginal and Gemma 4 MTP head breaks markup in Opencode