How to setup a local coding agent on macOS(ikyle.me)
475 points by kkm 2 days ago | 117 comments
tl;dr: Running Gemma 4 26B-A4B locally on an M1 Max via llama.cpp with Metal hits 58 tok/s, but adding a Q8 MTP draft model for speculative decoding (with `--spec-draft-n-max 3`) boosts it to 72 tok/s — faster than equivalent MLX setups. Pairing llama-server's OpenAI-compatible endpoint with the Pi terminal agent (configured for both text and image input via the multimodal projector) yields a usable local coding agent with screenshot support. Qwen3.6 35B-A3B is a stronger coder but runs slower at ~55 tok/s.
HN Discussion:
  • ~MoE models like DeepSeek-V4-Flash work better on unified RAM Macs than the article's choice
  • The benchmark methodology is flawed because 128 tokens is too short to measure MTP speedup accurately
  • Simpler tools like LM Studio, omlx.ai, or ollama+opencode achieve the same setup with less effort
  • ~The huggingface-cli step is unnecessary since llama.cpp can download models directly
  • ~Personal experience confirms MTP speedup is marginal and Gemma 4 MTP head breaks markup in Opencode