GLM 5.2 beats Claude in our benchmarks (semgrep.dev)
928 points by jms703 19 hours ago | 424 comments
tl;dr: Semgrep benchmarked open-weight and frontier models on IDOR vulnerability detection and found Zhipu AI's GLM 5.2 scored 39% F1 with just a bare prompt, beating Claude Code (32%) at roughly $0.17 per bug found. Both were beaten by Semgrep's own multimodal pipeline (53-61% F1), suggesting the harness/scaffolding matters more than the underlying model. The authors caution this is a single task on one dataset, but argue that GLM 5.2's performance at ~1/6 the cost of frontier models, plus the ability to run locally, makes open weights newly viable for security teams.
HN Discussion:
  • GLM 5.2 is a genuinely capable, cost-effective workhorse for daily coding tasks
  • Open Chinese models are catching up or surpassing US frontier models, especially in specific domains like cybersecurity
  • ~Other open models like DeepSeek may actually outperform GLM 5.2 across broader benchmarks
  • The article's title and conclusions are misleading; one narrow benchmark doesn't generalize, and the terminology is sloppy
  • ~Coding-focused evaluation ignores broader concerns like model bias and non-programmer use cases