GLM-5.1: Z.ai 744B Open-Weight Model Tops SWE-Bench Pro, Beats GPT-5.4 and Claude Opus 4.6

Z.ai just released GLM-5.1, a 744-billion-parameter open-weight model that claims the top spot on SWE-Bench Pro — the most rigorous public benchmark for AI coding agents. More striking: it was trained entirely on Huawei chips with zero involvement from Nvidia, making it a significant milestone for Chinese AI infrastructure independence.

What GLM-5.1 Is

GLM-5.1 is a post-training upgrade to Z.ai's GLM-5 foundation model, using a Mixture-of-Experts architecture with 744 billion total parameters and 40 billion active parameters per token. It has a 200,000-token context window and supports up to 131,072 output tokens — making it suitable for very long-form coding and agentic tasks.
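A quick back-of-envelope on those numbers shows why the MoE design matters: only about 5% of the parameters run for any given token, which keeps per-token compute manageable, though the full weight set still has to fit in memory. A minimal sketch, using the article's parameter counts; the bytes-per-parameter figures are standard precision sizes, not anything Z.ai has published:

```python
# Back-of-envelope sizing for GLM-5.1's MoE configuration.
# Parameter counts are from the article; precision sizes are generic.
TOTAL_PARAMS = 744e9    # total parameters across all experts
ACTIVE_PARAMS = 40e9    # parameters activated per token

# Fraction of the network that actually runs for each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {active_fraction:.1%}")  # ~5.4%

# Rough weight-storage footprint (weights only; excludes KV cache
# and activations, which grow with the 200K-token context).
for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    gib = TOTAL_PARAMS * bytes_per_param / 2**30
    print(f"{name} weights: ~{gib:,.0f} GiB")
```

The takeaway: inference compute scales with the 40B active parameters, but serving the model still requires hundreds of gigabytes of weight storage even at 8-bit precision.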

The model is open-weight under the MIT license and available on Hugging Face, meaning any developer or organization can download, fine-tune, and deploy it without licensing fees or API rate limits.

The Benchmark Numbers

On SWE-Bench Pro — a benchmark requiring AI to fix real GitHub issues in large codebases — GLM-5.1 scored 58.4, edging out GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). It is the first open-weight model to claim the top position on that benchmark, which has historically been dominated by closed frontier models.

Additional results: 68.7 on CyberGym (up from 48.3 for GLM-5), 68.0 on BrowseComp, 70.6 on τ³-Bench, and 71.8 on MCP-Atlas. On Z.ai's own internal coding evaluation using Claude Code as the test harness, GLM-5.1 scored 45.3 versus Claude Opus 4.6's 47.9 — within 2.6 points of the current best closed model.

The Agentic Capability That Stands Out

GLM-5.1 is designed specifically for long-horizon agentic tasks. It can operate autonomously for more than eight hours, executing hundreds of sequential steps, making thousands of tool calls, and self-correcting across extended workflows without human intervention. That kind of sustained autonomous execution is rare even among frontier models.
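The core pattern behind that kind of sustained execution — plan a step, run a tool, feed failures back to the model, and keep going — can be sketched in a few lines. Nothing below is GLM-5.1's actual scaffolding; the planner, tool interface, and retry policy are illustrative stand-ins for how an agent harness typically wires a model into a self-correcting loop:

```python
from typing import Callable

def run_agent(plan_step: Callable[[str], str],
              execute: Callable[[str], tuple[bool, str]],
              goal: str, max_steps: int = 500, max_retries: int = 3) -> list[str]:
    """Execute sequential steps toward `goal`, retrying failed tool calls.

    `plan_step` stands in for the model choosing the next action;
    `execute` stands in for a tool call returning (success, result).
    """
    history: list[str] = []
    state = goal
    for _ in range(max_steps):
        action = plan_step(state)
        if action == "DONE":
            break
        ok, result = execute(action)
        retries = 0
        # Self-correction: on failure, re-plan with the error appended
        # to the state so the model can adjust its next attempt.
        while not ok and retries < max_retries:
            action = plan_step(f"{state}\nFAILED: {result}")
            ok, result = execute(action)
            retries += 1
        history.append(action)
        state = result
    return history

# Toy demo: a two-step "planner" and a tool that fails its first call,
# exercising the retry path.
calls = {"n": 0}
def plan(state: str) -> str:
    if "step2-done" in state:
        return "DONE"
    if "step1-done" in state:
        return "step2"
    return "step1"
def tool(action: str) -> tuple[bool, str]:
    calls["n"] += 1
    if calls["n"] == 1:
        return False, "transient error"
    return True, f"{action}-done"

print(run_agent(plan, tool, goal="fix issue"))  # ['step1', 'step2']
```

What distinguishes frontier agentic models is not the loop itself but staying coherent inside it for hundreds of iterations without drifting or looping.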

The improvements over GLM-5 come entirely from reinforcement learning and alignment refinements — not additional pre-training. Z.ai extracted substantially more agentic capability from the existing foundation through post-training alone, an increasingly common and cost-effective strategy in the post-GPT-4 era.

The Bottom Line

GLM-5.1 is the strongest evidence yet that open-weight models can compete at the frontier — specifically on coding and agentic tasks. For developers who want frontier-level coding capability without per-token cloud costs, and organizations that cannot or will not send code to a third-party API, GLM-5.1 is now the most compelling option available. The Nvidia-free training story also signals that Chinese AI labs are building serious infrastructure alternatives, regardless of export control pressures.