GLM-5 and GLM-5.1

GLM-5 and GLM-5.1 are large sparse Mixture-of-Experts (MoE) language models that combine Multi-Latent Attention (MLA) with DeepSeek Sparse Attention (DSA). Megatron Bridge supports both checkpoints through the shared `GLM5Bridge`.

Supported Variants

| Variant | Hugging Face ID | Notes |
| --- | --- | --- |
| GLM-5 | `zai-org/GLM-5` | MoE + MLA + DSA architecture |
| GLM-5.1 | `zai-org/GLM-5.1` | Same architecture and mapping shape as GLM-5 |

Architecture Notes

- Uses the `GlmMoeDsaForCausalLM` architecture with 78 transformer layers.
- The first 3 layers are dense; the remaining 75 layers use MoE.
- Each MoE layer has 256 routed experts with top-8 routing, plus one shared expert.
- Uses MLA together with the DSA indexer, which is configured by the `index_head_dim`, `index_n_heads`, and `index_topk` parameters.
- Requires `transformers` >= 5.2.0.
- DSA requires the `fast-hadamard-transform` CUDA extension and MCore support for the DSA experimental attention variant.
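
The layer layout and the `transformers` version requirement above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not part of Megatron Bridge: `Glm5Layout`, `version_tuple`, and `meets_minimum` are hypothetical names, layer indices are assumed to be 0-based, and the shared expert is assumed to be active for every token in an MoE layer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Glm5Layout:
    """Hypothetical summary of the numbers listed above (not a Megatron Bridge class)."""

    num_layers: int = 78
    num_dense_layers: int = 3      # the first 3 layers are dense
    num_routed_experts: int = 256
    experts_per_token: int = 8     # top-8 routing
    num_shared_experts: int = 1

    @property
    def num_moe_layers(self) -> int:
        return self.num_layers - self.num_dense_layers

    @property
    def active_experts_per_token(self) -> int:
        # Assumption: the shared expert runs for every token in an MoE layer.
        return self.experts_per_token + self.num_shared_experts

    def is_moe_layer(self, layer_idx: int) -> bool:
        # Assumption: 0-based indexing, with the dense layers first.
        if not 0 <= layer_idx < self.num_layers:
            raise IndexError(f"layer_idx {layer_idx} is out of range")
        return layer_idx >= self.num_dense_layers


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric part of each dotted component, padded to 3 fields."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.2.0") -> bool:
    # Simplification: pre-release suffixes such as "rc1" are ignored.
    return version_tuple(installed) >= version_tuple(minimum)
```

For a real preflight check, the installed version string could be obtained with `importlib.metadata.version("transformers")` from the standard library and passed to `meets_minimum`.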

Examples

For conversion, inference, dependency notes, hardware requirements, and MCore patch requirements, see the GLM-5 examples README.