GLM-5 and GLM-5.1

GLM-5 and GLM-5.1 are large sparse Mixture-of-Experts (MoE) language models that use Multi-Latent Attention (MLA) and Dynamic Sparse Attention (DSA). Megatron Bridge supports both checkpoints through the shared GLM5Bridge.

Supported Variants

| Variant | Hugging Face ID | Notes |
|---------|-----------------|-------|
| GLM-5 | zai-org/GLM-5 | MoE + MLA + DSA architecture |
| GLM-5.1 | zai-org/GLM-5.1 | Same architecture and mapping shape as GLM-5 |
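As a minimal sketch of how the two variants listed above might be selected in code, the helper below maps a variant name to its Hugging Face ID and builds a bridge from it. The names `SUPPORTED_VARIANTS`, `hf_id_for`, and `load_bridge` are hypothetical and exist only for this illustration. The sketch assumes Megatron Bridge is installed and that `AutoBridge.from_hf_pretrained` resolves both checkpoints to GLM5Bridge; consult the examples README for the supported workflow.

```python
# Hugging Face IDs for the two supported variants listed in this section.
SUPPORTED_VARIANTS = {
    "GLM-5": "zai-org/GLM-5",
    "GLM-5.1": "zai-org/GLM-5.1",
}


def hf_id_for(variant: str) -> str:
    """Return the Hugging Face ID for a supported variant name."""
    try:
        return SUPPORTED_VARIANTS[variant]
    except KeyError:
        known = ", ".join(sorted(SUPPORTED_VARIANTS))
        raise ValueError(
            f"unknown variant {variant!r}; expected one of: {known}"
        ) from None


def load_bridge(variant: str):
    """Build a bridge for the given variant (hypothetical helper)."""
    # Imported lazily so hf_id_for() works without Megatron Bridge installed.
    from megatron.bridge import AutoBridge

    # Assumption: AutoBridge dispatches both checkpoints to GLM5Bridge.
    return AutoBridge.from_hf_pretrained(hf_id_for(variant))
```

Calling `load_bridge("GLM-5.1")` would need network access to the checkpoint plus the dependencies described under Architecture Notes, so only the lookup helper is cheap to exercise on its own.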

Architecture Notes

  • GlmMoeDsaForCausalLM architecture with 78 transformer layers.

  • The first 3 layers are dense; the remaining 75 layers use MoE.

  • 256 routed experts with top-8 routing and one shared expert per MoE layer.

  • Uses MLA together with a DSA indexer, configured through index_head_dim, index_n_heads, and index_topk.

  • Requires transformers >= 5.2.0.

  • DSA requires the fast-hadamard-transform CUDA extension and MCore support for the DSA experimental attention variant.
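The layer and expert counts in the notes above can be made concrete with a small sketch. The `LayerLayout` class and its field names are hypothetical and are not part of Megatron Bridge or the Hugging Face config. The sketch also assumes that dense layers come first by 0-based index and that the shared expert processes every token alongside the 8 routed experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerLayout:
    """Illustrative summary of the dense/MoE layer split (hypothetical names)."""

    num_layers: int = 78
    num_dense_layers: int = 3
    num_routed_experts: int = 256
    experts_per_token: int = 8  # top-8 routing
    num_shared_experts: int = 1

    def is_moe_layer(self, layer_idx: int) -> bool:
        """Return True if the 0-based layer index falls in the MoE range."""
        if not 0 <= layer_idx < self.num_layers:
            raise IndexError(f"layer index {layer_idx} out of range")
        return layer_idx >= self.num_dense_layers

    @property
    def num_moe_layers(self) -> int:
        """Number of layers that use MoE instead of a dense MLP."""
        return self.num_layers - self.num_dense_layers

    @property
    def active_experts_per_token(self) -> int:
        """Experts applied to each token in one MoE layer.

        Assumption: the shared expert is always active in addition to
        the routed experts selected by top-k routing.
        """
        return self.experts_per_token + self.num_shared_experts

    @property
    def total_routed_experts(self) -> int:
        """Routed expert instances summed over all MoE layers."""
        return self.num_moe_layers * self.num_routed_experts


layout = LayerLayout()
print(layout.num_moe_layers)            # 75
print(layout.active_experts_per_token)  # 9
print(layout.total_routed_experts)      # 19200
```

Under these assumptions each token touches 9 of the 257 experts in an MoE layer, which is why the model is described as sparse.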

Examples

For conversion, inference, dependency notes, hardware requirements, and MCore patch requirements, see the GLM-5 examples README.
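Before working through the README, the two Python-level requirements from the architecture notes can be checked with a short preflight sketch. It is not part of Megatron Bridge: the helper names are hypothetical, the import name `fast_hadamard_transform` is an assumption about how the fast-hadamard-transform package is exposed, and the check covers neither the MCore DSA support nor any MCore patches.

```python
import importlib.util
import re
from importlib import metadata

# Minimum transformers release stated in the architecture notes.
MIN_TRANSFORMERS = (5, 2, 0)


def parse_version(version: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '5.2.0.dev0' -> (5, 2, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad short versions such as '5.2' to three components.
    return parts + (0,) * (3 - len(parts))


def transformers_ok(installed: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version meets the minimum.

    Pre-release suffixes are ignored, so '5.2.0.dev0' counts as 5.2.0.
    """
    return parse_version(installed) >= minimum


def preflight() -> dict:
    """Report which Python-level requirements appear to be satisfied."""
    try:
        tf_version = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        tf_version = None
    return {
        "transformers": tf_version is not None and transformers_ok(tf_version),
        # Assumption: the CUDA extension installs a module with this name.
        "fast_hadamard_transform": (
            importlib.util.find_spec("fast_hadamard_transform") is not None
        ),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

The version parser uses only the standard library to keep the sketch dependency-free; a project that already depends on `packaging` could compare `packaging.version.Version` objects instead.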