GLM-5 and GLM-5.1
GLM-5 and GLM-5.1 are large sparse Mixture-of-Experts (MoE) language models that use Multi-Latent Attention (MLA) and Dynamic Sparse Attention (DSA). Megatron Bridge supports both checkpoints through the shared `GLM5Bridge`.
Supported Variants
| Variant | Hugging Face ID | Notes |
|---|---|---|
| GLM-5 | | MoE + MLA + DSA architecture |
| GLM-5.1 | | Same architecture and mapping shape as GLM-5 |
Architecture Notes
- `GlmMoeDsaForCausalLM` architecture with 78 transformer layers.
- First 3 layers are dense; remaining layers use MoE.
- 256 routed experts with top-8 routing and one shared expert per MoE layer.
- Uses MLA plus DSA indexer parameters (`index_head_dim`, `index_n_heads`, `index_topk`).
- Requires `transformers >= 5.2.0`.
- DSA requires the `fast-hadamard-transform` CUDA extension and MCore support for the DSA experimental attention variant.
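The layer and expert counts above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic: the `Glm5Layout` class and its field names are hypothetical, are not part of Megatron Bridge or the Hugging Face config, and assume zero-based layer indices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glm5Layout:
    """Hypothetical container for the counts listed in the notes above."""

    num_layers: int = 78           # total transformer layers
    num_dense_layers: int = 3      # leading layers that are dense (no MoE)
    num_routed_experts: int = 256  # routed experts per MoE layer
    experts_per_token: int = 8     # top-8 routing
    num_shared_experts: int = 1    # shared expert per MoE layer

    def layer_kind(self, layer_idx: int) -> str:
        """Return "dense" or "moe" for a zero-based layer index (assumption)."""
        if not 0 <= layer_idx < self.num_layers:
            raise IndexError(f"layer index {layer_idx} out of range")
        return "dense" if layer_idx < self.num_dense_layers else "moe"

    @property
    def num_moe_layers(self) -> int:
        """Layers that remain after the leading dense layers."""
        return self.num_layers - self.num_dense_layers

    @property
    def active_experts_per_token(self) -> int:
        """Routed experts selected per token plus the shared expert."""
        return self.experts_per_token + self.num_shared_experts


layout = Glm5Layout()
kinds = [layout.layer_kind(i) for i in range(layout.num_layers)]
print(kinds.count("dense"), kinds.count("moe"))  # 3 75
print(layout.active_experts_per_token)           # 9
```

Under these assumptions, each token passes through 75 MoE layers and, in each of them, activates 9 of the 257 experts (8 routed plus the shared one).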
Examples
For conversion, inference, dependency notes, hardware requirements, and MCore patch requirements, see the GLM-5 examples README.
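Before running conversion, it may help to confirm that the two Python-level dependencies named in the architecture notes are present. The following is a minimal preflight sketch, not part of Megatron Bridge: `parse_version` and `check_glm5_requirements` are hypothetical helpers, and the import name `fast_hadamard_transform` is an assumption about how the extension is packaged. It does not check MCore support for the DSA attention variant.

```python
import importlib.util
from importlib import metadata


def parse_version(text):
    """Turn '5.2.0' or '5.3.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def check_glm5_requirements(min_transformers="5.2.0"):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if parse_version(installed) < parse_version(min_transformers):
            problems.append(
                f"transformers {installed} is older than {min_transformers}"
            )
    # Assumption: the CUDA extension is importable as `fast_hadamard_transform`.
    if importlib.util.find_spec("fast_hadamard_transform") is None:
        problems.append("fast-hadamard-transform is not importable")
    return problems


for problem in check_glm5_requirements():
    print("missing requirement:", problem)
```

The version comparison is deliberately simple: it ignores pre-release suffixes, so a release candidate of 5.2.0 is treated the same as 5.2.0 itself. A stricter check could use the `packaging` library instead.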