GLM-5 / GLM-5.1 (MoE + DSA)
GLM-5 and GLM-5.1 are Zhipu AI’s latest open-source large Mixture-of-Experts models. They combine DeepSeek-style MLA (Multi-head Latent Attention) with DSA (DeepSeek Sparse Attention). GLM-5.1 uses the same glm_moe_dsa architecture as GLM-5, with updated weights.
Key Features
- Mixture of Experts (MoE): 256 routed experts with 8 active per token
- 78 layers, hidden size 6144, with MLA using KV compression (kv_lora_rank=512) and head_dim=64
- ~200k context window (max_position_embeddings=202,752)
- 3 dense layers followed by MoE layers (first_k_dense_replace=3)
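The figures above can be collected into a small summary object to make the derived quantities explicit. This is a minimal sketch, not the model's actual configuration class: the class name `GlmMoeDsaSummary` and the field names `num_layers`, `hidden_size`, `n_routed_experts`, and `experts_per_token` are illustrative assumptions. Only `kv_lora_rank`, `head_dim`, `first_k_dense_replace`, and `max_position_embeddings` are names taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlmMoeDsaSummary:
    """Illustrative summary of the values listed under Key Features.

    Hypothetical helper for this page only; it is not the Hugging Face
    or NeMo AutoModel config class.
    """

    num_layers: int = 78                      # total transformer layers
    hidden_size: int = 6144
    kv_lora_rank: int = 512                   # MLA KV compression rank
    head_dim: int = 64
    n_routed_experts: int = 256               # assumed field name
    experts_per_token: int = 8                # assumed field name
    first_k_dense_replace: int = 3            # leading dense layers
    max_position_embeddings: int = 202_752    # ~200k context window

    @property
    def num_moe_layers(self) -> int:
        # Every layer after the first `first_k_dense_replace` is MoE.
        return self.num_layers - self.first_k_dense_replace

    @property
    def active_expert_fraction(self) -> float:
        # Share of routed experts that process any given token.
        return self.experts_per_token / self.n_routed_experts

    def is_moe_layer(self, layer_idx: int) -> bool:
        # Zero-based layer index; layers 0..2 are dense with the defaults.
        if not 0 <= layer_idx < self.num_layers:
            raise IndexError(f"layer_idx {layer_idx} out of range")
        return layer_idx >= self.first_k_dense_replace


summary = GlmMoeDsaSummary()
print(summary.num_moe_layers)          # 75
print(summary.active_expert_fraction)  # 0.03125
```

With the listed values, 75 of the 78 layers are MoE layers, and each token is routed to about 3% of the routed experts in each of them.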
Available Models
- GLM-5 (GlmMoeDsaForCausalLM)
- GLM-5.1 (GlmMoeDsaForCausalLM): updated weights
Example HF Models
Example Recipes
Parallel Setup
The recipe scales training using Expert Parallelism and Pipeline Parallelism (EP=64, PP=4 across 32 nodes of 8× H100 GPUs).
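The arithmetic behind this layout can be checked with a short sketch. It assumes that the world size equals the number of nodes times the GPUs per node, that ranks are first split into pipeline stages, and that expert-parallel groups are formed within each stage. The function `parallel_layout` and its parameter names are hypothetical and are not part of NeMo AutoModel.

```python
def parallel_layout(
    num_nodes: int,
    gpus_per_node: int,
    ep_size: int,
    pp_size: int,
    num_routed_experts: int,
) -> dict:
    """Derive per-stage and per-rank quantities for an EP + PP layout.

    Hypothetical helper for illustration; assumes EP groups are formed
    inside each pipeline stage.
    """
    world_size = num_nodes * gpus_per_node
    if world_size % pp_size != 0:
        raise ValueError("world size must be divisible by pp_size")
    ranks_per_stage = world_size // pp_size
    if ranks_per_stage % ep_size != 0:
        raise ValueError("ranks per pipeline stage must be divisible by ep_size")
    if num_routed_experts % ep_size != 0:
        raise ValueError("routed experts must be divisible by ep_size")
    return {
        "world_size": world_size,
        "ranks_per_stage": ranks_per_stage,
        "ep_groups_per_stage": ranks_per_stage // ep_size,
        "experts_per_rank": num_routed_experts // ep_size,
    }


# Values from this page: 32 nodes x 8 GPUs, EP=64, PP=4, 256 routed experts.
layout = parallel_layout(
    num_nodes=32,
    gpus_per_node=8,
    ep_size=64,
    pp_size=4,
    num_routed_experts=256,
)
print(layout)
# {'world_size': 256, 'ranks_per_stage': 64,
#  'ep_groups_per_stage': 1, 'experts_per_rank': 4}
```

Under these assumptions, the 256 GPUs form 4 pipeline stages of 64 ranks each, and each rank holds 4 of the 256 routed experts per MoE layer.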
Try with NeMo AutoModel
1. Install (full instructions):
2. Clone the repo to get the example recipes:
This recipe was validated on 32 nodes × 8 GPUs (256 H100s). See the Launcher Guide for multi-node setup.
3. Run the recipe from inside the repo:
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory (where the recipes are):
3. Run the recipe:
See the Installation Guide and LLM Fine-Tuning Guide.
Fine-Tuning
See the LLM Fine-Tuning Guide and the Large MoE Fine-Tuning Guide.