GLM 4.5#
GLM 4.5 is a family of large-scale Mixture-of-Experts (MoE) language models developed by Zhipu AI. Built on the GLM (General Language Model) architecture, GLM 4.5 introduces advanced features including sparse MoE layers, shared expert mechanisms, and Multi-Token Prediction (MTP) for improved training efficiency and performance.
GLM 4.5 models are designed for high-performance text generation and understanding tasks, offering excellent quality with efficient inference through sparse activation. The family includes two variants optimized for different deployment scenarios.
GLM 4.5 models are supported via the Bridge system with auto-detected configuration and weight mapping.
Available Models#
GLM 4.5 355B-A32B#
Full Model: zai-org/GLM-4.5 (355B total parameters, 32B active per token)
Architecture:
92 transformer layers (first 3 dense, remaining 89 MoE)
5120 hidden size, 12288 FFN hidden size
96 attention heads, 8 query groups (GQA)
160 experts per MoE layer, top-8 routing with 2.5x scaling factor
MoE FFN hidden size: 1536
Shared expert intermediate size: 1536
Includes QK LayerNorm for improved training stability
Context: 131,072 tokens
Vocabulary: 151,552 tokens
Optimizations:
Shared expert overlap (shared-expert compute overlapped with MoE dispatch/combine communication)
Router sigmoid scoring with expert bias
MTP (Multi-Token Prediction) with 1 layer, 0.3 scaling factor
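Most of the parameter budget sits in the routed experts, which is why the total and active counts differ so sharply. A back-of-envelope estimate from the dimensions above (routed experts only, counting the three weight matrices of a gated MLP and ignoring attention, dense layers, the shared expert, MTP, and embeddings):
# Rough estimate of routed-expert parameters for GLM 4.5 355B (illustrative arithmetic only)
hidden_size, moe_ffn_hidden_size = 5120, 1536
num_moe_layers, num_experts, top_k = 89, 160, 8

params_per_expert = 3 * hidden_size * moe_ffn_hidden_size      # gate, up, and down projections
total_expert_params = num_moe_layers * num_experts * params_per_expert
active_expert_params = num_moe_layers * top_k * params_per_expert

print(f"{total_expert_params / 1e9:.0f}B routed-expert params")   # ~336B of the 355B total
print(f"{active_expert_params / 1e9:.1f}B active per token")      # ~17B; attention, dense layers,
# the shared experts, and embeddings account for the remainder of the ~32B active parameters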
GLM 4.5 Air 106B-A12B#
Full Model: zai-org/GLM-4.5-Air (106B total parameters, 12B active per token)
Architecture:
46 transformer layers (first layer dense, remaining 45 MoE)
4096 hidden size, 10944 FFN hidden size
96 attention heads, 8 query groups (GQA)
128 experts per MoE layer, top-8 routing with 1.0x scaling factor
MoE FFN hidden size: 1408
Shared expert intermediate size: 1408
No QK LayerNorm
Context: 131,072 tokens
Vocabulary: 151,552 tokens
Optimizations:
Optimized for reduced memory footprint
Suitable for mid-range GPU clusters
Both models use RMSNorm, SiLU activation, gated linear units, and RoPE with 1M base frequency.
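The "SiLU activation, gated linear units" combination above is a SwiGLU-style feed-forward block. A minimal PyTorch sketch, using the Air variant's dense-FFN dimensions for illustration (not the actual Megatron module):
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """SwiGLU-style FFN: SiLU-gated up projection followed by a down projection."""
    def __init__(self, hidden_size: int = 4096, ffn_hidden_size: int = 10944):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise SiLU gate on the up projection, then project back to hidden_size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

y = GatedMLP()(torch.randn(1, 4096))  # one token through the dense FFN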
Model Architecture Features#
GLM 4.5 introduces several advanced architectural innovations:
Mixture-of-Experts (MoE): Sparse activation with 160 (355B) or 128 (Air) routed experts per layer, only 8 of which are active per token, enabling efficient scaling
Shared Expert Mechanism: A dedicated, always-active shared expert complements the routed experts; its computation can be overlapped with expert-parallel communication
Multi-Token Prediction (MTP): Predicts multiple future tokens simultaneously for better training efficiency
Configurable MTP layers (default: 1 layer)
Adjustable loss scaling factor (default: 0.3 for early training, 0.1 for later stages)
RoPE with Extended Context: Rotary embeddings with 1M base frequency for robust long-context modeling
Advanced Router Design (see the routing sketch after this list):
Sigmoid score function for better expert selection
Expert bias with configurable update rate
Auxiliary loss for load balancing
Grouped Query Attention (GQA): 8 query groups for efficient attention computation
RMSNorm: Fast layer normalization without mean centering
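To make the router behavior concrete, here is a simplified sketch of sigmoid-scored top-8 routing with an expert bias, using the 355B model's dimensions. The bias only influences which experts are selected, not the combination weights; the actual Megatron router also handles auxiliary-loss bookkeeping and parallelism, so treat this as illustrative only.
import torch

def route_tokens(hidden, router_weight, expert_bias, top_k=8, scaling=2.5):
    """Simplified sigmoid router: score every expert, pick top-k, renormalize, rescale."""
    logits = hidden @ router_weight.t()                    # [num_tokens, num_experts]
    scores = torch.sigmoid(logits)                         # sigmoid score function
    # The expert bias steers selection toward under-used experts for load balancing
    _, top_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    top_scores = torch.gather(scores, -1, top_idx)         # weights come from unbiased scores
    weights = scaling * top_scores / top_scores.sum(dim=-1, keepdim=True)
    return top_idx, weights

# Example with GLM 4.5 355B dimensions: 160 experts, top-8 routing, 2.5x scaling factor
idx, w = route_tokens(torch.randn(4, 5120), torch.randn(160, 5120), torch.zeros(160))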
Conversion with 🤗 Hugging Face#
Load HF → Megatron#
from megatron.bridge import AutoBridge
# Example: GLM 4.5 Air 106B
bridge = AutoBridge.from_hf_pretrained("zai-org/GLM-4.5-Air")
provider = bridge.to_megatron_provider()
# Configure parallelism before instantiating the model
# For Air 106B: TP=1, PP=4, EP=8 (32 GPUs)
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 4
provider.expert_model_parallel_size = 8
provider.sequence_parallel = True
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
Import HF → Megatron#
# Import GLM 4.5 Air model
python examples/conversion/convert_checkpoints.py import \
--hf-model zai-org/GLM-4.5-Air \
--megatron-path /models/glm45-air-106b
# Import GLM 4.5 355B model
python examples/conversion/convert_checkpoints.py import \
--hf-model zai-org/GLM-4.5 \
--megatron-path /models/glm45-355b
Export Megatron → HF#
python examples/conversion/convert_checkpoints.py export \
--hf-model zai-org/GLM-4.5-Air \
--megatron-path /results/glm45_air/checkpoints/iter_00001000 \
--hf-path ./glm45-air-hf-export
Run Inference on Converted Checkpoint#
python examples/conversion/hf_to_megatron_generate_text.py \
--hf_model_path zai-org/GLM-4.5-Air \
--megatron_model_path /models/glm45-air-106b \
--prompt "Explain quantum computing in simple terms." \
--max_new_tokens 200
Pretrain and Finetune Recipes#
Available Recipes#
Pretrain recipes:
glm45_355b_pretrain_config: Pre-training for GLM 4.5 355B
glm45_air_106b_pretrain_config: Pre-training for GLM 4.5 Air 106B
Finetune recipes:
glm45_355b_finetune_config: Finetuning for GLM 4.5 355B with PEFT support
glm45_air_106b_finetune_config: Finetuning for GLM 4.5 Air 106B with PEFT support
Parallelism Configurations#
GLM 4.5 355B-A32B#
| Mode | TP | PP | EP | Total GPUs | Use Case |
|---|---|---|---|---|---|
| Pretrain | 2 | 8 | 16 | 256 | Full pre-training |
| PEFT (LoRA/DoRA) | 2 | 4 | 4 | 32 | Parameter-efficient finetuning |
| Full SFT | 2 | 8 | 16 | 256 | Full supervised finetuning |
GLM 4.5 Air 106B-A12B#
| Mode | TP | PP | EP | Total GPUs | Use Case |
|---|---|---|---|---|---|
| Pretrain | 1 | 4 | 8 | 32 | Full pre-training |
| PEFT (LoRA/DoRA) | 1 | 2 | 4 | 8 | Parameter-efficient finetuning (single node) |
| Full SFT | 1 | 4 | 8 | 32 | Full supervised finetuning |
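The GPU totals in both tables are simply the product of the three parallelism degrees (additional GPUs would go toward data parallelism). A trivial helper, not part of the recipes API, just to make the arithmetic explicit:
# Minimum GPUs for a recipe configuration (hypothetical helper for illustration)
def min_gpus(tp: int, pp: int, ep: int) -> int:
    return tp * pp * ep

assert min_gpus(2, 8, 16) == 256   # GLM 4.5 355B pretrain / full SFT
assert min_gpus(1, 2, 4) == 8      # GLM 4.5 Air PEFT on a single node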
Pre-training Example#
from megatron.bridge.recipes.glm import glm45_air_106b_pretrain_config
# Create a pre-training configuration
config = glm45_air_106b_pretrain_config(
name="glm45_air_pretrain",
data_paths=["path/to/data"],
train_iters=100000,
global_batch_size=2048,
micro_batch_size=1,
lr=1e-4,
min_lr=1e-5,
# MTP configuration
mtp_num_layers=1,
mtp_loss_scaling_factor=0.3, # 0.3 for first 15T tokens, 0.1 after
# Recompute for memory efficiency
recompute_granularity="selective",
)
Finetuning Examples#
Full Finetuning (GLM 4.5 Air)#
from megatron.bridge.recipes.glm import glm45_air_106b_finetune_config
config = glm45_air_106b_finetune_config(
name="glm45_air_full_finetune",
pretrained_checkpoint="/models/glm45-air-106b",
peft=None, # Full finetuning
train_iters=1000,
global_batch_size=128,
micro_batch_size=1,
finetune_lr=5e-6,
)
LoRA Finetuning (GLM 4.5 Air)#
from megatron.bridge.recipes.glm import glm45_air_106b_finetune_config
config = glm45_air_106b_finetune_config(
name="glm45_air_lora_finetune",
pretrained_checkpoint="/models/glm45-air-106b",
peft="lora", # or "dora"
train_iters=1000,
global_batch_size=128,
micro_batch_size=1,
finetune_lr=1e-4,
# Uses TP=1, PP=2, EP=4 (8 GPUs) automatically
)
DoRA Finetuning (GLM 4.5 355B)#
from megatron.bridge.recipes.glm import glm45_355b_finetune_config
config = glm45_355b_finetune_config(
name="glm45_355b_dora_finetune",
pretrained_checkpoint="/models/glm45-355b",
peft="dora",
train_iters=1000,
global_batch_size=128,
micro_batch_size=1,
finetune_lr=1e-4,
# Uses TP=2, PP=4, EP=4 (32 GPUs) automatically
)
Command-Line Training#
# GLM 4.5 Air - LoRA finetuning on single node (8 GPUs)
torchrun --nproc-per-node=8 run/run_recipe.py \
--pretrained-checkpoint /models/glm45-air-106b \
--recipe glm45_air_106b_finetune_config \
peft=lora \
train.global_batch_size=128 \
train.train_iters=1000 \
checkpoint.save=$SAVE_DIR/glm45_air_lora
# GLM 4.5 355B - Full finetuning (256 GPUs)
torchrun --nnodes=32 --nproc-per-node=8 run/run_recipe.py \
--pretrained-checkpoint /models/glm45-355b \
--recipe glm45_355b_finetune_config \
peft=None \
train.global_batch_size=256 \
train.train_iters=1000 \
checkpoint.save=$SAVE_DIR/glm45_355b_full
Advanced Configuration#
Multi-Token Prediction (MTP)#
MTP can be configured for improved training efficiency:
config = glm45_355b_pretrain_config(
name="glm45_with_mtp",
mtp_num_layers=1, # Number of MTP prediction layers
mtp_loss_scaling_factor=0.3, # 0.3 early training, 0.1 later
# Set to None or 0 to disable MTP
)
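Conceptually, the loss from the MTP layer is scaled by this factor and added to the standard next-token loss. A rough illustration of the combination (not the actual Megatron loss code):
# Illustrative only: scaled MTP loss added to the main language-modeling loss
def combined_loss(lm_loss: float, mtp_losses: list[float], scaling: float = 0.3) -> float:
    if not mtp_losses:                 # MTP disabled (mtp_num_layers=None or 0)
        return lm_loss
    return lm_loss + scaling * sum(mtp_losses) / len(mtp_losses)

print(combined_loss(2.10, [2.45]))     # 2.10 + 0.3 * 2.45 = 2.835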
Activation Recomputation#
For memory-constrained scenarios:
config = glm45_air_106b_pretrain_config(
name="glm45_with_recompute",
recompute_granularity="selective", # or "full"
recompute_method="uniform",
recompute_num_layers=2,
)
Expert Parallelism Tuning#
Adjust expert parallelism based on your cluster:
config = glm45_air_106b_pretrain_config(
name="glm45_custom_parallelism",
tensor_model_parallel_size=1,
pipeline_model_parallel_size=4,
expert_model_parallel_size=16, # Adjust based on GPU count
sequence_parallel=True,
)
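When adjusting expert_model_parallel_size, keep it a divisor of the expert count (128 for GLM 4.5 Air, 160 for the 355B model) so every expert-parallel rank owns a whole number of experts. A quick check (hypothetical helper, not part of the recipes API):
# Each expert-parallel rank must hold an integer number of experts
def experts_per_rank(num_experts: int, expert_model_parallel_size: int) -> int:
    assert num_experts % expert_model_parallel_size == 0, (
        f"EP={expert_model_parallel_size} does not divide {num_experts} experts"
    )
    return num_experts // expert_model_parallel_size

print(experts_per_rank(128, 16))   # GLM 4.5 Air with EP=16 -> 8 experts per rank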
Examples#
Checkpoint import/export: examples/conversion/convert_checkpoints.py
Generate text (HF→Megatron): examples/conversion/hf_to_megatron_generate_text.py
Hugging Face Model Cards#
GLM 4.5 355B: https://huggingface.co/zai-org/GLM-4.5
GLM 4.5 Air 106B: https://huggingface.co/zai-org/GLM-4.5-Air
Key Features for Production#
Efficient Scaling: MoE architecture enables 355B parameters with only 32B active per token
Single-Node PEFT: GLM 4.5 Air can be fine-tuned with LoRA/DoRA on just 8 GPUs
Long Context: Native 131K token context window with optimized RoPE
Multi-Token Prediction: Faster convergence through MTP training
Flexible Deployment: Multiple parallelism strategies for different hardware configurations
Load Balancing: Expert bias and auxiliary loss keep utilization balanced across the routed experts