GLM-4.5V#

GLM-4.5V is a powerful vision-language model built on the GLM-4.5 Air architecture. GLM-4.5V combines a 106B parameter sparse MoE language model with a vision encoder for robust multimodal understanding of images and videos.

GLM-4.5V supports multimodal tasks including image captioning, visual question answering, OCR, video understanding, and general vision-language reasoning. The model leverages Multi-Resolution Rotary Position Embedding (MRoPE) for enhanced spatial understanding.

GLM family models are supported via the Bridge system with auto-detected configuration and weight mapping.

Important

Please update transformers version to 4.57.1 or higher in order to use the GLM-4.5V model.

Available Models#

Vision-Language Models#

GLM-4.5V (zai-org/GLM-4.5V): 106B parameter vision-language model (based on GLM-4.5 Air)
- 46 decoder layers, 4096 hidden size
- 96 attention heads, 8 query groups (GQA)
- 128 MoE experts with shared experts
- ~12B active parameters per token
- Sequence length: 131,072 tokens
- Recommended: 32 nodes, 256 GPUs (LoRA/DoRA) or 64 nodes, 512 GPUs (Full SFT)

Model Architecture Features#

GLM-4.5V combines efficient sparse MoE language modeling with multimodal capabilities:

Language Model Features:

Sparse MoE Architecture: 128 routed experts with shared experts for efficient parameter usage
Grouped Query Attention (GQA): Memory-efficient attention with 8 query groups
SiLU Gated Linear Unit: Gated linear units with SiLU activation for improved performance
RMSNorm: Layer normalization without mean centering for faster computation
Multi-Resolution RoPE (MRoPE): Enhanced position embeddings with sections [8, 12, 12] for improved spatial understanding
Extended Context: Supports up to 131,072 tokens

Vision-Language Features:

Vision Encoder: Pre-trained vision encoder for robust visual understanding
Multimodal Integration: Seamless integration of visual and textual information
Image and Video Support: Handles both static images and video inputs
Flexible Image Handling: Supports variable resolution images and multiple images per conversation

Examples#

For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the GLM-4.5V Examples.

Hugging Face Model Cards#

GLM-4.5V: https://huggingface.co/zai-org/GLM-4.5V

GLM-4.5V#

Available Models#

Vision-Language Models#

Model Architecture Features#

Examples#

Hugging Face Model Cards#

Related Docs#