bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config#
TransformerConfig for the ERNIE 4.5 VL vision encoder (DFN-style ViT with 2D RoPE).
This config inherits from Megatron-Core’s TransformerConfig and adds
vision-specific fields (patch_size, spatial_merge_size, etc.). It is
constructed from the HF vision config via get_ernie_vision_config().
Module Contents#
Classes#
ErnieVisionTransformerConfig | TransformerConfig for ERNIE 4.5 VL vision encoder.
Functions#
_quick_gelu | Quick GELU activation: x * sigmoid(1.702 * x).
get_ernie_vision_config | Construct an ErnieVisionTransformerConfig from a HF vision config.
API#
- class bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config.ErnieVisionTransformerConfig#
Bases: megatron.core.transformer.transformer_config.TransformerConfig
TransformerConfig for ERNIE 4.5 VL vision encoder.
Extends Megatron-Core TransformerConfig with ERNIE vision-specific fields.
Architecture constants from HF DFNRopeVisionTransformerConfig: embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, patch_size=14, in_channels=3, spatial_merge_size=2, hidden_act="quick_gelu".
- patch_size: int#
14
Vision patch size (pixels per side).
- in_channels: int#
3
Number of input image channels.
- spatial_merge_size: int#
2
Spatial merge factor for the resampler (2x2 pooling).
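To make the effect of patch_size and spatial_merge_size concrete, the sketch below counts vision tokens for one image. It assumes the image height and width are multiples of patch_size * spatial_merge_size and that the merge step reduces the patch grid by spatial_merge_size along each axis; the helper name vision_token_counts is hypothetical and is not part of this module.

```python
def vision_token_counts(height, width, patch_size=14, spatial_merge_size=2):
    """Return (num_patches, num_merged_tokens) for a single image.

    Hypothetical helper for illustration only; defaults mirror the
    documented field defaults of ErnieVisionTransformerConfig.
    """
    unit = patch_size * spatial_merge_size
    if height % unit or width % unit:
        # Assumption: inputs are pre-resized so the grid divides evenly.
        raise ValueError(f"height and width must be multiples of {unit}")
    grid_h = height // patch_size
    grid_w = width // patch_size
    num_patches = grid_h * grid_w
    # A 2x2 merge folds four neighbouring patches into one token.
    num_merged = num_patches // (spatial_merge_size ** 2)
    return num_patches, num_merged


patches, merged = vision_token_counts(448, 448)
print(patches, merged)  # 1024 256
```

With the defaults, a 448x448 image gives a 32x32 patch grid, which the 2x2 merge reduces to 256 tokens.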
- bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config._quick_gelu(x)#
Quick GELU activation: x * sigmoid(1.702 * x).
This is the activation function used by the ERNIE 4.5 VL ViT (and OpenAI CLIP). It is a fast approximation of GELU but is NOT equivalent to F.gelu(x, approximate="tanh"), which uses a different formula:
0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
The two differ by up to ~2% per element.
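The gap between the two activations can be checked numerically. The following is a minimal pure-Python sketch that evaluates both formulas quoted above on a grid over [-6, 6]. It does not call the module's _quick_gelu or PyTorch; the function names quick_gelu and gelu_tanh are local stand-ins.

```python
import math


def quick_gelu(x):
    # x * sigmoid(1.702 * x), written with an explicit sigmoid.
    return x / (1.0 + math.exp(-1.702 * x))


def gelu_tanh(x):
    # Tanh approximation of GELU, as quoted in the docstring above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Evaluate on a grid from -6.00 to 6.00 in steps of 0.01.
xs = [i / 100.0 for i in range(-600, 601)]
max_abs_diff = max(abs(quick_gelu(x) - gelu_tanh(x)) for x in xs)
print(f"max |quick_gelu - gelu_tanh| on the grid: {max_abs_diff:.4f}")
```

On this grid the largest absolute gap is on the order of 0.02, so swapping one activation for the other when loading pretrained weights would change the outputs.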
- bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config.get_ernie_vision_config(hf_vision_config, megatron_config=None)#
Construct an ErnieVisionTransformerConfig from a HF vision config.
- Parameters:
hf_vision_config – HF DFNRopeVisionTransformerConfig or equivalent with fields: embed_dim, depth, num_heads, mlp_ratio, patch_size, in_channels, spatial_merge_size, hidden_act.
megatron_config – Optional language model TransformerConfig to copy recompute / CUDA-graph / TP settings from.
- Returns:
ErnieVisionTransformerConfig ready for ErnieVLVisionModel.
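As a rough illustration of what such a conversion might look like, the sketch below maps the HF fields listed above onto Megatron-style field names using plain dataclasses. The mapping (depth to num_layers, embed_dim to hidden_size, embed_dim * mlp_ratio to ffn_hidden_size) is an assumption, not the confirmed behaviour of get_ernie_vision_config(). VisionConfigSketch and sketch_vision_config are hypothetical stand-ins, and the optional megatron_config argument is not modelled.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class VisionConfigSketch:
    """Hypothetical stand-in for ErnieVisionTransformerConfig."""
    num_layers: int
    hidden_size: int
    num_attention_heads: int
    ffn_hidden_size: int
    patch_size: int = 14
    in_channels: int = 3
    spatial_merge_size: int = 2


def sketch_vision_config(hf_vision_config):
    # Assumed field mapping; the real function may differ.
    return VisionConfigSketch(
        num_layers=hf_vision_config.depth,
        hidden_size=hf_vision_config.embed_dim,
        num_attention_heads=hf_vision_config.num_heads,
        ffn_hidden_size=int(hf_vision_config.embed_dim * hf_vision_config.mlp_ratio),
        patch_size=hf_vision_config.patch_size,
        in_channels=hf_vision_config.in_channels,
        spatial_merge_size=hf_vision_config.spatial_merge_size,
    )


# Stand-in for an HF vision config carrying the documented constants.
hf_cfg = SimpleNamespace(
    embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4,
    patch_size=14, in_channels=3, spatial_merge_size=2,
    hidden_act="quick_gelu",
)
cfg = sketch_vision_config(hf_cfg)
head_dim = cfg.hidden_size // cfg.num_attention_heads
print(cfg.num_layers, cfg.hidden_size, cfg.ffn_hidden_size, head_dim)  # 32 1280 5120 80
```

Under these assumptions the documented constants give an MLP width of 5120 and a per-head dimension of 80.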