bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config#

TransformerConfig for the ERNIE 4.5 VL vision encoder (DFN-style ViT with 2D RoPE).

This config inherits from Megatron-Core’s TransformerConfig and adds vision-specific fields (patch_size, spatial_merge_size, etc.). It is constructed from the HF vision config via get_ernie_vision_config().

Module Contents#

Classes#

ErnieVisionTransformerConfig

TransformerConfig for ERNIE 4.5 VL vision encoder.

Functions#

_quick_gelu

Quick GELU activation: x * sigmoid(1.702 * x).

get_ernie_vision_config

Construct an ErnieVisionTransformerConfig from a HF vision config.

API#

class bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config.ErnieVisionTransformerConfig#

Bases: megatron.core.transformer.transformer_config.TransformerConfig

TransformerConfig for ERNIE 4.5 VL vision encoder.

Extends Megatron-Core TransformerConfig with ERNIE vision-specific fields.

Architecture constants from HF DFNRopeVisionTransformerConfig: embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, patch_size=14, in_channels=3, spatial_merge_size=2, hidden_act="quick_gelu".

patch_size: int#

14

Vision patch size (pixels per side).

in_channels: int#

3

Number of input image channels.

spatial_merge_size: int#

2

Spatial merge factor for the resampler (2x2 pooling).
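To make the interaction between patch_size and spatial_merge_size concrete, the following is a minimal sketch of the token arithmetic they imply. The helper name vision_token_counts is hypothetical and not part of this module, and the divisibility checks are an assumption about how inputs are prepared rather than documented behavior.

```python
def vision_token_counts(height, width, patch_size=14, spatial_merge_size=2):
    """Return (num_patches, num_merged_tokens) for a single image.

    Hypothetical helper for illustration only; it is not part of the module.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size
    if grid_h % spatial_merge_size or grid_w % spatial_merge_size:
        raise ValueError("patch grid must be divisible by spatial_merge_size")
    # One token per patch before merging.
    num_patches = grid_h * grid_w
    # Each merge_size x merge_size block of patches becomes one token.
    num_merged = num_patches // (spatial_merge_size ** 2)
    return num_patches, num_merged


print(vision_token_counts(224, 224))  # (256, 64)
```

With the defaults, a 224x224 image yields a 16x16 patch grid, which the 2x2 merge reduces by a factor of four.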

bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config._quick_gelu(x)#

Quick GELU activation: x * sigmoid(1.702 * x).

This is the activation function used by the ERNIE 4.5 VL ViT (and OpenAI CLIP). It is a fast approximation of GELU but is NOT equivalent to F.gelu(x, approximate="tanh"), which uses a different formula: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). The two differ by up to ~2% per element.
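The gap between the two formulas can be checked numerically. The sketch below reimplements both on plain Python floats using only the standard library. The function names quick_gelu and gelu_tanh are illustrative stand-ins, not the module's _quick_gelu (which presumably operates on tensors).

```python
import math


def quick_gelu(x: float) -> float:
    # x * sigmoid(1.702 * x), written with an explicit logistic function.
    return x / (1.0 + math.exp(-1.702 * x))


def gelu_tanh(x: float) -> float:
    # Tanh approximation of GELU, the formula quoted above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan [-6, 6] in steps of 0.01 and record the largest absolute gap.
xs = [i / 100.0 for i in range(-600, 601)]
max_abs_diff = max(abs(quick_gelu(x) - gelu_tanh(x)) for x in xs)
print(f"max |quick_gelu - gelu_tanh| on [-6, 6]: {max_abs_diff:.4f}")
```

The gap is small in absolute terms but nonzero, so swapping one activation for the other changes the outputs of a checkpoint trained with Quick GELU.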

bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config.get_ernie_vision_config(
hf_vision_config,
megatron_config=None,
) → bridge.models.ernie_vl.modeling_ernie45_vl.vision_transformer_config.ErnieVisionTransformerConfig#

Construct an ErnieVisionTransformerConfig from a HF vision config.

Parameters:
  • hf_vision_config – HF DFNRopeVisionTransformerConfig or equivalent with fields: embed_dim, depth, num_heads, mlp_ratio, patch_size, in_channels, spatial_merge_size, hidden_act.

  • megatron_config – Optional language-model TransformerConfig from which recompute, CUDA-graph, and tensor-parallel (TP) settings are copied.

Returns:

ErnieVisionTransformerConfig ready for ErnieVLVisionModel.
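As a rough illustration of what such a conversion might look like, the sketch below maps the HF fields listed above onto Megatron-style field names using a stand-in dataclass. Everything here is an assumption made for illustration: SketchVisionConfig and sketch_vision_config are hypothetical names, the field mapping (for example ffn_hidden_size = embed_dim * mlp_ratio) is a guess at the usual convention, and the real get_ernie_vision_config() may set additional fields or behave differently.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class SketchVisionConfig:
    """Hypothetical stand-in for ErnieVisionTransformerConfig."""
    num_layers: int
    hidden_size: int
    num_attention_heads: int
    ffn_hidden_size: int
    patch_size: int = 14
    in_channels: int = 3
    spatial_merge_size: int = 2


def sketch_vision_config(hf_vision_config):
    # Assumed mapping from HF field names to Megatron-style field names.
    return SketchVisionConfig(
        num_layers=hf_vision_config.depth,
        hidden_size=hf_vision_config.embed_dim,
        num_attention_heads=hf_vision_config.num_heads,
        ffn_hidden_size=int(hf_vision_config.embed_dim * hf_vision_config.mlp_ratio),
        patch_size=hf_vision_config.patch_size,
        in_channels=hf_vision_config.in_channels,
        spatial_merge_size=hf_vision_config.spatial_merge_size,
    )


# Stand-in for an HF vision config carrying the documented constants.
hf_cfg = SimpleNamespace(
    embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4,
    patch_size=14, in_channels=3, spatial_merge_size=2,
    hidden_act="quick_gelu",
)
cfg = sketch_vision_config(hf_cfg)
print(cfg.hidden_size, cfg.num_layers, cfg.ffn_hidden_size)  # 1280 32 5120
```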