`bridge.models.stepfun.step35_bridge`#

Module Contents#

Classes#

`StackedExpertAutoMapping`	Maps Megatron per-expert weight{i} ↔ HF stacked expert tensor[i].
`StackedExpertGatedMLPMapping`	GatedMLPMapping for per-expert Megatron weights backed by HF stacked tensors.
`_MTPDenseLayerSpecsList`	List of per-decoder-layer specs that returns a dense spec on negative-index access.
`Step35Bridge`	Megatron Bridge for Step3.5 Causal LM.

Functions#

`_mcore_supports_head_wise_attn_gate`
`_build_step35_layer_spec`	Per-layer spec for Step3.5: dense for layers 0-2 and 45-47, MoE for 3-44.

Data#

logger

API#

bridge.models.stepfun.step35_bridge.logger#: ‘getLogger(…)’

bridge.models.stepfun.step35_bridge._mcore_supports_head_wise_attn_gate() → bool#

class bridge.models.stepfun.step35_bridge.StackedExpertAutoMapping#

Bases: megatron.bridge.models.conversion.param_mapping.AutoMapping

Maps Megatron per-expert weight{i} ↔ HF stacked expert tensor[i].

Step3.5 HF stores all experts in a single stacked tensor, e.g. model.layers.*.moe.down_proj.weight with shape [num_experts, H, I]. Megatron creates individual per-expert tensors named weight0, weight1, …

The megatron_param pattern ends in a weight* wildcard that matches these names, while hf_param has one fewer wildcard because the HF path carries no expert index. During wildcard resolution, _resolve_names resets capture_index to 0 for the HF side, so hf_param consumes only the layer-index capture; the expert-index capture remains available for slicing the stacked tensor in hf_to_megatron.
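
As a rough illustration of the capture bookkeeping described above, the sketch below resolves wildcards with plain regular expressions. It is not the bridge's _resolve_names implementation: the helpers capture_wildcards and fill_wildcards and the Megatron-side parameter path are hypothetical, chosen only to show how a leftover capture can serve as the expert index.

```python
import re


def capture_wildcards(pattern: str, name: str) -> tuple:
    """Return the digits matched by each '*' in pattern (hypothetical helper)."""
    regex = "^" + re.escape(pattern).replace(r"\*", r"(\d+)") + "$"
    match = re.match(regex, name)
    if match is None:
        raise ValueError(f"{name!r} does not match {pattern!r}")
    return match.groups()


def fill_wildcards(pattern: str, captures: tuple):
    """Fill '*' left to right starting from capture 0; return the name and unused captures."""
    parts = pattern.split("*")
    n_wild = len(parts) - 1
    if n_wild > len(captures):
        raise ValueError("pattern has more wildcards than captures")
    resolved = parts[0]
    for i in range(n_wild):
        resolved += captures[i] + parts[i + 1]
    return resolved, captures[n_wild:]


# Hypothetical Megatron-side pattern; the HF path is the one quoted above.
megatron_pattern = "decoder.layers.*.mlp.experts.linear_fc2.weight*"
hf_pattern = "model.layers.*.moe.down_proj.weight"

captures = capture_wildcards(
    megatron_pattern, "decoder.layers.7.mlp.experts.linear_fc2.weight12"
)
hf_name, leftover = fill_wildcards(hf_pattern, captures)

# The HF pattern used only the layer capture; the remaining one is the expert index
# that would select a slice of the stacked [num_experts, H, I] tensor.
expert_idx = int(leftover[0])
```

Here the HF name resolves to a single stacked tensor per layer, and the unused capture tells the mapping which expert slice to take from it.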

is_grouped_export#: True

_expert_idx() → int#

hf_to_megatron( hf_weights: torch.Tensor, megatron_module, ) → torch.Tensor#

class bridge.models.stepfun.step35_bridge.StackedExpertGatedMLPMapping#

Bases: megatron.bridge.models.conversion.param_mapping.GatedMLPMapping

GatedMLPMapping for per-expert Megatron weights backed by HF stacked tensors.

HF stores all experts’ gate/up projections as stacked tensors with shape [num_experts, I, H]. Megatron creates individual per-expert linear_fc1.weight{i} tensors (shape [2*I, H], gate+up fused).

megatron_param ends in a weight* wildcard. The gate and up patterns each have one fewer wildcard because the HF path carries no expert index. During wildcard resolution, _resolve_names resets capture_index for every dict key, so gate and up each consume only the layer-index capture.
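
The shape relationship above can be sketched without any tensor library. The example below uses nested lists as stand-ins for tensors and a hypothetical helper, fuse_gate_up_for_expert; it assumes the fused layout is simply gate rows followed by up rows and ignores any tensor-parallel handling the real GatedMLPMapping may perform.

```python
def fuse_gate_up_for_expert(gate_stacked, up_stacked, expert_idx):
    """Pick one expert from each stacked [num_experts, I, H] list and fuse to [2*I, H].

    Hypothetical helper: concatenates gate rows and up rows along the first axis.
    """
    gate = gate_stacked[expert_idx]  # I rows, each of length H
    up = up_stacked[expert_idx]      # I rows, each of length H
    return gate + up                 # 2*I rows, each of length H


# Toy sizes: 2 experts, intermediate size I = 2, hidden size H = 3.
num_experts, inter, hidden = 2, 2, 3
gate_stacked = [
    [[e * 100 + r * 10 + c for c in range(hidden)] for r in range(inter)]
    for e in range(num_experts)
]
up_stacked = [
    [[1000 + e * 100 + r * 10 + c for c in range(hidden)] for r in range(inter)]
    for e in range(num_experts)
]

fused = fuse_gate_up_for_expert(gate_stacked, up_stacked, expert_idx=1)
```

With real tensors the same idea would be an index into the first dimension followed by a concatenation along dimension 0.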

is_grouped_export#: True

_expert_idx() → int#

hf_to_megatron( hf_weights: Dict[str, torch.Tensor], megatron_module, ) → torch.Tensor#

class bridge.models.stepfun.step35_bridge._MTPDenseLayerSpecsList(data, dense_mtp_spec)#

Bases: list

List of per-decoder-layer specs that returns a dense spec on negative-index access.

get_gpt_mtp_block_spec_for_backend reads spec.layer_specs[-1] to decide which layer type the MTP transformer sub-layers should use. For Step3.5 the last decoder layer (layer 44) is MoE, but MTP layers 45-47 are NOT in moe_layers_enum and must be dense.

Overriding __getitem__ for negative indices intercepts only that single lookup. Normal forward iteration (used by TransformerBlock to instantiate the 45 main decoder layers) is unaffected, because CPython's list iterator reads the internal C array directly and bypasses __getitem__.
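
A minimal sketch of this pattern is shown below. The class name DenseOnNegativeIndex and the string placeholders for specs are hypothetical; only the (data, dense_mtp_spec) constructor shape and the negative-index override follow the description above.

```python
class DenseOnNegativeIndex(list):
    """List whose negative-index lookups return a fixed dense spec (illustrative only)."""

    def __init__(self, data, dense_mtp_spec):
        super().__init__(data)
        self._dense_mtp_spec = dense_mtp_spec

    def __getitem__(self, idx):
        # Only plain negative integers are intercepted; slices and
        # non-negative indices fall through to the normal list behavior.
        if isinstance(idx, int) and idx < 0:
            return self._dense_mtp_spec
        return super().__getitem__(idx)


# 45 main decoder layers: dense for 0-2, MoE for 3-44 (strings stand in for ModuleSpec).
specs = DenseOnNegativeIndex(["dense"] * 3 + ["moe"] * 42, dense_mtp_spec="dense")

last_by_negative_index = specs[-1]   # intercepted by the override
last_by_positive_index = specs[44]   # normal lookup
iterated = [spec for spec in specs]  # list iterator does not call __getitem__
```

Iterating yields the 45 stored entries unchanged, while the specs[-1] lookup sees the dense placeholder.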

Initialization

Initialize self. See help(type(self)) for accurate signature.

__getitem__(idx)#

bridge.models.stepfun.step35_bridge._build_step35_layer_spec(cfg, **kw)#

Per-layer spec for Step3.5: dense for layers 0-2 and 45-47, MoE for 3-44.

Also rewrites every main-decoder layer’s ModuleSpec to use Step35DecoderLayer instead of the default TransformerLayer. The custom layer reads cfg.layer_types at init time to determine whether the layer is a sliding-attention layer.

Returns a TransformerBlockSubmodules whose layer_specs list is wrapped in _MTPDenseLayerSpecsList so that get_gpt_mtp_block_spec_for_backend receives a dense ModuleSpec (via layer_specs[-1]) for the MTP transformer sub-layers.
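
The dense/MoE layout stated above can be written down as a small, library-free sketch. The function step35_layer_kinds and its parameters are hypothetical, and the strings stand in for the real ModuleSpec objects.

```python
def step35_layer_kinds(num_main_layers=45, num_mtp_layers=3, moe_layers=range(3, 45)):
    """Return 'dense' or 'moe' for every layer index, main decoder first, then MTP.

    Hypothetical helper mirroring the documented layout:
    dense for layers 0-2 and 45-47, MoE for layers 3-44.
    """
    moe = set(moe_layers)
    total = num_main_layers + num_mtp_layers
    return ["moe" if i in moe else "dense" for i in range(total)]


kinds = step35_layer_kinds()
main_decoder_kinds = kinds[:45]  # layers 0-44
mtp_kinds = kinds[45:]           # layers 45-47
```

Because the last main decoder entry (layer 44) is MoE while the MTP entries are dense, a consumer that reads only the last main-decoder spec would pick the wrong layer type, which is the situation the wrapper list addresses.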

class bridge.models.stepfun.step35_bridge.Step35Bridge#

Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge

Megatron Bridge for Step3.5 Causal LM.

This bridge handles conversion between the HuggingFace Step3p5ForCausalLM format (the HF architecture name, preserved verbatim to match the upstream config.json) and the Megatron-Core GPTModel format. Step3.5 models use a mixture-of-experts architecture with QK layernorm.

Example:

    from megatron.bridge import AutoBridge
    bridge = AutoBridge.from_hf_pretrained("stepfun-ai/Step-3.5-Flash")
    provider = bridge.to_megatron_provider()

CONFIG_MAPPING#: None

provider_bridge( hf_pretrained: megatron.bridge.models.hf_pretrained.causal_lm.PreTrainedCausalLM, ) → megatron.bridge.models.gpt_provider.GPTModelProvider#

Convert HuggingFace Step3.5 config to GPTModelProvider.

Layered field-extraction strategy (mirrors the qwen3-vl bridge pattern):

1. Common architectural fields: super().provider_bridge internally calls hf_config_to_provider_kwargs, which uses hasattr + getattr(..., None) against CONFIG_MAPPING and silently skips fields the HF config does not carry. It also sets provider.position_embedding_type = "rope" (or "yarn") based on rope_scaling; without that, rotary_base_per_layer later collides with the dataclass default "learned_absolute".
2. Step-3.5-specific fields: applied afterwards with an explicit getattr(hf_config, name, default) for every field. This used to be 13 bare hf_config.X reads, which crashed when the bridge was reused via a wrapper against a Step3.7 text_config that dropped zero_centered. The getattr form makes Step35Bridge safe to reuse against any HF config schema that may be missing Step-3.5-specific top-level fields, and the per-field defaults are documented in the source so the call site does not silently fall back to wrong values.
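
The difference between a bare attribute read and the getattr form can be seen in a small sketch. The config object below is a hypothetical stand-in built with SimpleNamespace, and the default value False for zero_centered is illustrative only; it is not claimed to be the default the bridge uses.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a text_config that no longer carries zero_centered.
text_config = SimpleNamespace(hidden_size=4096)

# A bare read fails as soon as the field is absent.
try:
    text_config.zero_centered
    bare_read_failed = False
except AttributeError:
    bare_read_failed = True

# The getattr form falls back to an explicit, documented default instead.
# (False is an assumed default chosen for this illustration.)
zero_centered = getattr(text_config, "zero_centered", False)

# Fields that are present are read unchanged.
hidden_size = getattr(text_config, "hidden_size", None)
```

Spelling the default out at each call makes the fallback visible to a reviewer, at the cost of one argument per field.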

mapping_registry() → megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#
