bridge.models.stepfun.step35_bridge#

Module Contents#

Classes#

StackedExpertAutoMapping

Maps Megatron per-expert weight{i} ↔ HF stacked expert tensor[i].

StackedExpertGatedMLPMapping

GatedMLPMapping for per-expert Megatron weights backed by HF stacked tensors.

_MTPDenseLayerSpecsList

List of per-decoder-layer specs that returns a dense spec on negative-index access.

Step35Bridge

Megatron Bridge for Step3.5 Causal LM.

Functions#

_build_step35_layer_spec

Per-layer spec for Step3.5: dense for layers 0-2 and 45-47, MoE for 3-44.

Data#

logger

API#

bridge.models.stepfun.step35_bridge.logger#

‘getLogger(…)’

class bridge.models.stepfun.step35_bridge.StackedExpertAutoMapping#

Bases: megatron.bridge.models.conversion.param_mapping.AutoMapping

Maps Megatron per-expert weight{i} ↔ HF stacked expert tensor[i].

Step3.5 HF stores all experts in a single stacked tensor, e.g. model.layers.*.moe.down_proj.weight with shape [num_experts, H, I]. Megatron creates individual per-expert tensors named weight0, weight1, …

The megatron_param uses a trailing weight* wildcard to match these names; hf_param has one fewer wildcard (no expert index in the path). During wildcard resolution _resolve_names resets capture_index to 0 for the HF side, so hf_param only consumes the layer-index capture and the expert-index capture is available to slice the stacked tensor in hf_to_megatron.
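The slicing step described above can be illustrated with a minimal sketch. It is not the bridge's implementation: the helper names expert_index and slice_expert, the sample Megatron parameter name, and the nested-list stand-in for a torch.Tensor are all assumptions made for illustration.

```python
import re


def expert_index(megatron_name: str) -> int:
    """Hypothetical helper: read the expert index from a trailing weight{i}."""
    match = re.search(r"weight(\d+)$", megatron_name)
    if match is None:
        raise ValueError(f"no trailing expert index in {megatron_name!r}")
    return int(match.group(1))


def slice_expert(stacked, megatron_name: str):
    """Pick one expert's 2-D weight out of a stacked [num_experts, H, I] value."""
    return stacked[expert_index(megatron_name)]


# Nested lists stand in for a stacked tensor with num_experts=2, H=2, I=3.
stacked_down_proj = [
    [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]],
    [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]],
]

# Illustrative parameter name; the real Megatron path may differ.
name = "decoder.layers.3.mlp.experts.linear_fc2.weight1"
expert_weight = slice_expert(stacked_down_proj, name)
```

With a real torch.Tensor the same indexing expression, stacked[i], would return the [H, I] slice for expert i.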

is_grouped_export#

True

_expert_idx() → int#

hf_to_megatron(
hf_weights: torch.Tensor,
megatron_module,
) → torch.Tensor#

class bridge.models.stepfun.step35_bridge.StackedExpertGatedMLPMapping#

Bases: megatron.bridge.models.conversion.param_mapping.GatedMLPMapping

GatedMLPMapping for per-expert Megatron weights backed by HF stacked tensors.

HF stores all experts’ gate/up projections as stacked tensors with shape [num_experts, I, H]. Megatron creates individual per-expert linear_fc1.weight{i} tensors (shape [2*I, H], gate+up fused).

megatron_param uses a trailing weight* wildcard. gate / up each have one fewer wildcard (no expert index in the HF path). During wildcard resolution _resolve_names resets capture_index for every dict key, so both gate/up only consume the layer-index capture.
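As a rough sketch of the shape bookkeeping only, the following selects one expert from each stacked tensor and fuses gate and up into a single [2*I, H] weight. The function name fuse_gate_up is hypothetical, nested lists stand in for tensors, and placing gate before up is an assumption; the real mapping also has to deal with tensor-parallel distribution, which is omitted here.

```python
def fuse_gate_up(gate_stacked, up_stacked, expert_idx: int):
    """Hypothetical helper: build one expert's fused [2*I, H] weight."""
    gate = gate_stacked[expert_idx]  # [I, H]
    up = up_stacked[expert_idx]      # [I, H]
    if len(gate) != len(up):
        raise ValueError("gate and up must have the same number of rows")
    # Assumed layout: gate rows first, then up rows.
    return list(gate) + list(up)


# Stand-ins for stacked tensors with num_experts=2, I=2, H=3.
gate_stacked = [
    [[1, 1, 1], [2, 2, 2]],
    [[3, 3, 3], [4, 4, 4]],
]
up_stacked = [
    [[5, 5, 5], [6, 6, 6]],
    [[7, 7, 7], [8, 8, 8]],
]

fused = fuse_gate_up(gate_stacked, up_stacked, expert_idx=1)
```

With torch tensors, the equivalent of the list concatenation would be torch.cat along dimension 0.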

is_grouped_export#

True

_expert_idx() → int#

hf_to_megatron(
hf_weights: Dict[str, torch.Tensor],
megatron_module,
) → torch.Tensor#

class bridge.models.stepfun.step35_bridge._MTPDenseLayerSpecsList(data, dense_mtp_spec)#

Bases: list

List of per-decoder-layer specs that returns a dense spec on negative-index access.

get_gpt_mtp_block_spec_for_backend reads spec.layer_specs[-1] to decide which layer type the MTP transformer sub-layers should use. For Step3.5 the last decoder layer (layer 44) is MoE, but MTP layers 45-47 are NOT in moe_layers_enum and must be dense.

Overriding __getitem__ for negative indices intercepts only that single look-up and leaves normal forward iteration (used by TransformerBlock to instantiate the 45 main decoder layers) unaffected, because CPython's list iterator reads the internal C array directly and bypasses __getitem__.
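The pattern can be reproduced in isolation. The sketch below is a simplified stand-in, not the class itself: the name DenseOnNegativeIndex and the string placeholders for layer specs are assumptions, and only the list-subclass behaviour is being demonstrated.

```python
class DenseOnNegativeIndex(list):
    """List that returns a fixed item for any negative integer index."""

    def __init__(self, data, dense_spec):
        super().__init__(data)
        self._dense_spec = dense_spec

    def __getitem__(self, idx):
        # Only negative integer look-ups such as specs[-1] are intercepted.
        if isinstance(idx, int) and idx < 0:
            return self._dense_spec
        return super().__getitem__(idx)


# Strings stand in for ModuleSpec objects.
specs = DenseOnNegativeIndex(["dense", "moe", "moe"], dense_spec="dense_mtp")

last_by_negative_index = specs[-1]  # intercepted
last_by_positive_index = specs[2]   # normal look-up
iterated = [s for s in specs]       # CPython's list iterator skips __getitem__
```

Iteration and non-negative indexing still see the original last element, so only callers that read specs[-1] get the dense replacement.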

Initialization

Initialize self. See help(type(self)) for accurate signature.

__getitem__(idx)#

bridge.models.stepfun.step35_bridge._build_step35_layer_spec(cfg, **kw)#

Per-layer spec for Step3.5: dense for layers 0-2 and 45-47, MoE for 3-44.

Also rewrites every main-decoder layer’s ModuleSpec to use Step35DecoderLayer instead of the default TransformerLayer. The custom layer reads cfg.layer_types at init time to determine whether the layer is a sliding-attention layer.

Returns a TransformerBlockSubmodules whose layer_specs list is wrapped in _MTPDenseLayerSpecsList so that get_gpt_mtp_block_spec_for_backend receives a dense ModuleSpec (via layer_specs[-1]) for the MTP transformer sub-layers.
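The dense/MoE layout stated in the summary can be written down as a small sketch. The function name layer_kind and the constants are hypothetical; the real function builds ModuleSpec objects from cfg rather than strings.

```python
NUM_MAIN_LAYERS = 45   # main decoder layers 0-44
NUM_TOTAL_LAYERS = 48  # including layers 45-47


def layer_kind(layer_idx: int) -> str:
    """Hypothetical helper: 'dense' for layers 0-2 and 45-47, 'moe' for 3-44."""
    if not 0 <= layer_idx < NUM_TOTAL_LAYERS:
        raise IndexError(f"layer index out of range: {layer_idx}")
    return "moe" if 3 <= layer_idx <= 44 else "dense"


kinds = [layer_kind(i) for i in range(NUM_TOTAL_LAYERS)]
main_decoder_kinds = kinds[:NUM_MAIN_LAYERS]
```

Here main_decoder_kinds[-1] is "moe", which is the case the negative-index override on the layer_specs list is meant to handle.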

class bridge.models.stepfun.step35_bridge.Step35Bridge#

Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge

Megatron Bridge for Step3.5 Causal LM.

This bridge handles conversion between the HuggingFace Step3p5ForCausalLM format (the HF architecture name, preserved verbatim to match the upstream config.json) and the Megatron-Core GPTModel format. Step3.5 models use a mixture-of-experts architecture with QK layernorm.

Example

    from megatron.bridge import AutoBridge

    bridge = AutoBridge.from_hf_pretrained("stepfun-ai/Step-3.5-Flash")
    provider = bridge.to_megatron_provider()

CONFIG_MAPPING#

None

provider_bridge(
hf_pretrained: megatron.bridge.models.hf_pretrained.causal_lm.PreTrainedCausalLM,
) → megatron.bridge.models.gpt_provider.GPTModelProvider#

Convert HuggingFace Step3.5 config to GPTModelProvider.

mapping_registry() → megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#