bridge.models.stepfun.configuration_step35
Module Contents
Classes
Step35Config | Configuration for the Step-3.5-Flash (Step35) Mixture-of-Experts model.
API
class bridge.models.stepfun.configuration_step35.Step35Config(
    hidden_size: int = 4096,
    intermediate_size: int = 11264,
    num_attention_heads: int = 64,
    num_attention_groups: int = 8,
    num_hidden_layers: int = 45,
    max_seq_len: int = 128000,
    vocab_size: int = 128815,
    rms_norm_eps: float = 1e-05,
    moe_intermediate_size: int = 1280,
    moe_num_experts: int = 288,
    moe_top_k: int = 8,
    rope_theta: float = 10000,
    rope_scaling: Optional[dict[str, Any]] = None,
    max_position_embeddings: int = 128000,
    share_expert_dims: int = 1280,
    share_expert_dim: int | None = None,
    head_dim: int = 128,
    norm_expert_weight: bool = True,
    layer_types: list[str] | None = None,
    attention_other_setting: dict[str, Any] | None = None,
    use_head_wise_attn_gate: bool = True,
    sliding_window: Optional[int] = None,
    num_nextn_predict_layers: int = 3,
    moe_layers_enum: tuple[int] = (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44),
    **kwargs,
)
Bases: transformers.configuration_utils.PretrainedConfig

Configuration for the Step-3.5-Flash (Step35) Mixture-of-Experts model.

Step35 is a decoder-only causal language model with grouped-query attention, rotary positional embeddings, and a configurable subset of transformer layers replaced by MoE FFN blocks (the remaining layers stay dense). It mirrors the architecture published under stepfun-ai/Step-3.5-Flash on the Hugging Face Hub.

Parameters:
hidden_size – Dimensionality of the hidden states.
intermediate_size – Dimensionality of the dense FFN intermediate states (used by non-MoE layers).
num_attention_heads – Number of query heads.
num_attention_groups – Number of key/value head groups for GQA.
num_hidden_layers – Total number of transformer layers.
max_seq_len – Maximum sequence length supported by the model.
vocab_size – Size of the tokenizer vocabulary.
rms_norm_eps – Epsilon used by RMSNorm.
moe_intermediate_size – Per-expert FFN intermediate size inside MoE layers.
moe_num_experts – Number of routed experts per MoE layer.
moe_top_k – Number of experts each token is routed to.
rope_theta – Base period of the rotary embeddings.
rope_scaling – Optional RoPE scaling configuration dict.
max_position_embeddings – Maximum positions supported by RoPE.
share_expert_dims – Hidden size of the shared-expert branch that runs alongside the routed experts.
share_expert_dim – Singular alias used by the published Step-3.5-Flash config.
head_dim – Per-head attention dimension.
norm_expert_weight – Whether to normalize the top-k expert routing weights so they sum to 1.
layer_types – Optional per-layer type override (e.g. attention variant); None uses the default for every layer.
attention_other_setting – Sliding-attention shape override from the HF config.
use_head_wise_attn_gate – Whether Step-3.5 uses per-head g_proj gates fused into QKV.
sliding_window – Sliding-window size for windowed attention; None disables windowing.
num_nextn_predict_layers – Number of MTP layers appended after the main decoder.
moe_layers_enum – Indices of layers that use the MoE FFN. Layers not listed here use the dense FFN of intermediate_size.
**kwargs – Forwarded to transformers.PretrainedConfig.
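To make the interplay of these parameters concrete, the sketch below derives a few quantities from the documented defaults. It does not import the real class: `Step35Defaults`, `layer_ffn_kinds`, `queries_per_kv_group`, and `normalize_topk` are hypothetical names introduced only for illustration, and the routing-weight normalization is one plausible reading of `norm_expert_weight`, not the model's actual router code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step35Defaults:
    """Hypothetical stand-in holding defaults copied from the signature above."""

    num_attention_heads: int = 64
    num_attention_groups: int = 8
    num_hidden_layers: int = 45
    moe_top_k: int = 8
    # Same values as the documented default (3, 4, ..., 44).
    moe_layers_enum: tuple[int, ...] = tuple(range(3, 45))


def layer_ffn_kinds(cfg: Step35Defaults) -> list[str]:
    """Label each layer 'moe' if listed in moe_layers_enum, else 'dense'."""
    moe_layers = set(cfg.moe_layers_enum)
    return ["moe" if i in moe_layers else "dense" for i in range(cfg.num_hidden_layers)]


def queries_per_kv_group(cfg: Step35Defaults) -> int:
    """Number of query heads that share one key/value group under GQA."""
    if cfg.num_attention_heads % cfg.num_attention_groups != 0:
        raise ValueError("num_attention_heads must be divisible by num_attention_groups")
    return cfg.num_attention_heads // cfg.num_attention_groups


def normalize_topk(weights: list[float], top_k: int) -> dict[int, float]:
    """Keep the top_k largest routing weights and rescale them to sum to 1.

    Assumption: this is what norm_expert_weight=True amounts to.
    """
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in top)
    return {i: weights[i] / total for i in top}


cfg = Step35Defaults()
kinds = layer_ffn_kinds(cfg)
print(kinds.count("dense"), kinds.count("moe"))  # 3 42
print(queries_per_kv_group(cfg))  # 8
```

With the defaults, layers 0 to 2 stay dense and the remaining 42 layers use the MoE FFN, while each key/value group serves 8 query heads.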
Note: model_type and architectures deliberately keep the step3p5 / Step3p5ForCausalLM spelling so this config stays compatible with the config.json shipped on stepfun-ai/Step-3.5-Flash. Only the Python class name uses the Step35 spelling internally.

Initialization
model_type
'step3p5'
architectures
['Step3p5ForCausalLM']
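As a small illustration of why these two spellings matter, the sketch below checks a config.json-style dictionary for the documented identifiers. The JSON string is a hypothetical two-key excerpt (a real config.json carries many more fields), and `is_step35_checkpoint` is a helper name invented for this example.

```python
import json

# Hypothetical excerpt: only these two keys are taken from the attributes above.
CONFIG_JSON = '{"model_type": "step3p5", "architectures": ["Step3p5ForCausalLM"]}'


def is_step35_checkpoint(cfg: dict) -> bool:
    """Return True if the dict carries the step3p5 identifiers documented above."""
    return (
        cfg.get("model_type") == "step3p5"
        and "Step3p5ForCausalLM" in cfg.get("architectures", [])
    )


print(is_step35_checkpoint(json.loads(CONFIG_JSON)))  # True
```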