bridge.models.bailing.modeling_bailing_moe_v2#

PyTorch BailingMoE model.

Module Contents#

Classes#

MoEV2CausalLMOutputWithPast

Base class for causal language model (or autoregressive) outputs, extended with the Mixture-of-Experts router terms needed to train a MoE model.

MoeV2ModelOutputWithPast

BailingMoeV2RMSNorm

BailingMoeV2RotaryEmbedding

BailingMoeV2MLP

BailingMoeV2Gate

BailingMoeV2SparseMoeBlock

A mixed expert module containing shared experts.

BailingMoeV2Attention

Multi-headed attention from the ‘Attention Is All You Need’ paper

BailingMoeV2FlashAttention2

BailingMoeV2 flash attention module. This module inherits from BailingMoeV2Attention, as the weights of the module stay untouched. The only required change is in the forward pass, where it needs to correctly call the public API of flash attention and deal with padding tokens in case the input contains any.

BailingMoeV2SdpaAttention

BailingMoeV2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from BailingMoeV2Attention, as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to the SDPA API.

BailingMoeV2MTPLayer

BailingMoeV2DecoderLayer

BailingMoeV2PreTrainedModel

BailingMoeV2Model

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a [BailingMoeV2DecoderLayer]

BailingMoeV2ForCausalLM

Functions#

roll_tensor

Roll the tensor along the given dimension(s). Elements inserted by the shift are set to fill_value (0 by default).

_get_unpad_data

_expand_mask

_make_causal_mask

_default_rope_init_fn

Fallback RoPE initialization for models without a specific scaling type.

rotate_half

Rotates half the hidden dims of the input.

apply_rotary_pos_emb

Applies Rotary Position Embedding to the query and key tensors.

repeat_kv

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)

Data#

API#

bridge.models.bailing.modeling_bailing_moe_v2.logger#

‘get_logger(…)’

bridge.models.bailing.modeling_bailing_moe_v2._CONFIG_FOR_DOC#

‘BailingMoeV2Config’

bridge.models.bailing.modeling_bailing_moe_v2.roll_tensor(tensor, shifts=-1, dims=-1, fill_value=0)#

Roll the tensor along the given dimension(s). Elements inserted by the shift are set to fill_value (0 by default).
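
For illustration, a minimal pure-Python sketch of the roll-and-fill semantics on a 1-D sequence (`roll_fill` is an illustrative name, not the module's implementation, which operates on torch tensors; unlike `torch.roll`, positions vacated by the shift are filled rather than wrapped around):

```python
def roll_fill(seq, shift=-1, fill_value=0):
    """Roll a 1-D sequence; positions vacated by the shift get fill_value."""
    n = len(seq)
    out = [fill_value] * n
    for i, v in enumerate(seq):
        j = i + shift
        if 0 <= j < n:  # elements shifted outside the window are dropped
            out[j] = v
    return out

# shift=-1 moves every element one slot left; the last slot is filled with 0
print(roll_fill([1, 2, 3, 4]))  # [2, 3, 4, 0]
```

A shift of -1 with zero fill is the typical pattern for aligning next-token labels in multi-token-prediction training.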

class bridge.models.bailing.modeling_bailing_moe_v2.MoEV2CausalLMOutputWithPast#

Bases: transformers.utils.ModelOutput

Base class for causal language model (or autoregressive) outputs, extended with the Mixture-of-Experts router terms needed to train a MoE model.

Parameters:
  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) –

    It is a [~cache_utils.Cache] instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) –

    Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • z_loss (torch.FloatTensor, optional, returned when labels is provided) – z_loss for the sparse modules.

  • aux_loss (torch.FloatTensor, optional, returned when labels is provided) – aux_loss for the sparse modules.

  • router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) –

    Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

Router logits of the model, useful to compute the auxiliary loss and the z_loss for the sparse modules.

loss: Optional[torch.FloatTensor]#

None

logits: Optional[torch.FloatTensor]#

None

past_key_values: Optional[transformers.cache_utils.Cache]#

None

hidden_states: Optional[tuple[torch.FloatTensor, ...]]#

None

attentions: Optional[tuple[torch.FloatTensor, ...]]#

None

z_loss: Optional[torch.FloatTensor]#

None

aux_loss: Optional[torch.FloatTensor]#

None

router_logits: Optional[tuple[torch.FloatTensor]]#

None

mtp_loss: Optional[torch.FloatTensor]#

None

mtp_logits: Optional[tuple[torch.FloatTensor, ...]]#

None

class bridge.models.bailing.modeling_bailing_moe_v2.MoeV2ModelOutputWithPast(mtp_hidden_states=None, **kwargs)#

Bases: transformers.modeling_outputs.MoeModelOutputWithPast

Initialization

bridge.models.bailing.modeling_bailing_moe_v2._get_unpad_data(attention_mask)#
bridge.models.bailing.modeling_bailing_moe_v2._expand_mask(
mask: torch.Tensor,
dtype: torch.dtype,
tgt_len: Optional[int] = None,
)#
bridge.models.bailing.modeling_bailing_moe_v2._make_causal_mask(
input_ids_shape: torch.Size,
dtype: torch.dtype,
device: torch.device,
past_key_values_length: int = 0,
)#
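
Conceptually, a causal mask of this kind is additive: 0 where a query position may attend, and a very large negative value where it may not (implementations typically use the dtype minimum; this sketch uses -inf), with past_key_values_length fully visible key columns prepended. A hedged pure-Python sketch (`make_causal_mask` here is illustrative, not the module's implementation):

```python
def make_causal_mask(tgt_len, past_len=0, neg=float("-inf")):
    """Additive causal mask: 0 where attention is allowed, neg where it is not.
    Row i (a query position) may attend to the past_len cached positions and
    to key positions j <= i within the current block."""
    mask = []
    for i in range(tgt_len):
        row = [0.0] * past_len + [0.0 if j <= i else neg for j in range(tgt_len)]
        mask.append(row)
    return mask

# 3 query positions with 1 cached position: lower-triangular plus one visible column
for row in make_causal_mask(3, past_len=1):
    print(row)
```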
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2RMSNorm(hidden_size, eps=1e-06)#

Bases: torch.nn.Module

Initialization

BailingMoeV2RMSNorm is equivalent to T5LayerNorm
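
As a sketch of the T5-style RMSNorm formula this class implements, with no mean subtraction and no bias, unlike LayerNorm (the pure-Python `rms_norm` below is illustrative; the actual module operates on torch tensors with a learned weight parameter):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """T5LayerNorm-style RMSNorm: divide by the root-mean-square of the
    activations, then scale; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

# rms of [3, 4] is sqrt(12.5) ~= 3.536, so the output is roughly [0.849, 1.131]
print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```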

forward(hidden_states)#
bridge.models.bailing.modeling_bailing_moe_v2._default_rope_init_fn(config, device=None)#

Fallback RoPE initialization for models without a specific scaling type.

Provides standard (non-scaled) RoPE initialization when ‘default’ is not present in ROPE_INIT_FUNCTIONS (removed in transformers >= 4.x).

class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2RotaryEmbedding(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
device=None,
)#

Bases: torch.nn.Module

Initialization

forward(x, position_ids)#
bridge.models.bailing.modeling_bailing_moe_v2.rotate_half(x)#

Rotates half the hidden dims of the input.

bridge.models.bailing.modeling_bailing_moe_v2.apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1)#

Applies Rotary Position Embedding to the query and key tensors.

Parameters:
  • q (torch.Tensor) – The query tensor.

  • k (torch.Tensor) – The key tensor.

  • cos (torch.Tensor) – The cosine part of the rotary embedding.

  • sin (torch.Tensor) – The sine part of the rotary embedding.

  • unsqueeze_dim (int, optional, defaults to 1) – The ‘unsqueeze_dim’ argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

Returns:

tuple(torch.Tensor) comprising the query and key tensors rotated using the Rotary Position Embedding.
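
The two functions compose as q_embed = q * cos + rotate_half(q) * sin (and likewise for k). A pure-Python sketch on a single 2-dim vector, where a quarter-turn angle rotates [1, 0] to approximately [0, 1] (illustrative names and scalar math; the real functions operate on batched torch tensors with broadcasting):

```python
import math

def rotate_half(x):
    """(x1, x2) -> (-x2, x1) over the two halves of the last dimension."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def apply_rope(q, cos, sin):
    """Elementwise q*cos + rotate_half(q)*sin; the same formula applies to k."""
    r = rotate_half(q)
    return [qi * c + ri * s for qi, c, ri, s in zip(q, cos, r, sin)]

theta = math.pi / 2  # one frequency, quarter-turn position angle
cos = [math.cos(theta)] * 2
sin = [math.sin(theta)] * 2
print(apply_rope([1.0, 0.0], cos, sin))  # approximately [0.0, 1.0]
```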

class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2MLP(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
intermediate_size: int,
)#

Bases: torch.nn.Module

Initialization

forward(x)#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Gate(config)#

Bases: torch.nn.Module

Initialization

reset_parameters() None#
group_limited_topk(scores: torch.Tensor)#
forward(hidden_states)#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2SparseMoeBlock(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
)#

Bases: torch.nn.Module

A mixed expert module containing shared experts.

Initialization

_setup_experts()#
forward(hidden_states)#
moe_infer(x, topk_ids, topk_weight)#
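
As a rough, generic illustration of top-k expert routing with a shared expert (scalar “experts” for brevity; this sketch uses plain softmax top-k with renormalized weights and does not reproduce the grouped, limited top-k selection of BailingMoeV2Gate):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, experts, shared_expert, router_logits, top_k=2):
    """Route a token to its top_k experts, combine their outputs with the
    renormalized router probabilities, and add the shared-expert output."""
    probs = softmax(router_logits)
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in topk)  # renormalize over the selected experts
    out = shared_expert(x)              # shared experts always run
    for i in topk:
        out += (probs[i] / norm) * experts[i](x)
    return out

# Toy scalar experts standing in for per-expert MLPs
experts = [lambda x: 2 * x, lambda x: 3 * x, lambda x: 10 * x]
print(moe_forward(1.0, experts, lambda x: 0.5 * x, [0.0, 1.0, 2.0], top_k=2))
```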
bridge.models.bailing.modeling_bailing_moe_v2.repeat_kv(hidden_states: torch.Tensor, n_rep: int) torch.Tensor#

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
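
A pure-Python sketch of the repeat pattern over the head dimension (strings stand in for per-head tensors; consecutive repetition of each KV head mirrors torch.repeat_interleave along dim=1, expanding num_key_value_heads to num_attention_heads for grouped-query attention):

```python
def repeat_kv_sketch(heads, n_rep):
    """Repeat each KV head n_rep times consecutively, so that
    num_key_value_heads * n_rep == num_attention_heads."""
    out = []
    for h in heads:
        out.extend([h] * n_rep)
    return out

print(repeat_kv_sketch(["kv0", "kv1"], 2))  # ['kv0', 'kv0', 'kv1', 'kv1']
```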

class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Attention(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
layer_idx: Optional[int] = None,
)#

Bases: torch.nn.Module

Multi-headed attention from the ‘Attention Is All You Need’ paper

Initialization

_shape(tensor: torch.Tensor, seq_len: int, bsz: int)#
forward(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[transformers.cache_utils.Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs,
) Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2FlashAttention2(*args, **kwargs)#

Bases: bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Attention

BailingMoeV2 flash attention module. This module inherits from BailingMoeV2Attention, as the weights of the module stay untouched. The only required change is in the forward pass, where it needs to correctly call the public API of flash attention and deal with padding tokens in case the input contains any.

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[transformers.cache_utils.Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs,
) Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]#
_flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None,
)#

Calls the forward method of Flash Attention: if the input hidden states contain at least one padding token, it first unpads the input, then computes the attention scores, and finally re-pads the final attention scores.

Parameters:
  • query_states (torch.Tensor) – Input query states to be passed to Flash Attention API

  • key_states (torch.Tensor) – Input key states to be passed to Flash Attention API

  • value_states (torch.Tensor) – Input value states to be passed to Flash Attention API

  • attention_mask (torch.Tensor) – The padding mask - corresponds to a tensor of size (batch_size, seq_len) where 0 stands for the position of padding tokens and 1 for the position of non-padding tokens.

  • dropout (float, optional) – Attention dropout probability

  • softmax_scale (float, optional) – The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim)

  • query_length (int) – The length of the query sequence in terms of tokens. This represents the number of tokens in the query_states tensor along the sequence dimension. It is used to determine the effective sequence length for attention computations.

_upad_input(
query_layer,
key_layer,
value_layer,
attention_mask,
query_length,
)#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2SdpaAttention(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
layer_idx: Optional[int] = None,
)#

Bases: bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Attention

BailingMoeV2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from BailingMoeV2Attention, as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to the SDPA API.

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[transformers.cache_utils.Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs,
) Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]#
bridge.models.bailing.modeling_bailing_moe_v2.ATTENTION_CLASSES#

None

class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2MTPLayer(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
layer_idx: int,
)#

Bases: torch.nn.Module

Initialization

forward(
input_embeds,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
output_router_logits: Optional[bool] = False,
use_cache: Optional[bool] = False,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs,
) Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2DecoderLayer(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
layer_idx: int,
)#

Bases: torch.nn.Module

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
output_router_logits: Optional[bool] = False,
use_cache: Optional[bool] = False,
position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs,
) Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]#
Parameters:
  • hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (torch.FloatTensor, optional) – attention mask of size (batch_size, sequence_length) if flash attention is used or (batch_size, 1, query_sequence_length, key_sequence_length) if default attention is used.

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].

  • past_key_value (Tuple(torch.FloatTensor), optional) – cached past key and value projection states

  • output_attentions (bool, optional) – Whether to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_router_logits (bool, optional) – Whether or not to return the logits of all the routers. They are useful for computing the router loss, and should not be returned during inference.

  • use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

bridge.models.bailing.modeling_bailing_moe_v2.BAILINGMOEV2_START_DOCSTRING = <Multiline-String>#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2PreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

config_class#

None

base_model_prefix#

‘model’

supports_gradient_checkpointing#

True

_no_split_modules#

[‘BailingMoeV2DecoderLayer’]

_skip_keys_device_placement#

‘past_key_values’

_supports_flash_attn_2#

True

_supports_sdpa#

True

_supports_cache_class#

True

_init_weights(module)#
bridge.models.bailing.modeling_bailing_moe_v2.BAILINGMOEV2_INPUTS_DOCSTRING = <Multiline-String>#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Model(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
)#

Bases: bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2PreTrainedModel

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a [BailingMoeV2DecoderLayer]

Parameters:

config – BailingMoeV2Config

Initialization

get_input_embeddings()#
set_input_embeddings(value)#
forward(
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
output_router_logits: Optional[bool] = None,
return_dict: Optional[bool] = None,
**kwargs,
) Union[Tuple, bridge.models.bailing.modeling_bailing_moe_v2.MoeV2ModelOutputWithPast]#
class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2ForCausalLM(
config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
)#

Bases: bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2PreTrainedModel, transformers.generation.utils.GenerationMixin

Initialization

_tied_weights_keys#

None

get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
set_decoder(decoder)#
get_decoder()#
forward(
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
output_router_logits: Optional[bool] = None,
return_dict: Optional[bool] = None,
**kwargs,
) Union[Tuple, bridge.models.bailing.modeling_bailing_moe_v2.MoEV2CausalLMOutputWithPast]#
Parameters:

labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

Returns:

Example:

>>> from transformers import AutoTokenizer
>>> from megatron.bridge.models.bailing.modeling_bailing_moe_v2 import BailingMoeV2ForCausalLM

>>> model = BailingMoeV2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."