bridge.models.bailing.modeling_bailing_moe_v2#
PyTorch BailingMoE model.
Module Contents#
Classes#
MoEV2CausalLMOutputWithPast – Base class for causal language model (or autoregressive) outputs, including the Mixture of Experts' router hidden-state terms used to train a MoE model.
BailingMoeV2SparseMoeBlock – A mixed expert module containing shared experts.
BailingMoeV2Attention – Multi-headed attention from the 'Attention Is All You Need' paper.
BailingMoeV2FlashAttention2 – BailingMoeV2 flash attention module; inherits from BailingMoeV2Attention.
BailingMoeV2SdpaAttention – BailingMoeV2 attention module using torch.nn.functional.scaled_dot_product_attention; inherits from BailingMoeV2Attention.
BailingMoeV2Model – Transformer decoder consisting of config.num_hidden_layers layers; each layer is a BailingMoeV2DecoderLayer.
Functions#
roll_tensor – Roll the tensor input along the given dimension(s). Inserted elements are set to 0.0.
_default_rope_init_fn – Fallback RoPE initialization for models without a specific scaling type.
rotate_half – Rotates half the hidden dims of the input.
apply_rotary_pos_emb – Applies Rotary Position Embedding to the query and key tensors.
repeat_kv – Equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep); the hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).
Data#
API#
- bridge.models.bailing.modeling_bailing_moe_v2.logger#
‘get_logger(…)’
- bridge.models.bailing.modeling_bailing_moe_v2._CONFIG_FOR_DOC#
‘BailingMoeV2Config’
- bridge.models.bailing.modeling_bailing_moe_v2.roll_tensor(tensor, shifts=-1, dims=-1, fill_value=0)#
Roll the tensor input along the given dimension(s). Inserted elements are set to be 0.0.
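The behavior can be sketched with torch.roll followed by a fill of the wrapped-around slots. This is an illustrative re-implementation for scalar shifts/dims, not the module's actual code:

```python
import torch


def roll_tensor_sketch(tensor, shifts=-1, dims=-1, fill_value=0):
    """Roll `tensor` along `dims`, setting the wrapped-around elements to `fill_value`."""
    rolled = torch.roll(tensor, shifts=shifts, dims=dims)
    idx = [slice(None)] * tensor.dim()
    if shifts < 0:
        # Elements that wrapped to the end of the dimension are overwritten.
        idx[dims] = slice(shifts, None)
    else:
        # Elements that wrapped to the front of the dimension are overwritten.
        idx[dims] = slice(None, shifts)
    rolled[tuple(idx)] = fill_value
    return rolled


x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
roll_tensor_sketch(x)  # each row shifts left by one; the tail is filled with 0
```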
- class bridge.models.bailing.modeling_bailing_moe_v2.MoEV2CausalLMOutputWithPast#
Bases:
transformers.utils.ModelOutput

Base class for causal language model (or autoregressive) outputs, including the Mixture of Experts' router hidden-state terms used to train a MoE model.
- Parameters:
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) – A ~cache_utils.Cache instance; for more details, see the kv cache guide. Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer, plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
z_loss (torch.FloatTensor, optional, returned when labels is provided) – z_loss for the sparse modules.
aux_loss (torch.FloatTensor, optional, returned when labels is provided) – aux_loss for the sparse modules.
router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the model, useful for computing the auxiliary loss and the z_loss for the sparse modules.
- loss: Optional[torch.FloatTensor]#
None
- logits: Optional[torch.FloatTensor]#
None
- past_key_values: Optional[transformers.cache_utils.Cache]#
None
- hidden_states: Optional[tuple[torch.FloatTensor, ...]]#
None
- attentions: Optional[tuple[torch.FloatTensor, ...]]#
None
- z_loss: Optional[torch.FloatTensor]#
None
- aux_loss: Optional[torch.FloatTensor]#
None
- router_logits: Optional[tuple[torch.FloatTensor]]#
None
- mtp_loss: Optional[torch.FloatTensor]#
None
- mtp_logits: Optional[tuple[torch.FloatTensor, ...]]#
None
- class bridge.models.bailing.modeling_bailing_moe_v2.MoeV2ModelOutputWithPast(mtp_hidden_states=None, **kwargs)#
Bases:
transformers.modeling_outputs.MoeModelOutputWithPast

Initialization
- bridge.models.bailing.modeling_bailing_moe_v2._get_unpad_data(attention_mask)#
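_get_unpad_data carries no docstring here. In HF-style flash-attention modules, this helper usually extracts the varlen metadata FlashAttention needs from the padding mask; a sketch of that common pattern (assumed, not taken from this file):

```python
import torch
import torch.nn.functional as F


def get_unpad_data_sketch(attention_mask: torch.Tensor):
    # Token count per sequence, flat indices of real (non-padding) tokens,
    # and cumulative sequence lengths in FlashAttention's varlen format.
    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = int(seqlens_in_batch.max())
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
    return indices, cu_seqlens, max_seqlen_in_batch
```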
- bridge.models.bailing.modeling_bailing_moe_v2._expand_mask(
- mask: torch.Tensor,
- dtype: torch.dtype,
- tgt_len: Optional[int] = None,
)#
- bridge.models.bailing.modeling_bailing_moe_v2._make_causal_mask(
- input_ids_shape: torch.Size,
- dtype: torch.dtype,
- device: torch.device,
- past_key_values_length: int = 0,
)#
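As a sketch, the usual Hugging Face implementation of this helper builds an additive lower-triangular mask (0 where attention is allowed, dtype-min elsewhere), optionally prepended with zeros for cached positions; the actual code here may differ:

```python
import torch


def make_causal_mask_sketch(input_ids_shape, dtype, device, past_key_values_length=0):
    # Additive causal mask of shape (bsz, 1, tgt_len, tgt_len + past_len).
    bsz, tgt_len = input_ids_shape
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
    mask_cond = torch.arange(mask.size(-1), device=device)
    # Zero out the lower triangle (positions each query token may attend to).
    mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
    mask = mask.to(dtype)
    if past_key_values_length > 0:
        # Cached positions are always visible, so prepend zeros.
        past = torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device)
        mask = torch.cat([past, mask], dim=-1)
    return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
```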
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2RMSNorm(hidden_size, eps=1e-06)#
Bases:
torch.nn.Module

Initialization
BailingMoeV2RMSNorm is equivalent to T5LayerNorm
- forward(hidden_states)#
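T5LayerNorm normalizes by the root mean square only: no mean subtraction and no bias. A minimal equivalent sketch (illustrative, not this module's exact code):

```python
import torch
from torch import nn


class RMSNormSketch(nn.Module):
    """T5-style RMS layer norm: scale by the root-mean-square, learnable weight, no bias."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # Compute the statistics in float32 for stability, then cast back.
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return (self.weight * hidden_states).to(input_dtype)
```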
- bridge.models.bailing.modeling_bailing_moe_v2._default_rope_init_fn(config, device=None)#
Fallback RoPE initialization for models without a specific scaling type.
Provides standard (non-scaled) RoPE initialization when 'default' is not present in ROPE_INIT_FUNCTIONS (removed in transformers >= 4.x).
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2RotaryEmbedding(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
- device=None,
)#
Bases:
torch.nn.Module

Initialization
- forward(x, position_ids)#
- bridge.models.bailing.modeling_bailing_moe_v2.rotate_half(x)#
Rotates half the hidden dims of the input.
- bridge.models.bailing.modeling_bailing_moe_v2.apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1)#
Applies Rotary Position Embedding to the query and key tensors.
- Parameters:
q (torch.Tensor) – The query tensor.
k (torch.Tensor) – The key tensor.
cos (torch.Tensor) – The cosine part of the rotary embedding.
sin (torch.Tensor) – The sine part of the rotary embedding.
unsqueeze_dim (int, optional, defaults to 1) – Specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcast to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], set unsqueeze_dim=2.
- Returns:
tuple(torch.Tensor) comprising the query and key tensors rotated using the Rotary Position Embedding.
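Together with rotate_half above, the rotation follows the standard Hugging Face recipe; a self-contained sketch (illustrative, not this file's exact code):

```python
import torch


def rotate_half(x):
    # Swap the two halves of the last dimension, negating the second half.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # Unsqueeze cos/sin over the heads dimension so they broadcast to q and k.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

With cos = 1 and sin = 0 (position zero), the rotation is the identity, which is a quick sanity check on the broadcasting.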
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2MLP(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
- intermediate_size: int,
)#
Bases:
torch.nn.Module

Initialization
- forward(x)#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Gate(config)#
Bases:
torch.nn.Module

Initialization
- reset_parameters() → None#
- group_limited_topk(scores: torch.Tensor)#
- forward(hidden_states)#
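group_limited_topk restricts expert selection to the best-scoring expert groups before taking the final top-k, in the DeepSeek-V2/V3 style. A simplified illustration (the function name suffix, parameter names, and shapes are assumptions, not this module's exact signature):

```python
import torch


def group_limited_topk_sketch(scores, top_k=4, n_group=4, topk_group=2):
    # scores: (num_tokens, num_experts). Experts are split into n_group groups;
    # only the topk_group best-scoring groups per token stay eligible.
    num_tokens, num_experts = scores.shape
    group_scores = scores.view(num_tokens, n_group, -1).max(dim=-1).values
    group_idx = torch.topk(group_scores, k=topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1)
    # Broadcast the group mask back to per-expert granularity.
    score_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, n_group, num_experts // n_group)
        .reshape(num_tokens, num_experts)
    )
    masked_scores = scores.masked_fill(score_mask == 0, float("-inf"))
    topk_weight, topk_idx = torch.topk(masked_scores, k=top_k, dim=-1)
    return topk_weight, topk_idx
```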
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2SparseMoeBlock(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
)#
Bases:
torch.nn.Module

A mixed expert module containing shared experts.
Initialization
- _setup_experts()#
- forward(hidden_states)#
- moe_infer(x, topk_ids, topk_weight)#
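The overall dataflow of a sparse MoE block with shared experts can be illustrated with a toy module. This is a naive loop-over-experts version; the real block uses the gate above and a batched moe_infer path:

```python
import torch
from torch import nn


class TinySparseMoE(nn.Module):
    """Illustrative: top-k routed experts plus an always-on shared expert."""

    def __init__(self, hidden, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))
        self.shared_expert = nn.Linear(hidden, hidden)
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, hidden)
        weights = self.gate(x).softmax(dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(x)
        # Dispatch each token to its selected experts and mix by gate weight.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_w[mask, slot, None] * expert(x[mask])
        # Shared experts process every token unconditionally.
        return out + self.shared_expert(x)
```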
- bridge.models.bailing.modeling_bailing_moe_v2.repeat_kv(hidden_states: torch.Tensor, n_rep: int) → torch.Tensor#
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
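A sketch of the standard implementation, which uses expand + reshape instead of repeat_interleave to avoid an extra copy:

```python
import torch


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand each KV head n_rep times (a view, no copy), then flatten the head dims.
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_key_value_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
```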
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Attention(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
- layer_idx: Optional[int] = None,
)#
Bases:
torch.nn.Module

Multi-headed attention from the 'Attention Is All You Need' paper.
Initialization
- _shape(tensor: torch.Tensor, seq_len: int, bsz: int)#
- forward(
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[transformers.cache_utils.Cache] = None,
- output_attentions: bool = False,
- use_cache: bool = False,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
- **kwargs,
)#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2FlashAttention2(*args, **kwargs)#
Bases:
bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Attention

BailingMoeV2 flash attention module. This module inherits from BailingMoeV2Attention, as the weights of the module stay untouched. The only required change is in the forward pass, which must correctly call the public flash attention API and handle padding tokens if the input contains any.

Initialization
- forward(
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.LongTensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[transformers.cache_utils.Cache] = None,
- output_attentions: bool = False,
- use_cache: bool = False,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
- **kwargs,
)#
- _flash_attention_forward(
- query_states,
- key_states,
- value_states,
- attention_mask,
- query_length,
- dropout=0.0,
- softmax_scale=None,
)#
Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token first unpad the input, then computes the attention scores and pad the final attention scores.
- Parameters:
query_states (torch.Tensor) – Input query states to be passed to the Flash Attention API.
key_states (torch.Tensor) – Input key states to be passed to the Flash Attention API.
value_states (torch.Tensor) – Input value states to be passed to the Flash Attention API.
attention_mask (torch.Tensor) – The padding mask: a tensor of size (batch_size, seq_len) where 0 stands for the position of padding tokens and 1 for the position of non-padding tokens.
query_length (int) – The length of the query sequence in tokens, i.e. the number of tokens in the query_states tensor along the sequence dimension. It is used to determine the effective sequence length for attention computations.
dropout (float, optional) – Attention dropout probability.
softmax_scale (float, optional) – The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim).
- _upad_input(
- query_layer,
- key_layer,
- value_layer,
- attention_mask,
- query_length,
)#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2SdpaAttention(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
- layer_idx: Optional[int] = None,
)#
Bases:
bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Attention

BailingMoeV2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from BailingMoeV2Attention, as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to the SDPA API.

Initialization
- forward(
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[transformers.cache_utils.Cache] = None,
- output_attentions: bool = False,
- use_cache: bool = False,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
- **kwargs,
)#
- bridge.models.bailing.modeling_bailing_moe_v2.ATTENTION_CLASSES#
None
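The value is not rendered here. In HF-style modeling files, this constant typically maps the config's _attn_implementation string to the attention class; an assumed pattern with stub classes for illustration:

```python
# Stub classes standing in for the real modules, for illustration only.
class BailingMoeV2Attention: ...
class BailingMoeV2FlashAttention2(BailingMoeV2Attention): ...
class BailingMoeV2SdpaAttention(BailingMoeV2Attention): ...


ATTENTION_CLASSES = {
    "eager": BailingMoeV2Attention,
    "flash_attention_2": BailingMoeV2FlashAttention2,
    "sdpa": BailingMoeV2SdpaAttention,
}

attn_cls = ATTENTION_CLASSES["sdpa"]  # selection usually driven by config._attn_implementation
```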
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2MTPLayer(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
- layer_idx: int,
)#
Bases:
torch.nn.Module

Initialization
- forward(
- input_embeds,
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[Tuple[torch.Tensor]] = None,
- output_attentions: Optional[bool] = False,
- output_router_logits: Optional[bool] = False,
- use_cache: Optional[bool] = False,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
- **kwargs,
)#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2DecoderLayer(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
- layer_idx: int,
)#
Bases:
torch.nn.Module

Initialization
- forward(
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[Tuple[torch.Tensor]] = None,
- output_attentions: Optional[bool] = False,
- output_router_logits: Optional[bool] = False,
- use_cache: Optional[bool] = False,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
- **kwargs,
)#
- Parameters:
hidden_states (torch.FloatTensor) – Input to the layer, of shape (batch, seq_len, embed_dim).
attention_mask (torch.FloatTensor, optional) – Attention mask of size (batch_size, sequence_length) if flash attention is used, or (batch_size, 1, query_sequence_length, key_sequence_length) if default attention is used.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_value (Tuple(torch.FloatTensor), optional) – Cached past key and value projection states.
output_attentions (bool, optional) – Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_router_logits (bool, optional) – Whether or not to return the logits of all the routers. They are useful for computing the router loss, and should not be returned during inference.
use_cache (bool, optional) – If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- bridge.models.bailing.modeling_bailing_moe_v2.BAILINGMOEV2_START_DOCSTRING = <Multiline-String>#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2PreTrainedModel#
Bases:
transformers.modeling_utils.PreTrainedModel

- config_class#
None
- base_model_prefix#
‘model’
- supports_gradient_checkpointing#
True
- _no_split_modules#
[‘BailingMoeV2DecoderLayer’]
- _skip_keys_device_placement#
‘past_key_values’
- _supports_flash_attn_2#
True
- _supports_sdpa#
True
- _supports_cache_class#
True
- _init_weights(module)#
- bridge.models.bailing.modeling_bailing_moe_v2.BAILINGMOEV2_INPUTS_DOCSTRING = <Multiline-String>#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2Model(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
)#
Bases:
bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2PreTrainedModel

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a BailingMoeV2DecoderLayer.

- Parameters:
config – BailingMoeV2Config
Initialization
- get_input_embeddings()#
- set_input_embeddings(value)#
- forward(
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- output_router_logits: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- **kwargs,
)#
- class bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2ForCausalLM(
- config: megatron.bridge.models.bailing.configuration_bailing_moe_v2.BailingMoeV2Config,
)#
Bases:
bridge.models.bailing.modeling_bailing_moe_v2.BailingMoeV2PreTrainedModel, transformers.generation.utils.GenerationMixin

Initialization
- _tied_weights_keys#
None
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_output_embeddings()#
- set_output_embeddings(new_embeddings)#
- set_decoder(decoder)#
- get_decoder()#
- forward(
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- output_router_logits: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- **kwargs,
)#
- Parameters:
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
Returns:
Example:
>>> from transformers import AutoTokenizer
>>> from bridge.models.bailing.modeling_bailing_moe_v2 import BailingMoeV2ForCausalLM

>>> model = BailingMoeV2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."