bridge.models.llama.llama_provider#
Module Contents#
Classes#
| Class | Description |
|---|---|
| LlamaModelProvider | Configuration class for Llama models. |
| Llama2ModelProvider7B | Configuration for a 7B parameter Llama 2 model. |
| Llama2ModelProvider13B | Configuration for a 13B parameter Llama 2 model. |
| Llama2ModelProvider70B | Configuration for a 70B parameter Llama 2 model. |
| Llama3ModelProvider | Configuration for Llama 3 models. |
| Llama31ModelProvider | Configuration for Llama 3.1 models. |
| Llama3ModelProvider8B | Configuration for an 8B parameter Llama 3 model. |
| Llama3ModelProvider70B | Configuration for a 70B parameter Llama 3 model. |
| Llama31ModelProvider8B | Configuration for an 8B parameter Llama 3.1 model. |
| Llama31ModelProvider70B | Configuration for a 70B parameter Llama 3.1 model. |
| Llama31ModelProvider405B | Configuration for a 405B parameter Llama 3.1 model. |
| Llama32ModelProvider1B | Configuration for a 1B parameter Llama 3.2 model. |
| Llama32ModelProvider3B | Configuration for a 3B parameter Llama 3.2 model. |
| CodeLlamaModelProvider7B | Configuration for a 7B parameter CodeLlama model. |
| CodeLlamaModelProvider13B | Configuration for a 13B parameter CodeLlama model. |
| CodeLlamaModelProvider34B | Configuration for a 34B parameter CodeLlama model. |
| CodeLlamaModelProvider70B | Configuration for a 70B parameter CodeLlama model. |
| Llama4ModelProvider | Configuration for the Llama 4 language model. |
| Llama4Experts16ModelProvider | Configuration for the Llama 4 16-expert model. |
| Llama4Experts128ModelProvider | Configuration for the Llama 4 128-expert model. |
Functions#
| Function | Description |
|---|---|
| apply_rope_scaling | Apply RoPE scaling for extending context length in Llama models. |
Data#
API#
- bridge.models.llama.llama_provider.logger#
‘getLogger(…)’
- class bridge.models.llama.llama_provider.LlamaModelProvider#
Bases:
megatron.bridge.models.gpt_provider.GPTModelProvider
Configuration class for Llama models.
Extends GPTModelProvider with settings optimized for Llama architectures. Includes configurations for normalization, activation functions, and various architecture-specific options.
- normalization: str#
‘RMSNorm’
- activation_func: Callable#
None
- gated_linear_unit: bool#
True
- position_embedding_type: str#
‘rope’
- add_bias_linear: bool#
False
- seq_length: int#
4096
- attention_dropout: float#
0.0
- hidden_dropout: float#
0.0
- share_embeddings_and_output_weights: bool#
False
- bias_activation_fusion: bool#
True
- masked_softmax_fusion: bool#
‘field(…)’
- bias_dropout_fusion: bool#
‘field(…)’
- apply_rope_fusion: bool#
‘field(…)’
- use_transformer_engine_op_fuser: Optional[bool]#
None
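As a usage illustration, here is a minimal sketch, assuming the provider is a dataclass whose unset fields fall back to the GPTModelProvider defaults; the override values shown are arbitrary and not a shipped configuration:

```python
# Minimal sketch: construct a Llama provider and override a few architecture
# fields. Field names follow the attributes documented above; the sizes used
# here are illustrative only.
from megatron.bridge.models.llama.llama_provider import LlamaModelProvider

provider = LlamaModelProvider(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    seq_length=4096,
)

# Llama-specific defaults from this class remain in effect.
print(provider.normalization)            # 'RMSNorm'
print(provider.position_embedding_type)  # 'rope'
print(provider.add_bias_linear)          # False
```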
- class bridge.models.llama.llama_provider.Llama2ModelProvider7B#
Bases:
bridge.models.llama.llama_provider.LlamaModelProvider
Configuration for a 7B parameter Llama 2 model.
Specific configuration for the 7B Llama 2 model with 32 layers, 4096 hidden size, and 32 attention heads.
- num_layers: int#
32
- hidden_size: int#
4096
- num_attention_heads: int#
32
- num_query_groups: int#
32
- ffn_hidden_size: int#
11008
- class bridge.models.llama.llama_provider.Llama2ModelProvider13B#
Bases:
bridge.models.llama.llama_provider.LlamaModelProvider
Configuration for a 13B parameter Llama 2 model.
Specific configuration for the 13B Llama 2 model with 40 layers, 5120 hidden size, and 40 attention heads.
- num_layers: int#
40
- hidden_size: int#
5120
- num_attention_heads: int#
40
- num_query_groups: int#
40
- ffn_hidden_size: int#
13824
- class bridge.models.llama.llama_provider.Llama2ModelProvider70B#
Bases:
bridge.models.llama.llama_provider.LlamaModelProvider
Configuration for a 70B parameter Llama 2 model.
Specific configuration for the 70B Llama 2 model with 80 layers, 8192 hidden size, and 64 attention heads with 8 query groups.
- num_layers: int#
80
- hidden_size: int#
8192
- num_attention_heads: int#
64
- num_query_groups: int#
8
- ffn_hidden_size: int#
28672
- class bridge.models.llama.llama_provider.Llama3ModelProvider#
Bases:
bridge.models.llama.llama_provider.LlamaModelProvider
Configuration for Llama 3 models.
Base configuration for Llama 3 architecture with common settings across different model sizes, including group query attention (GQA) and architecture-specific settings.
- num_query_groups: int#
8
- hidden_dropout: float#
0.0
- attention_dropout: float#
0.0
- normalization: str#
‘RMSNorm’
- init_method_std: float#
0.01
- layernorm_epsilon: float#
1e-05
- add_bias_linear: bool#
False
- activation_func: Callable#
None
- gated_linear_unit: bool#
True
- bias_activation_fusion: bool#
True
- masked_softmax_fusion: bool#
‘field(…)’
- bias_dropout_fusion: bool#
‘field(…)’
- apply_rope_fusion: bool#
‘field(…)’
- share_embeddings_and_output_weights: bool#
False
- position_embedding_type: str#
‘rope’
- rotary_percent: float#
1.0
- class bridge.models.llama.llama_provider.Llama31ModelProvider#
Bases:
bridge.models.llama.llama_provider.Llama3ModelProvider
Configuration for Llama 3.1 models.
Extends Llama3ModelProvider with specific settings for Llama 3.1 models, including RoPE scaling parameters.
- scale_factor: float#
8.0
- low_freq_factor: float#
1.0
- high_freq_factor: float#
4.0
- old_context_len: int#
8192
- init_method_std: float#
0.02
- provide(pre_process=None, post_process=None, vp_stage=None, tokenizer=None)#
Configure and instantiate a Megatron Core Llama 3.1 model.
Extends the base configuration with Llama 3.1 specific RoPE scaling.
- Parameters:
pre_process – Whether to include pre-processing in the model
post_process – Whether to include post-processing in the model
vp_stage – Virtual pipeline stage
tokenizer – Tokenizer used with the model
- Returns:
Configured Megatron Core GPT model instance
- Return type:
MCoreGPTModel
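A hedged sketch of calling provide() through one of the concrete Llama 3.1 providers documented below (Llama31ModelProvider8B); an initialized Megatron-Core model-parallel environment is assumed to already be in place:

```python
# Illustrative only: builds the 8B Llama 3.1 configuration and asks it to
# instantiate a Megatron Core GPT model with the RoPE scaling described above.
from megatron.bridge.models.llama.llama_provider import Llama31ModelProvider8B

provider = Llama31ModelProvider8B()

# pre_process / post_process control whether the embedding and output layers
# are built on this pipeline stage; both are enabled for a single-stage run.
model = provider.provide(pre_process=True, post_process=True)
print(type(model).__name__)
```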
- class bridge.models.llama.llama_provider.Llama3ModelProvider8B#
Bases:
bridge.models.llama.llama_provider.Llama3ModelProvider
Configuration for an 8B parameter Llama 3 model.
Specific configuration for the 8B Llama 3 model with 32 layers, 4096 hidden size, and 32 attention heads.
- rotary_base: int#
500000
- seq_length: int#
8192
- num_layers: int#
32
- hidden_size: int#
4096
- ffn_hidden_size: int#
14336
- num_attention_heads: int#
32
- class bridge.models.llama.llama_provider.Llama3ModelProvider70B#
Bases:
bridge.models.llama.llama_provider.Llama3ModelProvider
Configuration for a 70B parameter Llama 3 model.
Specific configuration for the 70B Llama 3 model with 80 layers, 8192 hidden size, and 64 attention heads.
- rotary_base: int#
500000
- seq_length: int#
8192
- num_layers: int#
80
- hidden_size: int#
8192
- ffn_hidden_size: int#
28672
- num_attention_heads: int#
64
- init_method_std: float#
0.008944
- make_vocab_size_divisible_by: int#
128
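The grouped query attention (GQA) layout inherited from Llama3ModelProvider shows up clearly here: 64 attention heads are served by 8 query groups, so each key/value head is shared by 8 query heads. A minimal sketch, assuming the class instantiates with the defaults listed above:

```python
# Illustrative only: relate the GQA fields documented above.
from megatron.bridge.models.llama.llama_provider import Llama3ModelProvider70B

cfg = Llama3ModelProvider70B()
queries_per_kv_head = cfg.num_attention_heads // cfg.num_query_groups
print(cfg.num_attention_heads, cfg.num_query_groups, queries_per_kv_head)  # 64 8 8
```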
- class bridge.models.llama.llama_provider.Llama31ModelProvider8B#
Bases:
bridge.models.llama.llama_provider.Llama31ModelProvider
Configuration for an 8B parameter Llama 3.1 model.
Specific configuration for the 8B Llama 3.1 model with 32 layers, 4096 hidden size, and 32 attention heads, supporting a longer context length of 131K tokens.
- rotary_base: int#
500000
- seq_length: int#
131072
- num_layers: int#
32
- hidden_size: int#
4096
- ffn_hidden_size: int#
14336
- num_attention_heads: int#
32
- class bridge.models.llama.llama_provider.Llama31ModelProvider70B#
Bases:
bridge.models.llama.llama_provider.Llama31ModelProvider
Configuration for a 70B parameter Llama 3.1 model.
Specific configuration for the 70B Llama 3.1 model with 80 layers, 8192 hidden size, and 64 attention heads, supporting a longer context length of 131K tokens.
- rotary_base: int#
500000
- seq_length: int#
131072
- num_layers: int#
80
- hidden_size: int#
8192
- ffn_hidden_size: int#
28672
- num_attention_heads: int#
64
- make_vocab_size_divisible_by: int#
128
- class bridge.models.llama.llama_provider.Llama31ModelProvider405B#
Bases:
bridge.models.llama.llama_provider.Llama31ModelProvider
Configuration for a 405B parameter Llama 3.1 model.
Specific configuration for the 405B Llama 3.1 model with 126 layers, 16384 hidden size, and 128 attention heads, supporting a longer context length of 131K tokens.
- rotary_base: int#
500000
- seq_length: int#
131072
- num_layers: int#
126
- hidden_size: int#
16384
- ffn_hidden_size: int#
53248
- num_attention_heads: int#
128
- make_vocab_size_divisible_by: int#
128
- class bridge.models.llama.llama_provider.Llama32ModelProvider1B#
Bases:
bridge.models.llama.llama_provider.Llama31ModelProvider
Configuration for a 1B parameter Llama 3.2 model.
Specific configuration for the 1B Llama 3.2 model with 16 layers, 2048 hidden size, and 32 attention heads (8 query groups).
- scale_factor: float#
32.0
- share_embeddings_and_output_weights: bool#
True
- rotary_base: int#
500000
- num_layers: int#
16
- hidden_size: int#
2048
- ffn_hidden_size: int#
8192
- num_attention_heads: int#
32
- num_query_groups: int#
8
- make_vocab_size_divisible_by: int#
128
- class bridge.models.llama.llama_provider.Llama32ModelProvider3B#
Bases:
bridge.models.llama.llama_provider.Llama31ModelProvider
Configuration for a 3B parameter Llama 3.2 model.
Specific configuration for the 3B Llama 3.2 model with 28 layers, 3072 hidden size, and 24 attention heads (8 query groups).
- scale_factor: int#
32
- share_embeddings_and_output_weights: bool#
True
- rotary_base: int#
500000
- num_layers: int#
28
- hidden_size: int#
3072
- ffn_hidden_size: int#
8192
- num_attention_heads: int#
24
- num_query_groups: int#
8
- make_vocab_size_divisible_by: int#
128
- class bridge.models.llama.llama_provider.CodeLlamaModelProvider7B#
Bases:
bridge.models.llama.llama_provider.Llama2ModelProvider7B
Configuration for a 7B parameter CodeLlama model.
Extends Llama2ModelProvider7B with modified settings specifically for code generation, including longer context length and different rotary base.
- rotary_base: int#
1000000
- seq_length: int#
16384
- class bridge.models.llama.llama_provider.CodeLlamaModelProvider13B#
Bases:
bridge.models.llama.llama_provider.Llama2ModelProvider13B
Configuration for a 13B parameter CodeLlama model.
Extends Llama2ModelProvider13B with modified settings specifically for code generation, including longer context length and different rotary base.
- rotary_base: int#
1000000
- seq_length: int#
16384
- class bridge.models.llama.llama_provider.CodeLlamaModelProvider34B#
Bases:
bridge.models.llama.llama_provider.LlamaModelProvider
Configuration for a 34B parameter CodeLlama model.
Specific configuration for the 34B CodeLlama model with 48 layers, 8192 hidden size, and 64 attention heads (8 query groups).
- num_layers: int#
48
- hidden_size: int#
8192
- num_attention_heads: int#
64
- num_query_groups: int#
8
- ffn_hidden_size: int#
22016
- rotary_base: int#
1000000
- seq_length: int#
16384
- class bridge.models.llama.llama_provider.CodeLlamaModelProvider70B#
Bases:
bridge.models.llama.llama_provider.Llama2ModelProvider70B
Configuration for a 70B parameter CodeLlama model.
Extends Llama2ModelProvider70B with settings specifically for code generation.
- class bridge.models.llama.llama_provider.Llama4ModelProvider#
Bases:
bridge.models.llama.llama_provider.Llama3ModelProvider
Configuration for the Llama 4 language model.
- rotary_base: int#
500000
- seq_length: int#
8192
- num_layers: int#
48
- hidden_size: int#
5120
- ffn_hidden_size: int#
16384
- num_attention_heads: int#
40
- vocab_size: int#
None
- add_bias_linear: bool#
False
- gated_linear_unit: bool#
True
- rotary_interleaved: bool#
True
- apply_rope_fusion: bool#
False
- nope_layer_interval: int#
4
- transformer_layer_spec: Union[megatron.core.transformer.ModuleSpec, Callable[[bridge.models.llama.llama_provider.LlamaModelProvider], megatron.core.transformer.ModuleSpec]]#
‘field(…)’
- moe_grouped_gemm: bool#
True
- moe_ffn_hidden_size: int#
8192
- moe_shared_expert_intermediate_size: int#
8192
- moe_router_topk: int#
1
- moe_router_pre_softmax: bool#
False
- moe_router_score_function: str#
‘sigmoid’
- moe_token_dispatcher_type: str#
‘alltoall’
- moe_router_dtype: Optional[str]#
None
- moe_apply_probs_on_input: bool#
True
- moe_shared_expert_overlap: bool#
True
- moe_permute_fusion: bool#
False
- qk_l2_norm: bool#
True
- rope_scaling: bool#
True
- rope_scaling_factor: float#
8.0
- attention_chunk_size: int#
8192
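A rough sketch of how these MoE fields surface on the concrete Llama 4 variants documented below; field names follow the attributes listed above and the printed values correspond to the defaults shown:

```python
# Illustrative only: inspect a few MoE- and RoPE-related defaults of the
# 16-expert Llama 4 provider defined below.
from megatron.bridge.models.llama.llama_provider import Llama4Experts16ModelProvider

cfg = Llama4Experts16ModelProvider()
print(cfg.num_moe_experts)            # 16
print(cfg.moe_router_topk)            # 1 -> a single expert is routed per token
print(cfg.moe_router_score_function)  # 'sigmoid'
print(cfg.nope_layer_interval)        # 4 -> every fourth layer skips RoPE
print(cfg.attention_chunk_size)       # 8192
```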
- class bridge.models.llama.llama_provider.Llama4Experts16ModelProvider#
Bases:
bridge.models.llama.llama_provider.Llama4ModelProvider
Configuration for the Llama 4 16-expert model.
- num_moe_experts: int#
16
- rope_scaling: bool#
True
- rope_scaling_factor: float#
8.0
- qk_l2_norm: bool#
True
- class bridge.models.llama.llama_provider.Llama4Experts128ModelProvider#
Bases:
bridge.models.llama.llama_provider.Llama4ModelProvider
Configuration for the Llama 4 128-expert model.
- num_moe_experts: int#
128
- rope_scaling: bool#
False
- moe_layer_freq: Union[int, List[int]]#
‘field(…)’
- qk_l2_norm: bool#
False
- bridge.models.llama.llama_provider.apply_rope_scaling(inv_freq, factor: float = 8.0, low_freq_factor: float = 1.0, high_freq_factor: float = 4.0, old_context_len: int = 8192)#
Apply RoPE scaling for extending context length in Llama models.
This implements the NTK-aware RoPE scaling method used in Llama 3.1 models to extend context length beyond the original training length.
- Parameters:
inv_freq – Original inverse frequency tensor
factor – Scaling factor for context length extension
low_freq_factor – Factor for low frequency components
high_freq_factor – Factor for high frequency components
old_context_len – Original context length
- Returns:
Modified inverse frequency tensor for extended context
- Return type:
torch.Tensor
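A minimal usage sketch follows; the inverse-frequency construction assumes a rotary dimension of 128 and the 500000 rotary base used by the Llama 3.x providers above, and the keyword arguments match the signature defaults:

```python
import torch

from megatron.bridge.models.llama.llama_provider import apply_rope_scaling

# Standard RoPE inverse frequencies for a rotary dimension of 128.
dim = 128
rotary_base = 500000
inv_freq = 1.0 / (rotary_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Rescale the low-frequency components so attention remains usable beyond the
# original 8192-token training context (Llama 3.1 defaults shown explicitly).
scaled_inv_freq = apply_rope_scaling(
    inv_freq,
    factor=8.0,
    low_freq_factor=1.0,
    high_freq_factor=4.0,
    old_context_len=8192,
)
print(scaled_inv_freq.shape)  # torch.Size([64])
```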