core.extensions.kitchen#
Module Contents#
Classes#
| Class | Description |
|---|---|
| KitchenConfigType | Configuration object types in config dictionary |
| QFlashAttentionParamsConfigSchema | Dataclass to parse values from config dict of 'QFlashAttentionParams' type |
| QAttentionParamsConfigSchema | Dataclass to parse values from config dict of 'QAttentionParams' type |
| QLinearParamsConfigSchema | Dataclass to parse values from config dict of 'QLinearParams' type |
| CompoundParamsConfigSchema | Dataclass to parse values from config dict of 'CompoundParams' type |
| KitchenQuantizationParams | Quantization parameters used for kitchen extensions |
| KitchenLinear | Wrapper for Kitchen's Linear layer |
| KitchenColumnParallelLinear | Wrapper for Kitchen's Linear layer, specialized like Megatron's ColumnParallelLinear |
| KitchenRowParallelLinear | Wrapper for Kitchen's Linear layer, specialized like Megatron's RowParallelLinear |
| KitchenGroupedLinear | Wrapper for Kitchen's GroupedLinear layer |
| KitchenColumnParallelGroupedLinear | Wrapper for Kitchen's GroupedLinear layer, specialized to column-parallel style |
| KitchenRowParallelGroupedLinear | Wrapper for Kitchen's GroupedLinear layer, specialized to row-parallel style |
| KitchenLayerNormColumnParallelLinear | Wrapper for Kitchen's LayerNormLinear layer that combines layernorm and linear layers |
| KitchenFlashAttention | Flash Attention implementation for Kitchen |
| KitchenDotProductAttention | Region where selective activation recomputation is applied (see Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198) |
| KitchenSpecProvider | A protocol for providing the submodules used in Spec building |
Functions#
Data#
API#
- core.extensions.kitchen.logger#
‘getLogger(…)’
- core.extensions.kitchen._KITCHEN_CONFIG_TYPE_KEY#
‘kitchen_config_type’
- class core.extensions.kitchen.KitchenConfigType(*args, **kwds)#
Bases: enum.Enum
Configuration object types in config dictionary
Initialization
- QLINEAR_PARAMS#
‘QLinearParams’
- QATTENTION_PARAMS#
‘QAttentionParams’
- QFLASHATTENTION_PARAMS#
‘QFlashAttentionParams’
- COMPOUND_PARAMS#
‘CompoundParams’
- class core.extensions.kitchen.QFlashAttentionParamsConfigSchema#
Dataclass to parse values from config dict of ‘QFlashAttentionParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- recipe_name: str#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format: {"kitchen_config_type": "QFlashAttentionParams", "recipe_name": <recipe_name>}
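A minimal usage sketch of the classmethod above; the recipe name string is a hypothetical placeholder, not a recipe shipped with the library.

```python
from core.extensions.kitchen import QFlashAttentionParamsConfigSchema

# Hypothetical config dict in the documented format; "my_fa_recipe" is a
# placeholder recipe name.
config_dict = {
    "kitchen_config_type": "QFlashAttentionParams",
    "recipe_name": "my_fa_recipe",
}
schema = QFlashAttentionParamsConfigSchema.parse_config_dict(config_dict)
qfa_params = schema.to_kitchen_qfa()  # nvidia_kitchen.fa_params.QFlashAttentionParams
```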
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- to_kitchen_qfa() nvidia_kitchen.fa_params.QFlashAttentionParams#
Converts to kitchen library’s QFlashAttentionParams object.
- class core.extensions.kitchen.QAttentionParamsConfigSchema#
Dataclass to parse values from config dict of ‘QAttentionParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- recipe_idx: int#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format: {"kitchen_config_type": "QAttentionParams", "recipe_idx": <recipe_idx>}
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- to_kitchen_qattention() nvidia_kitchen.attention.QAttentionParams#
Converts to kitchen library’s QAttentionParams object.
- class core.extensions.kitchen.QLinearParamsConfigSchema#
Dataclass to parse values from config dict of ‘QLinearParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- recipe_idx: int#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format: {"kitchen_config_type": "QLinearParams", "recipe_idx": <recipe_idx>}
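A minimal parsing sketch using only the documented classmethods; the recipe index is an arbitrary illustrative value.

```python
from core.extensions.kitchen import QLinearParamsConfigSchema

# recipe_idx 0 is a placeholder index into Kitchen's recipe table.
config_dict = {"kitchen_config_type": "QLinearParams", "recipe_idx": 0}
schema = QLinearParamsConfigSchema.parse_config_dict(config_dict)
qlinear_params = schema.to_kitchen_qlinear()  # nvidia_kitchen.config.QLinearParams
```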
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- to_kitchen_qlinear() nvidia_kitchen.config.QLinearParams#
Converts to kitchen library’s QLinearParams object.
- class core.extensions.kitchen.CompoundParamsConfigSchema#
Dataclass to parse values from config dict of ‘CompoundParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- configs: Dict[Any, Any]#
None
- q_linear_params: Optional[core.extensions.kitchen.QLinearParamsConfigSchema]#
None
- q_attention_params: Optional[core.extensions.kitchen.QAttentionParamsConfigSchema]#
None
- q_fa_params: Optional[core.extensions.kitchen.QFlashAttentionParamsConfigSchema]#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format:
{"kitchen_config_type": "CompoundParams",
 "configs": [
   {"kitchen_config_type": "QLinearParams", "recipe_idx": <recipe_idx>},
   {"kitchen_config_type": "QAttentionParams", "recipe_idx": <recipe_idx>},
 ]}
or
{"kitchen_config_type": "CompoundParams",
 "configs": [
   {"kitchen_config_type": "QLinearParams", "recipe_idx": <recipe_idx>},
   {"kitchen_config_type": "QFlashAttentionParams", "recipe_name": <recipe_name>},
 ]}
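An illustrative sketch of parsing a compound config that combines a linear and an attention recipe; the recipe indices are placeholders.

```python
from core.extensions.kitchen import CompoundParamsConfigSchema

# Placeholder compound config in the documented format.
config_dict = {
    "kitchen_config_type": "CompoundParams",
    "configs": [
        {"kitchen_config_type": "QLinearParams", "recipe_idx": 0},
        {"kitchen_config_type": "QAttentionParams", "recipe_idx": 1},
    ],
}
schema = CompoundParamsConfigSchema.parse_config_dict(config_dict)
qlinear = schema.get_qlinear_params()        # Optional[nvidia_kitchen.config.QLinearParams]
qattention = schema.get_qattention_params()  # Optional[nvidia_kitchen.attention.QAttentionParams]
qfa = schema.get_qfa_params()                # None in this example
```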
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- get_qlinear_params() Optional[nvidia_kitchen.config.QLinearParams]#
Returns the QLinearParams object for the compound params.
- get_qattention_params() Optional[nvidia_kitchen.attention.QAttentionParams]#
Returns the QAttentionParams object for the compound params.
- get_qfa_params() Optional[nvidia_kitchen.fa_params.QFlashAttentionParams]#
Returns the QFlashAttentionParams object for the compound params.
- class core.extensions.kitchen.KitchenQuantizationParams#
Quantization parameters used for kitchen extensions
- qlinear_params: Optional[nvidia_kitchen.config.QLinearParams]#
None
- match_input: megatron.core.quantization.quant_config.MatchContext#
None
- params_config_key: str#
None
- qattention_params: Optional[nvidia_kitchen.attention.QAttentionParams]#
None
- qfa_params: Optional[nvidia_kitchen.fa_params.QFlashAttentionParams]#
None
- static parse_from_config(
- quant_config: megatron.core.quantization.quant_config.QuantizationConfig,
Parses the quantization config for a layer or raises an error.
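A hedged sketch of consuming the parsed parameters, assuming quant_config is a megatron.core.quantization.quant_config.QuantizationConfig built elsewhere (for example, loaded from a quantization recipe); only the static method and fields documented above are used.

```python
from core.extensions.kitchen import KitchenQuantizationParams

# `quant_config` is assumed to be a QuantizationConfig obtained from the
# surrounding training setup.
params = KitchenQuantizationParams.parse_from_config(quant_config)
if params.qlinear_params is not None:
    print("linear quantization recipe:", params.qlinear_params)
if params.qattention_params is not None:
    print("attention quantization recipe:", params.qattention_params)
print("matched via:", params.match_input, "config key:", params.params_config_key)
```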
- core.extensions.kitchen._get_extra_kitchen_kwargs(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- class core.extensions.kitchen.KitchenLinear(
- input_size: int,
- output_size: int,
- *,
- parallel_mode: Optional[str],
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- skip_weight_param_allocation: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- is_expert: bool = False,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: nvidia_kitchen.Linear
Wrapper for Kitchen's Linear layer.
Note that if Megatron's parallel_state has not been initialized yet, the tp_group passed to Kitchen will be None and must be set later via set_tensor_parallel_group().
parallel_mode currently supports 3 different values:
- "column": Split the weight matrix along the output dimension (for KitchenColumnParallelLinear)
- "row": Split the weight matrix along the input dimension (for KitchenRowParallelLinear)
- "duplicated": No tensor parallelism and the weight is duplicated across TP ranks
Note: For expert linear layers, we will disable communication logic here as TP communication is handled in token_dispatcher.
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Required post-init of quantization configuration.
- forward(x)#
Forward.
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Replicate across TP/DP.
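A hedged construction sketch using the keyword-only signature above; the sizes are illustrative, and `config` (a ModelParallelConfig) and `quantization_config` (a QuantizationConfig) are assumed to come from the surrounding training setup.

```python
import torch
from core.extensions.kitchen import KitchenLinear

# Illustrative sizes; "duplicated" is one of the three parallel_mode values
# ("column", "row", "duplicated") described above.
layer = KitchenLinear(
    input_size=1024,
    output_size=4096,
    parallel_mode="duplicated",
    config=config,
    init_method=torch.nn.init.xavier_uniform_,
    bias=True,
    skip_bias_add=False,
    skip_weight_param_allocation=False,
)
layer.finish_init(quantization_config)  # required post-init before use
```

Per the finish_init entry above, the quantization configuration must be applied before the first forward(x) call.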
- class core.extensions.kitchen.KitchenColumnParallelLinear(
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- gather_output: bool,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- skip_weight_param_allocation: bool = False,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenLinear
Wrapper for Kitchen's Linear layer, specialized similarly to Megatron's ColumnParallelLinear layer.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Sharding along axis 0, bias sharded
- __repr__()#
- class core.extensions.kitchen.KitchenRowParallelLinear(
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- input_is_parallel: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenLinear
Wrapper for Kitchen's Linear layer, specialized similarly to Megatron's RowParallelLinear layer.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Sharding along axis 1, bias not sharded
- __repr__()#
- class core.extensions.kitchen.KitchenGroupedLinear(
- num_gemms: int,
- input_size: int,
- output_size: int,
- *,
- parallel_mode: Optional[str],
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool = False,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: nvidia_kitchen.GroupedLinear
Wrapper for Kitchen's GroupedLinear layer.
Note that if Megatron's parallel_state has not been initialized yet, the tp_group passed to Kitchen will be None and must be set later via set_tensor_parallel_group().
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Required post-init of quantization configuration.
- forward(x, m_splits)#
Forward.
- _encode_extra_state(state)#
- _decode_extra_state(state)#
- _split_extra_state(state)#
- _sharded_state_dict_grouped(
- tp_axis_map,
- prefix='',
- sharded_offsets=(),
- metadata=None,
prefix should be module_name to make keys identical to sequential ones.
- class core.extensions.kitchen.KitchenColumnParallelGroupedLinear(
- num_gemms: int,
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenGroupedLinear
Wrapper for Kitchen's GroupedLinear layer, specialized to column-parallel style.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
For each gemm, sharding along axis 0, bias sharded. Assume sharded_offsets[-1] is the expert parallel offset.
- class core.extensions.kitchen.KitchenRowParallelGroupedLinear(
- num_gemms: int,
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenGroupedLinear
Wrapper for Kitchen's GroupedLinear layer, specialized to row-parallel style.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
For each gemm, sharding along axis 1, bias not sharded. Assume sharded_offsets[-1] is the expert parallel offset.
- class core.extensions.kitchen.KitchenLayerNormColumnParallelLinear(
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- init_method: Callable,
- gather_output: bool,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- skip_weight_param_allocation: bool = False,
- layer_number: Optional[int] = None,
- tp_comm_buffer_name: Optional[str] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: nvidia_kitchen.LayerNormLinear
Wrapper for Kitchen's LayerNormLinear layer that combines layernorm and linear layers.
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Required post-init of quantization configuration.
- forward(x)#
Forward.
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Sharding along axis 0, bias sharded
- __repr__()#
- class core.extensions.kitchen.KitchenFlashAttention(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- layer_number: int,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- attention_type: str,
- attention_dropout: Optional[float] = None,
- softmax_scale: Optional[float] = None,
- cp_comm_type: Optional[str] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
Bases: megatron.core.transformer.module.MegatronModule
Flash Attention implementation for Kitchen.
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Finishes the initialization of the KitchenFlashAttention module.
- forward(
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: torch.Tensor,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType = None,
- attention_bias: torch.Tensor = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
Forward.
- class core.extensions.kitchen.KitchenDotProductAttention(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- layer_number: int,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- attention_type: str,
- attention_dropout: Optional[float] = None,
- softmax_scale: Optional[float] = None,
- cp_comm_type: Optional[str] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
Bases: megatron.core.transformer.module.MegatronModule
Region where selective activation recomputation is applied. This region is memory intensive but less compute intensive, which makes activation checkpointing more efficient for LLMs (20B+). See Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/abs/2205.05198) for more details.
We use the following notation:
- h: hidden size
- n: number of attention heads
- p: number of tensor model parallel partitions
- b: batch size
- s: sequence length
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Finishes the initialization of the KitchenDotProductAttention module.
- forward(
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: torch.Tensor,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType = None,
- attention_bias: torch.Tensor = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
Forward.
- class core.extensions.kitchen.KitchenSpecProvider(
- fallback: megatron.core.models.backends.BackendSpecProvider,
- use_kitchen_attention: bool = False,
- kitchen_attention_backend: str = 'sdpa',
Bases: megatron.core.models.backends.BackendSpecProvider
A protocol for providing the submodules used in Spec building.
Initialization
- column_parallel_linear() type#
Which column parallel linear module kitchen backend uses
- row_parallel_linear() type#
Which row parallel linear module kitchen backend uses
- fuse_layernorm_and_linear() bool#
Whether the kitchen backend supports a single module for layernorm and linear
- column_parallel_layer_norm_linear() Optional[type]#
Which module for sequential layernorm and linear
- layer_norm(rms_norm: bool = False, for_qk: bool = False) type#
Which module to use for layer norm
- core_attention() type#
Which module to use for attention
- grouped_mlp_modules(
- moe_use_grouped_gemm: bool,
- moe_use_legacy_grouped_gemm: bool,
Which module and submodules to use for grouped mlp
- activation_func() type#
Which module to use for activation function
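A hedged sketch of querying the provider's methods above when building a layer spec; `fallback_backend` is assumed to be an existing BackendSpecProvider implementation supplied by the surrounding code, and the argument values are illustrative.

```python
from core.extensions.kitchen import KitchenSpecProvider

# `fallback_backend` is assumed to be a BackendSpecProvider built elsewhere.
provider = KitchenSpecProvider(
    fallback=fallback_backend,
    use_kitchen_attention=True,
    kitchen_attention_backend="sdpa",
)
linear_cls = provider.column_parallel_linear()   # column-parallel linear class
attention_cls = provider.core_attention()        # core attention class
if provider.fuse_layernorm_and_linear():
    ln_linear_cls = provider.column_parallel_layer_norm_linear()
else:
    norm_cls = provider.layer_norm(rms_norm=True)
```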