core.extensions.kitchen#
Module Contents#
Classes#
| Class | Description |
|---|---|
| KitchenConfigType | Configuration object types in config dictionary |
| QFlashAttentionParamsConfigSchema | Dataclass to parse values from config dict of 'QFlashAttentionParams' type |
| QAttentionParamsConfigSchema | Dataclass to parse values from config dict of 'QAttentionParams' type |
| QLinearParamsConfigSchema | Dataclass to parse values from config dict of 'QLinearParams' type |
| CompoundParamsConfigSchema | Dataclass to parse values from config dict of 'CompoundParams' type |
| KitchenQuantizationParams | Quantization parameters used for kitchen extensions |
| KitchenLinear | Wrapper for Kitchen's Linear layer |
| KitchenColumnParallelLinear | Wrapper for Kitchen's Linear layer, specialized like Megatron's ColumnParallelLinear |
| KitchenRowParallelLinear | Wrapper for Kitchen's Linear layer, specialized like Megatron's RowParallelLinear |
| KitchenGroupedLinear | Wrapper for Kitchen's GroupedLinear layer |
| KitchenColumnParallelGroupedLinear | Wrapper for Kitchen's GroupedLinear layer, specialized to column-parallel style |
| KitchenRowParallelGroupedLinear | Wrapper for Kitchen's GroupedLinear layer, specialized to row-parallel style |
| KitchenLayerNormColumnParallelLinear | Wrapper for Kitchen's LayerNormLinear layer that combines layernorm and linear layers |
| KitchenFlashAttention | Flash Attention implementation for Kitchen |
| KitchenDotProductAttention | Region where selective activation recomputation is applied (see Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198) |
| KitchenSpecProvider | A protocol for providing the submodules used in Spec building |
Functions#
Data#
API#
- core.extensions.kitchen.logger#
‘getLogger(…)’
- core.extensions.kitchen._KITCHEN_CONFIG_TYPE_KEY#
‘kitchen_config_type’
- class core.extensions.kitchen.KitchenConfigType(*args, **kwds)#
Bases: enum.Enum
Configuration object types in config dictionary
Initialization
- QLINEAR_PARAMS#
‘QLinearParams’
- QATTENTION_PARAMS#
‘QAttentionParams’
- QFLASHATTENTION_PARAMS#
‘QFlashAttentionParams’
- COMPOUND_PARAMS#
‘CompoundParams’
- class core.extensions.kitchen.QFlashAttentionParamsConfigSchema#
Dataclass to parse values from config dict of ‘QFlashAttentionParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- recipe_name: str#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format: {"kitchen_config_type": "QFlashAttentionParams", "recipe_name": <recipe_name>}
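A minimal usage sketch of the classmethod above; the recipe name string is a hypothetical placeholder, not a recipe shipped with the library.

```python
from core.extensions.kitchen import QFlashAttentionParamsConfigSchema

# Hypothetical config dict in the documented format; "my_fa_recipe" is a
# placeholder recipe name.
config_dict = {
    "kitchen_config_type": "QFlashAttentionParams",
    "recipe_name": "my_fa_recipe",
}
schema = QFlashAttentionParamsConfigSchema.parse_config_dict(config_dict)
qfa_params = schema.to_kitchen_qfa()  # nvidia_kitchen.fa_params.QFlashAttentionParams
```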
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- to_kitchen_qfa() nvidia_kitchen.fa_params.QFlashAttentionParams#
Converts to kitchen library’s QFlashAttentionParams object.
- class core.extensions.kitchen.QAttentionParamsConfigSchema#
Dataclass to parse values from config dict of ‘QAttentionParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- recipe_idx: int#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format: {"kitchen_config_type": "QAttentionParams", "recipe_idx": <recipe_idx>}
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- to_kitchen_qattention() nvidia_kitchen.attention.QAttentionParams#
Converts to kitchen library’s QAttentionParams object.
- class core.extensions.kitchen.QLinearParamsConfigSchema#
Dataclass to parse values from config dict of ‘QLinearParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- recipe_idx: int#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format: {"kitchen_config_type": "QLinearParams", "recipe_idx": <recipe_idx>}
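A minimal parsing sketch using only the documented classmethods; the recipe index is an arbitrary illustrative value.

```python
from core.extensions.kitchen import QLinearParamsConfigSchema

# recipe_idx 0 is a placeholder index into Kitchen's recipe table.
config_dict = {"kitchen_config_type": "QLinearParams", "recipe_idx": 0}
schema = QLinearParamsConfigSchema.parse_config_dict(config_dict)
qlinear_params = schema.to_kitchen_qlinear()  # nvidia_kitchen.config.QLinearParams
```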
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- to_kitchen_qlinear() nvidia_kitchen.config.QLinearParams#
Converts to kitchen library’s QLinearParams object.
- class core.extensions.kitchen.CompoundParamsConfigSchema#
Dataclass to parse values from config dict of ‘CompoundParams’ type
- kitchen_config_type: core.extensions.kitchen.KitchenConfigType#
None
- configs: Dict[Any, Any]#
None
- q_linear_params: Optional[core.extensions.kitchen.QLinearParamsConfigSchema]#
None
- q_attention_params: Optional[core.extensions.kitchen.QAttentionParamsConfigSchema]#
None
- q_fa_params: Optional[core.extensions.kitchen.QFlashAttentionParamsConfigSchema]#
None
- classmethod parse_config_dict(
- config_dict: Dict[Any, Any],
Parse config dictionary and return a schema instance.
Expected config format:
{"kitchen_config_type": "CompoundParams",
 "configs": [
   {"kitchen_config_type": "QLinearParams", "recipe_idx": <recipe_idx>},
   {"kitchen_config_type": "QAttentionParams", "recipe_idx": <recipe_idx>},
 ]}
or
{"kitchen_config_type": "CompoundParams",
 "configs": [
   {"kitchen_config_type": "QLinearParams", "recipe_idx": <recipe_idx>},
   {"kitchen_config_type": "QFlashAttentionParams", "recipe_name": <recipe_name>},
 ]}
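An illustrative sketch of parsing a compound config that combines a linear and an attention recipe; the recipe indices are placeholders.

```python
from core.extensions.kitchen import CompoundParamsConfigSchema

# Placeholder compound config in the documented format.
config_dict = {
    "kitchen_config_type": "CompoundParams",
    "configs": [
        {"kitchen_config_type": "QLinearParams", "recipe_idx": 0},
        {"kitchen_config_type": "QAttentionParams", "recipe_idx": 1},
    ],
}
schema = CompoundParamsConfigSchema.parse_config_dict(config_dict)
qlinear = schema.get_qlinear_params()        # Optional[nvidia_kitchen.config.QLinearParams]
qattention = schema.get_qattention_params()  # Optional[nvidia_kitchen.attention.QAttentionParams]
qfa = schema.get_qfa_params()                # None in this example
```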
- classmethod get_expected_keys() Set[str]#
Get expected keys from the dataclass fields.
- __post_init__()#
- get_qlinear_params() Optional[nvidia_kitchen.config.QLinearParams]#
Returns the QLinearParams object for the compound params.
- get_qattention_params() Optional[nvidia_kitchen.attention.QAttentionParams]#
Returns the QAttentionParams object for the compound params.
- get_qfa_params() Optional[nvidia_kitchen.fa_params.QFlashAttentionParams]#
Returns the QFlashAttentionParams object for the compound params.
- class core.extensions.kitchen.KitchenQuantizationParams#
Quantization parameters used for kitchen extensions
- qlinear_params: Optional[nvidia_kitchen.config.QLinearParams]#
None
- match_input: megatron.core.quantization.quant_config.MatchContext#
None
- params_config_key: str#
None
- qattention_params: Optional[nvidia_kitchen.attention.QAttentionParams]#
None
- qfa_params: Optional[nvidia_kitchen.fa_params.QFlashAttentionParams]#
None
- static parse_from_config(
- quant_config: megatron.core.quantization.quant_config.QuantizationConfig,
Parses the quantization config for a layer or raises an error.
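A hedged sketch of consuming the parsed parameters, assuming quant_config is a megatron.core.quantization.quant_config.QuantizationConfig built elsewhere (for example, loaded from a quantization recipe); only the static method and fields documented above are used.

```python
from core.extensions.kitchen import KitchenQuantizationParams

# `quant_config` is assumed to be a QuantizationConfig obtained from the
# surrounding training setup.
params = KitchenQuantizationParams.parse_from_config(quant_config)
if params.qlinear_params is not None:
    print("linear quantization recipe:", params.qlinear_params)
if params.qattention_params is not None:
    print("attention quantization recipe:", params.qattention_params)
print("matched via:", params.match_input, "config key:", params.params_config_key)
```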
- core.extensions.kitchen._get_extra_kitchen_kwargs(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- class core.extensions.kitchen.KitchenLinear(
- input_size: int,
- output_size: int,
- *,
- parallel_mode: Optional[str],
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- skip_weight_param_allocation: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- is_expert: bool = False,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: nvidia_kitchen.Linear
Wrapper for Kitchen's Linear layer.
Note that if Megatron's parallel_state has not been initialized yet, the tp_group passed to Kitchen will be None and must be set later via set_tensor_parallel_group().
parallel_mode currently supports 3 different values:
- "column": Split the weight matrix along the output dimension (for KitchenColumnParallelLinear)
- "row": Split the weight matrix along the input dimension (for KitchenRowParallelLinear)
- "duplicated": No tensor parallelism and the weight is duplicated across TP ranks
Note: For expert linear layers, we will disable communication logic here as TP communication is handled in token_dispatcher.
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Required post-init of quantization configuration.
- forward(x)#
Forward.
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Replicate across TP/DP.
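A hedged construction sketch using the keyword-only signature above; the sizes are illustrative, and `config` (a ModelParallelConfig) and `quantization_config` (a QuantizationConfig) are assumed to come from the surrounding training setup.

```python
import torch
from core.extensions.kitchen import KitchenLinear

# Illustrative sizes; "duplicated" is one of the three parallel_mode values
# ("column", "row", "duplicated") described above.
layer = KitchenLinear(
    input_size=1024,
    output_size=4096,
    parallel_mode="duplicated",
    config=config,
    init_method=torch.nn.init.xavier_uniform_,
    bias=True,
    skip_bias_add=False,
    skip_weight_param_allocation=False,
)
layer.finish_init(quantization_config)  # required post-init before use
```

Per the finish_init entry above, the quantization configuration must be applied before the first forward(x) call.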
- class core.extensions.kitchen.KitchenColumnParallelLinear(
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- gather_output: bool,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- skip_weight_param_allocation: bool = False,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenLinear
Wrapper for Kitchen's Linear layer, specialized similarly to Megatron's ColumnParallelLinear layer.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Sharding along axis 0, bias sharded
- __repr__()#
- class core.extensions.kitchen.KitchenRowParallelLinear(
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- input_is_parallel: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenLinear
Wrapper for Kitchen's Linear layer, specialized similarly to Megatron's RowParallelLinear layer.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Sharding along axis 1, bias not sharded
- __repr__()#
- class core.extensions.kitchen.KitchenGroupedLinear(
- num_gemms: int,
- input_size: int,
- output_size: int,
- *,
- parallel_mode: Optional[str],
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool = False,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: nvidia_kitchen.GroupedLinear
Wrapper for Kitchen's GroupedLinear layer.
Note that if Megatron's parallel_state has not been initialized yet, the tp_group passed to Kitchen will be None and must be set later via set_tensor_parallel_group().
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Required post-init of quantization configuration.
- forward(x, m_splits)#
Forward.
- _encode_extra_state(state)#
- _decode_extra_state(state)#
- _split_extra_state(state)#
- _sharded_state_dict_grouped(
- tp_axis_map,
- prefix='',
- sharded_offsets=(),
- metadata=None,
prefix should be module_name to make keys identical to sequential ones.
- class core.extensions.kitchen.KitchenColumnParallelGroupedLinear(
- num_gemms: int,
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenGroupedLinear
Wrapper for Kitchen's GroupedLinear layer, specialized to column-parallel style.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
For each gemm, sharding along axis 0, bias sharded. Assume sharded_offsets[-1] is the expert parallel offset.
- class core.extensions.kitchen.KitchenRowParallelGroupedLinear(
- num_gemms: int,
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.model_parallel_config.ModelParallelConfig,
- init_method: Callable,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- tp_comm_buffer_name: Optional[str] = None,
- layer_number: Optional[int] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: core.extensions.kitchen.KitchenGroupedLinear
Wrapper for Kitchen's GroupedLinear layer, specialized to row-parallel style.
Initialization
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
For each gemm, sharding along axis 1, bias not sharded. Assume sharded_offsets[-1] is the expert parallel offset.
- class core.extensions.kitchen.KitchenLayerNormColumnParallelLinear(
- input_size: int,
- output_size: int,
- *,
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- init_method: Callable,
- gather_output: bool,
- bias: bool,
- skip_bias_add: bool,
- is_expert: bool,
- skip_weight_param_allocation: bool = False,
- layer_number: Optional[int] = None,
- tp_comm_buffer_name: Optional[str] = None,
- tp_group: Optional[torch.distributed.ProcessGroup] = None,
Bases: nvidia_kitchen.LayerNormLinear
Wrapper for Kitchen's LayerNormLinear layer that combines layernorm and linear layers.
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Required post-init of quantization configuration.
- forward(x)#
Forward.
- sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#
Sharding along axis 0, bias sharded
- __repr__()#
- class core.extensions.kitchen.KitchenFlashAttention(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- layer_number: int,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- attention_type: str,
- attention_dropout: Optional[float] = None,
- softmax_scale: Optional[float] = None,
- cp_comm_type: Optional[str] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
Bases: megatron.core.transformer.module.MegatronModule
Flash Attention implementation for Kitchen.
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Finishes the initialization of the KitchenFlashAttention module.
- forward(
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: torch.Tensor,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType = None,
- attention_bias: torch.Tensor = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
Forward.
- class core.extensions.kitchen.KitchenDotProductAttention(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- layer_number: int,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- attention_type: str,
- attention_dropout: Optional[float] = None,
- softmax_scale: Optional[float] = None,
- cp_comm_type: Optional[str] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
Bases: megatron.core.transformer.module.MegatronModule
Region where selective activation recomputation is applied. This region is memory intensive but less compute intensive, which makes activation checkpointing more efficient for LLMs (20B+). See Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/abs/2205.05198) for more details.
We use the following notation:
- h: hidden size
- n: number of attention heads
- p: number of tensor model parallel partitions
- b: batch size
- s: sequence length
Initialization
- finish_init(
- quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
Finishes the initialization of the KitchenDotProductAttention module.
- forward(
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: torch.Tensor,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType = None,
- attention_bias: torch.Tensor = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
Forward.
- class core.extensions.kitchen.KitchenSpecProvider(
- fallback: megatron.core.models.backends.BackendSpecProvider,
- use_kitchen_attention: bool = False,
- kitchen_attention_backend: str = 'sdpa',
Bases: megatron.core.models.backends.BackendSpecProvider
A protocol for providing the submodules used in Spec building.
Initialization
- column_parallel_linear() type#
Which column parallel linear module kitchen backend uses
- row_parallel_linear() type#
Which row parallel linear module kitchen backend uses
- fuse_layernorm_and_linear() bool#
Whether the kitchen backend supports a single module for layernorm and linear
- column_parallel_layer_norm_linear() Optional[type]#
Which module for sequential layernorm and linear
- layer_norm(rms_norm: bool = False, for_qk: bool = False) type#
Which module to use for layer norm
- core_attention() type#
Which module to use for attention
- grouped_mlp_modules(
- moe_use_grouped_gemm: bool,
- moe_use_legacy_grouped_gemm: bool,
Which module and submodules to use for grouped mlp
- activation_func() type#
Which module to use for activation function
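A hedged sketch of querying the provider's methods above when building a layer spec; `fallback_backend` is assumed to be an existing BackendSpecProvider implementation supplied by the surrounding code, and the argument values are illustrative.

```python
from core.extensions.kitchen import KitchenSpecProvider

# `fallback_backend` is assumed to be a BackendSpecProvider built elsewhere.
provider = KitchenSpecProvider(
    fallback=fallback_backend,
    use_kitchen_attention=True,
    kitchen_attention_backend="sdpa",
)
linear_cls = provider.column_parallel_linear()   # column-parallel linear class
attention_cls = provider.core_attention()        # core attention class
if provider.fuse_layernorm_and_linear():
    ln_linear_cls = provider.column_parallel_layer_norm_linear()
else:
    norm_cls = provider.layer_norm(rms_norm=True)
```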