core.extensions.kitchen#

Module Contents#

Classes#

KitchenConfigType

Configuration object types in config dictionary

QFlashAttentionParamsConfigSchema

Dataclass to parse values from config dict of ‘QFlashAttentionParams’ type

QAttentionParamsConfigSchema

Dataclass to parse values from config dict of ‘QAttentionParams’ type

QLinearParamsConfigSchema

Dataclass to parse values from config dict of ‘QLinearParams’ type

CompoundParamsConfigSchema

Dataclass to parse values from config dict of ‘CompoundParams’ type

KitchenQuantizationParams

Quantization parameters used for kitchen extensions

KitchenLinear

Wrapper for Kitchen’s Linear layer.

KitchenColumnParallelLinear

Wrapper for Kitchen’s Linear layer, specialized similarly to Megatron’s ColumnParallelLinear layer.

KitchenRowParallelLinear

Wrapper for Kitchen’s Linear layer, specialized similarly to Megatron’s RowParallelLinear layer.

KitchenGroupedLinear

Wrapper for Kitchen’s GroupedLinear layer.

KitchenColumnParallelGroupedLinear

Wrapper for Kitchen’s GroupedLinear layer but specialized to column-parallel style.

KitchenRowParallelGroupedLinear

Wrapper for Kitchen’s GroupedLinear layer but specialized to row-parallel style.

KitchenLayerNormColumnParallelLinear

Wrapper for Kitchen’s LayerNormLinear layer that combines the layernorm and linear layers.

KitchenFlashAttention

Flash Attention implementation for Kitchen.

KitchenDotProductAttention

Region where selective activation recomputation is applied. This region is memory intensive but less compute intensive, which makes activation checkpointing more efficient for LLMs (20B+). See Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/abs/2205.05198) for more details.

KitchenSpecProvider

A protocol for providing the submodules used in Spec building.

Functions#

Data#

API#

core.extensions.kitchen.logger#

‘getLogger(…)’

core.extensions.kitchen._KITCHEN_CONFIG_TYPE_KEY#

‘kitchen_config_type’

class core.extensions.kitchen.KitchenConfigType(*args, **kwds)#

Bases: enum.Enum

Configuration object types in config dictionary

Initialization

QLINEAR_PARAMS#

‘QLinearParams’

QATTENTION_PARAMS#

‘QAttentionParams’

QFLASHATTENTION_PARAMS#

‘QFlashAttentionParams’

COMPOUND_PARAMS#

‘CompoundParams’
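
The enum members carry the config-type strings used under the kitchen_config_type key, so a config dict entry can be dispatched by value. A minimal sketch, assuming the module is importable as megatron.core.extensions.kitchen (this page names it core.extensions.kitchen) and using a placeholder recipe_idx:

```python
# Minimal sketch: dispatch on the "kitchen_config_type" string of a config dict.
# Import path megatron.core.extensions.kitchen is an assumption.
from megatron.core.extensions.kitchen import KitchenConfigType

config_dict = {"kitchen_config_type": "QLinearParams", "recipe_idx": 0}  # recipe_idx is a placeholder

# Enum members hold the config-type strings as their values, so lookup by
# value recovers the member (the key string matches _KITCHEN_CONFIG_TYPE_KEY).
config_type = KitchenConfigType(config_dict["kitchen_config_type"])
assert config_type is KitchenConfigType.QLINEAR_PARAMS
```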

class core.extensions.kitchen.QFlashAttentionParamsConfigSchema#

Dataclass to parse values from config dict of ‘QFlashAttentionParams’ type

kitchen_config_type: core.extensions.kitchen.KitchenConfigType#

None

recipe_name: str#

None

classmethod parse_config_dict(
config_dict: Dict[Any, Any],
) → core.extensions.kitchen.QFlashAttentionParamsConfigSchema#

Parse config dictionary and return a schema instance.

Expected config format: {"kitchen_config_type": "QFlashAttentionParams", "recipe_name": <str>}
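
A short usage sketch of this parsing path; the import path megatron.core.extensions.kitchen and the recipe name are assumptions, and the conversion step requires nvidia_kitchen to be installed:

```python
# Sketch: parse a QFlashAttentionParams-style config dict and convert it.
from megatron.core.extensions.kitchen import QFlashAttentionParamsConfigSchema

cfg = {
    "kitchen_config_type": "QFlashAttentionParams",
    "recipe_name": "example_recipe",  # placeholder, not a known kitchen recipe
}
schema = QFlashAttentionParamsConfigSchema.parse_config_dict(cfg)
qfa_params = schema.to_kitchen_qfa()  # nvidia_kitchen.fa_params.QFlashAttentionParams
```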

classmethod get_expected_keys() → Set[str]#

Get expected keys from the dataclass fields.

__post_init__()#

to_kitchen_qfa() → nvidia_kitchen.fa_params.QFlashAttentionParams#

Converts to kitchen library’s QFlashAttentionParams object.

class core.extensions.kitchen.QAttentionParamsConfigSchema#

Dataclass to parse values from config dict of ‘QAttentionParams’ type

kitchen_config_type: core.extensions.kitchen.KitchenConfigType#

None

recipe_idx: int#

None

classmethod parse_config_dict(
config_dict: Dict[Any, Any],
) → core.extensions.kitchen.QAttentionParamsConfigSchema#

Parse config dictionary and return a schema instance.

Expected config format: {"kitchen_config_type": "QAttentionParams", "recipe_idx": <int>}

classmethod get_expected_keys() → Set[str]#

Get expected keys from the dataclass fields.

__post_init__()#

to_kitchen_qattention() → nvidia_kitchen.attention.QAttentionParams#

Converts to kitchen library’s QAttentionParams object.

class core.extensions.kitchen.QLinearParamsConfigSchema#

Dataclass to parse values from config dict of ‘QLinearParams’ type

kitchen_config_type: core.extensions.kitchen.KitchenConfigType#

None

recipe_idx: int#

None

classmethod parse_config_dict(
config_dict: Dict[Any, Any],
) → core.extensions.kitchen.QLinearParamsConfigSchema#

Parse config dictionary and return a schema instance.

Expected config format: {"kitchen_config_type": "QLinearParams", "recipe_idx": <int>}
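
A brief sketch of validating and parsing this config shape; the import path and the recipe_idx value are assumptions:

```python
# Sketch: validate and parse a QLinearParams-style config dict.
from megatron.core.extensions.kitchen import QLinearParamsConfigSchema

cfg = {"kitchen_config_type": "QLinearParams", "recipe_idx": 0}  # placeholder index

# get_expected_keys() exposes the dataclass fields, useful for rejecting
# unknown keys before parsing.
unknown = set(cfg) - QLinearParamsConfigSchema.get_expected_keys()
assert not unknown, f"unexpected keys: {unknown}"

schema = QLinearParamsConfigSchema.parse_config_dict(cfg)
qlinear = schema.to_kitchen_qlinear()  # requires nvidia_kitchen
```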

classmethod get_expected_keys() → Set[str]#

Get expected keys from the dataclass fields.

__post_init__()#

to_kitchen_qlinear() → nvidia_kitchen.config.QLinearParams#

Converts to kitchen library’s QLinearParams object.

class core.extensions.kitchen.CompoundParamsConfigSchema#

Dataclass to parse values from config dict of ‘CompoundParams’ type

kitchen_config_type: core.extensions.kitchen.KitchenConfigType#

None

configs: Dict[Any, Any]#

None

q_linear_params: Optional[core.extensions.kitchen.QLinearParamsConfigSchema]#

None

q_attention_params: Optional[core.extensions.kitchen.QAttentionParamsConfigSchema]#

None

q_fa_params: Optional[core.extensions.kitchen.QFlashAttentionParamsConfigSchema]#

None

classmethod parse_config_dict(
config_dict: Dict[Any, Any],
) → core.extensions.kitchen.CompoundParamsConfigSchema#

Parse config dictionary and return a schema instance.

Expected config format:

{
    "kitchen_config_type": "CompoundParams",
    "configs": [
        {"kitchen_config_type": "QLinearParams", "recipe_idx": <int>},
        {"kitchen_config_type": "QAttentionParams", "recipe_idx": <int>},
    ]
}

or

{
    "kitchen_config_type": "CompoundParams",
    "configs": [
        {"kitchen_config_type": "QLinearParams", "recipe_idx": <int>},
        {"kitchen_config_type": "QFlashAttentionParams", "recipe_name": <str>},
    ]
}
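
A sketch of parsing the first compound form and reading back the per-layer parameters; the import path and recipe indices are assumptions:

```python
# Sketch: parse a CompoundParams config combining linear and attention entries.
from megatron.core.extensions.kitchen import CompoundParamsConfigSchema

cfg = {
    "kitchen_config_type": "CompoundParams",
    "configs": [
        {"kitchen_config_type": "QLinearParams", "recipe_idx": 0},
        {"kitchen_config_type": "QAttentionParams", "recipe_idx": 0},
    ],
}
compound = CompoundParamsConfigSchema.parse_config_dict(cfg)

# Accessors return None when the corresponding sub-config was not provided.
qlinear = compound.get_qlinear_params()
qattention = compound.get_qattention_params()
qfa = compound.get_qfa_params()  # None here: no QFlashAttentionParams entry
```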

classmethod get_expected_keys() → Set[str]#

Get expected keys from the dataclass fields.

__post_init__()#

get_qlinear_params() → Optional[nvidia_kitchen.config.QLinearParams]#

Returns the QLinearParams object for the compound params.

get_qattention_params() → Optional[nvidia_kitchen.attention.QAttentionParams]#

Returns the QAttentionParams object for the compound params.

get_qfa_params() → Optional[nvidia_kitchen.fa_params.QFlashAttentionParams]#

Returns the QFlashAttentionParams object for the compound params.

class core.extensions.kitchen.KitchenQuantizationParams#

Quantization parameters used for kitchen extensions

qlinear_params: Optional[nvidia_kitchen.config.QLinearParams]#

None

match_input: megatron.core.quantization.quant_config.MatchContext#

None

params_config_key: str#

None

qattention_params: Optional[nvidia_kitchen.attention.QAttentionParams]#

None

qfa_params: Optional[nvidia_kitchen.fa_params.QFlashAttentionParams]#

None

static parse_from_config(
quant_config: megatron.core.quantization.quant_config.QuantizationConfig,
) → core.extensions.kitchen.KitchenQuantizationParams#

Parses the quantization config for a layer or throws an error.

core.extensions.kitchen._get_extra_kitchen_kwargs(
config: megatron.core.transformer.transformer_config.TransformerConfig,
)#

class core.extensions.kitchen.KitchenLinear(
input_size: int,
output_size: int,
*,
parallel_mode: Optional[str],
config: megatron.core.model_parallel_config.ModelParallelConfig,
init_method: Callable,
bias: bool,
skip_bias_add: bool,
skip_weight_param_allocation: bool,
tp_comm_buffer_name: Optional[str] = None,
layer_number: Optional[int] = None,
is_expert: bool = False,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: nvidia_kitchen.Linear

Wrapper for Kitchen’s Linear layer.

Note that if Megatron’s parallel_state has not been initialized yet, the tp_group passed to Kitchen will be None and must be set later via set_tensor_parallel_group().

parallel_mode currently supports 3 different values:

- “column”: Split the weight matrix along the output dimension (for KitchenColumnParallelLinear)
- “row”: Split the weight matrix along the input dimension (for KitchenRowParallelLinear)
- “duplicated”: No tensor parallelism; the weight is duplicated across TP ranks

Note: For expert linear layers, communication logic is disabled here because TP communication is handled in the token_dispatcher.

Initialization

finish_init(
quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
)#

Required post-init of quantization configuration.

forward(x)#

Forward.

sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#

Replicated across TP/DP.
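
The split between the constructor and finish_init above implies a two-step setup: build the layer from a ModelParallelConfig, then attach the quantization config before use. A rough sketch under assumptions only (default-constructed ModelParallelConfig, placeholder sizes, a quantization_config built elsewhere, nvidia_kitchen and a CUDA device available):

```python
# Rough sketch only: a "duplicated" (non-tensor-parallel) KitchenLinear.
# The sizes, the default ModelParallelConfig, and the quantization_config
# source are placeholders, not a tested configuration.
import torch
from megatron.core.model_parallel_config import ModelParallelConfig
from megatron.core.extensions.kitchen import KitchenLinear

layer = KitchenLinear(
    input_size=1024,
    output_size=4096,
    parallel_mode="duplicated",  # "column" / "row" shard the weight across TP ranks
    config=ModelParallelConfig(),
    init_method=torch.nn.init.xavier_uniform_,
    bias=True,
    skip_bias_add=False,
    skip_weight_param_allocation=False,
)

# Required post-init step: quantization_config is a
# megatron.core.quantization.quant_config.QuantizationConfig built elsewhere.
quantization_config = ...  # placeholder
layer.finish_init(quantization_config)
# After finish_init, the layer is callable; forward(x) runs the Kitchen GEMM path.
```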

class core.extensions.kitchen.KitchenColumnParallelLinear(
input_size: int,
output_size: int,
*,
config: megatron.core.model_parallel_config.ModelParallelConfig,
init_method: Callable,
gather_output: bool,
bias: bool,
skip_bias_add: bool,
is_expert: bool,
skip_weight_param_allocation: bool = False,
tp_comm_buffer_name: Optional[str] = None,
layer_number: Optional[int] = None,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: core.extensions.kitchen.KitchenLinear

Wrapper for Kitchen’s Linear layer, specialized similarly to Megatron’s ColumnParallelLinear layer.

Initialization

sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#

Sharding along axis 0, bias sharded

__repr__()#

class core.extensions.kitchen.KitchenRowParallelLinear(
input_size: int,
output_size: int,
*,
config: megatron.core.model_parallel_config.ModelParallelConfig,
init_method: Callable,
bias: bool,
input_is_parallel: bool,
skip_bias_add: bool,
is_expert: bool,
tp_comm_buffer_name: Optional[str] = None,
layer_number: Optional[int] = None,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: core.extensions.kitchen.KitchenLinear

Wrapper for Kitchen’s Linear layer, specialized similarly to Megatron’s RowParallelLinear layer.

Initialization

sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#

Sharding along axis 1, bias not sharded

__repr__()#

class core.extensions.kitchen.KitchenGroupedLinear(
num_gemms: int,
input_size: int,
output_size: int,
*,
parallel_mode: Optional[str],
config: megatron.core.model_parallel_config.ModelParallelConfig,
init_method: Callable,
bias: bool,
skip_bias_add: bool,
is_expert: bool = False,
tp_comm_buffer_name: Optional[str] = None,
layer_number: Optional[int] = None,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: nvidia_kitchen.GroupedLinear

Wrapper for Kitchen’s GroupedLinear layer.

Note that if Megatron’s parallel_state has not been initialized yet, the tp_group passed to Kitchen will be None and must be set later via set_tensor_parallel_group().

Initialization

finish_init(
quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
) → None#

Required post-init of quantization configuration.

forward(x, m_splits)#

Forward.

_encode_extra_state(state)#

_decode_extra_state(state)#

_split_extra_state(state)#

_sharded_state_dict_grouped(
tp_axis_map,
prefix='',
sharded_offsets=(),
metadata=None,
)#

The prefix should be module_name to make keys identical to sequential ones.

class core.extensions.kitchen.KitchenColumnParallelGroupedLinear(
num_gemms: int,
input_size: int,
output_size: int,
*,
config: megatron.core.model_parallel_config.ModelParallelConfig,
init_method: Callable,
bias: bool,
skip_bias_add: bool,
is_expert: bool,
tp_comm_buffer_name: Optional[str] = None,
layer_number: Optional[int] = None,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: core.extensions.kitchen.KitchenGroupedLinear

Wrapper for Kitchen’s GroupedLinear layer but specialized to column-parallel style.

Initialization

sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#

For each gemm, sharding along axis 0, bias sharded. Assume sharded_offsets[-1] is the expert parallel offset.

class core.extensions.kitchen.KitchenRowParallelGroupedLinear(
num_gemms: int,
input_size: int,
output_size: int,
*,
config: megatron.core.model_parallel_config.ModelParallelConfig,
init_method: Callable,
bias: bool,
skip_bias_add: bool,
is_expert: bool,
tp_comm_buffer_name: Optional[str] = None,
layer_number: Optional[int] = None,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: core.extensions.kitchen.KitchenGroupedLinear

Wrapper for Kitchen’s GroupedLinear layer but specialized to row-parallel style.

Initialization

sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#

For each gemm, sharding along axis 1, bias not sharded. Assume sharded_offsets[-1] is the expert parallel offset.

class core.extensions.kitchen.KitchenLayerNormColumnParallelLinear(
input_size: int,
output_size: int,
*,
config: megatron.core.transformer.transformer_config.TransformerConfig,
init_method: Callable,
gather_output: bool,
bias: bool,
skip_bias_add: bool,
is_expert: bool,
skip_weight_param_allocation: bool = False,
layer_number: Optional[int] = None,
tp_comm_buffer_name: Optional[str] = None,
tp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: nvidia_kitchen.LayerNormLinear

Wrapper for Kitchen’s LayerNormLinear layer that combines the layernorm and linear layers.

Initialization

finish_init(
quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
) → None#

Required post-init of quantization configuration.

forward(x)#

Forward.

sharded_state_dict(prefix='', sharded_offsets=(), metadata=None)#

Sharding along axis 0, bias sharded

__repr__()#

class core.extensions.kitchen.KitchenFlashAttention(
config: megatron.core.transformer.transformer_config.TransformerConfig,
layer_number: int,
attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
attention_type: str,
attention_dropout: Optional[float] = None,
softmax_scale: Optional[float] = None,
cp_comm_type: Optional[str] = None,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Flash Attention implementation for Kitchen.

Initialization

finish_init(
quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
)#

Finishes the initialization of the KitchenFlashAttention module.

forward(
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: torch.Tensor,
attn_mask_type: megatron.core.transformer.enums.AttnMaskType = None,
attention_bias: torch.Tensor = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
)#

Forward.

class core.extensions.kitchen.KitchenDotProductAttention(
config: megatron.core.transformer.transformer_config.TransformerConfig,
layer_number: int,
attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
attention_type: str,
attention_dropout: Optional[float] = None,
softmax_scale: Optional[float] = None,
cp_comm_type: Optional[str] = None,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Region where selective activation recomputation is applied. This region is memory intensive but less compute intensive, which makes activation checkpointing more efficient for LLMs (20B+). See Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/abs/2205.05198) for more details.

We use the following notation:

- h: hidden size
- n: number of attention heads
- p: number of tensor model parallel partitions
- b: batch size
- s: sequence length

Initialization

finish_init(
quantization_config: megatron.core.quantization.quant_config.QuantizationConfig,
)#

Finishes the initialization of the KitchenDotProductAttention module.

forward(
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: torch.Tensor,
attn_mask_type: megatron.core.transformer.enums.AttnMaskType = None,
attention_bias: torch.Tensor = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
)#

Forward.

class core.extensions.kitchen.KitchenSpecProvider(
fallback: megatron.core.models.backends.BackendSpecProvider,
use_kitchen_attention: bool = False,
kitchen_attention_backend: str = 'sdpa',
)#

Bases: megatron.core.models.backends.BackendSpecProvider

A protocol for providing the submodules used in Spec building.

Initialization

column_parallel_linear() → type#

Which column-parallel linear module the Kitchen backend uses.

row_parallel_linear() → type#

Which row-parallel linear module the Kitchen backend uses.

fuse_layernorm_and_linear() → bool#

Whether the Kitchen backend supports a single module for layernorm and linear.

column_parallel_layer_norm_linear() → Optional[type]#

Which module to use for sequential layernorm and linear.

layer_norm(rms_norm: bool = False, for_qk: bool = False) → type#

Which module to use for layer norm.

core_attention() → type#

Which module to use for attention.

grouped_mlp_modules(
moe_use_grouped_gemm: bool,
moe_use_legacy_grouped_gemm: bool,
) → Tuple[type, Optional[megatron.core.transformer.mlp.MLPSubmodules]]#

Which module and submodules to use for grouped MLP.

activation_func() → type#

Which module to use for activation function
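
Since KitchenSpecProvider answers the same queries as the BackendSpecProvider it wraps, spec-building code can place it in front of an existing provider. A sketch, assuming a TE-backed fallback named TESpecProvider is available in megatron.core.models.backends (not confirmed by this page):

```python
# Sketch: route spec-building queries through the Kitchen backend.
# TESpecProvider is an assumed fallback; only BackendSpecProvider and
# KitchenSpecProvider are documented on this page.
from megatron.core.models.backends import TESpecProvider
from megatron.core.extensions.kitchen import KitchenSpecProvider

backend = KitchenSpecProvider(
    fallback=TESpecProvider(),
    use_kitchen_attention=True,
    kitchen_attention_backend="sdpa",  # default shown in the signature above
)

column_linear_cls = backend.column_parallel_linear()  # e.g. KitchenColumnParallelLinear
row_linear_cls = backend.row_parallel_linear()        # e.g. KitchenRowParallelLinear
attention_cls = backend.core_attention()              # depends on use_kitchen_attention
```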