API Guide#
- models package
- tensor_parallel package
- Submodules
- tensor_parallel.cross_entropy module
- tensor_parallel.data module
- tensor_parallel.layers module
ColumnParallelLinear, LinearWithFrozenWeight, LinearWithGradAccumulationAndAsyncCommunication, RowParallelLinear, VocabParallelEmbedding, copy_tensor_model_parallel_attributes(), linear_with_frozen_weight(), linear_with_grad_accumulation_and_async_allreduce(), param_is_not_tensor_parallel_duplicate(), set_defaults_if_not_set_tensor_model_parallel_attributes(), set_tensor_model_parallel_attributes()
- tensor_parallel.mappings module
all_gather_last_dim_from_tensor_parallel_region(), all_to_all(), all_to_all_hp2sp(), all_to_all_sp2hp(), copy_to_tensor_model_parallel_region(), gather_from_sequence_parallel_region(), gather_from_tensor_model_parallel_region(), reduce_from_tensor_model_parallel_region(), reduce_scatter_last_dim_to_tensor_parallel_region(), reduce_scatter_to_sequence_parallel_region(), scatter_to_sequence_parallel_region(), scatter_to_tensor_model_parallel_region()
- tensor_parallel.random module
- tensor_parallel.utils module
- Module contents
CheckpointWithoutOutput, ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding, broadcast_data(), checkpoint(), copy_tensor_model_parallel_attributes(), copy_to_tensor_model_parallel_region(), gather_from_sequence_parallel_region(), gather_from_tensor_model_parallel_region(), gather_split_1d_tensor(), get_cuda_rng_tracker(), get_expert_parallel_rng_tracker_name(), linear_with_grad_accumulation_and_async_allreduce(), model_parallel_cuda_manual_seed(), param_is_not_tensor_parallel_duplicate(), reduce_from_tensor_model_parallel_region(), reduce_scatter_to_sequence_parallel_region(), scatter_to_sequence_parallel_region(), scatter_to_tensor_model_parallel_region(), set_defaults_if_not_set_tensor_model_parallel_attributes(), set_tensor_model_parallel_attributes(), split_tensor_along_last_dim(), split_tensor_into_1d_equal_chunks(), vocab_parallel_cross_entropy()
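The tensor_parallel package above splits individual weight matrices across ranks; the core idea behind utilities like split_tensor_along_last_dim() and ColumnParallelLinear is partitioning a weight's output columns so each tensor-parallel rank holds one contiguous slice. A minimal pure-Python sketch of that partitioning (illustrative only, not the actual Megatron implementation, which operates on torch tensors):

```python
# Conceptual sketch of splitting a 2-D weight along its last dimension
# across tensor-parallel ranks, as ColumnParallelLinear does for its
# output features. Pure Python lists stand in for torch tensors.

def split_tensor_along_last_dim(tensor, num_partitions):
    """Split 'tensor' (a list of rows) into equal chunks along the last dim."""
    last_dim = len(tensor[0])
    assert last_dim % num_partitions == 0, "last dim must divide evenly"
    chunk = last_dim // num_partitions
    return [
        [row[p * chunk:(p + 1) * chunk] for row in tensor]
        for p in range(num_partitions)
    ]

weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
shards = split_tensor_along_last_dim(weight, 2)
# shards[0] holds columns 0-1, shards[1] holds columns 2-3
```

Each rank then computes its partial output locally; the mappings module's gather/reduce collectives recombine the shards.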
- context_parallel package
- pipeline_parallel package
- Submodules
- pipeline_parallel.p2p_communication module
- pipeline_parallel.schedules module
backward_step(), check_first_val_step(), clear_embedding_activation_buffer(), convert_schedule_table_to_order(), custom_backward(), deallocate_output_tensor(), finish_embedding_wgrad_compute(), forward_backward_no_pipelining(), forward_backward_pipelining_with_interleaving(), forward_backward_pipelining_without_interleaving(), forward_step(), forward_step_calc_loss(), get_forward_backward_func(), get_pp_rank_microbatches(), get_schedule_table(), get_tensor_shapes(), set_current_microbatch()
- Module contents
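The schedules module centers on get_forward_backward_func(), which selects one of the three forward_backward_* schedules listed above. A hedged sketch of that dispatch logic follows; the real function reads the parallel state from Megatron's globals rather than taking explicit arguments, so the signature here is an assumption for illustration:

```python
# Illustrative dispatch behind get_forward_backward_func(): with a single
# pipeline stage there is nothing to pipeline; with several stages, the
# interleaved (virtual-pipeline) schedule is chosen when virtual stages
# are configured. Stub bodies stand in for the real schedules.

def forward_backward_no_pipelining(**kwargs): ...
def forward_backward_pipelining_without_interleaving(**kwargs): ...
def forward_backward_pipelining_with_interleaving(**kwargs): ...

def get_forward_backward_func(pipeline_parallel_size,
                              virtual_pipeline_parallel_size=None):
    if pipeline_parallel_size > 1:
        if virtual_pipeline_parallel_size is not None:
            return forward_backward_pipelining_with_interleaving
        return forward_backward_pipelining_without_interleaving
    return forward_backward_no_pipelining
```

The returned callable then drives forward_step()/backward_step() over all microbatches for the current pipeline rank.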
- MCore Custom Fully Sharded Data Parallel (FSDP)
- How to use?
- Key Features
- Configuration Recommendations
- Design of Custom FSDP
- References
- fusions package
- transformer package
- Submodules
- transformer.attention module
- transformer.dot_product_attention module
- transformer.enums module
- transformer.identity_op module
- transformer.mlp module
- transformer.module module
- transformer.transformer_block module
- transformer.transformer_config module
MLATransformerConfig, TransformerConfig, account_for_embedding_in_pipeline_split, account_for_loss_in_pipeline_split, activation_func, activation_func_clamp_value, activation_func_fp8_input_store, add_bias_linear, add_qkv_bias, apply_query_key_layer_scaling, apply_residual_connection_post_layernorm, apply_rope_fusion, attention_backend, attention_dropout, attention_softmax_in_fp32, bias_activation_fusion, bias_dropout_fusion, calculate_per_token_loss, clone_scatter_output_in_embedding, config_logger_dir, cp_comm_type, cuda_graph_retain_backward_graph, cuda_graph_scope, cuda_graph_use_single_mempool, cuda_graph_warmup_steps, disable_bf16_reduced_precision_matmul, disable_parameter_transpose_cache, distribute_saved_activations, embedding_init_method, embedding_init_method_std, enable_cuda_graph, external_cuda_graph, ffn_hidden_size, first_last_layers_bf16, flash_decode, fp32_residual_connection, fp4, fp4_param, fp4_recipe, fp8, fp8_amax_compute_algo, fp8_amax_history_len, fp8_dot_product_attention, fp8_interval, fp8_margin, fp8_multi_head_attention, fp8_param, fp8_recipe, fp8_wgrad, gated_linear_unit, glu_linear_offset, hetereogenous_dist_checkpoint, heterogeneous_block_specs, hidden_dropout, hidden_size, inference_rng_tracker, inference_sampling_seed, init_method, init_method_std, init_model_with_meta_device, is_hybrid_model, kv_channels, layernorm_epsilon, layernorm_zero_centered_gamma, mamba_head_dim, mamba_num_groups, mamba_num_heads, mamba_state_dim, masked_softmax_fusion, memory_efficient_layer_norm, mlp_chunks_for_prefill, moe_apply_probs_on_input, moe_aux_loss_coeff, moe_deepep_num_sms, moe_enable_deepep, moe_expert_capacity_factor, moe_ffn_hidden_size, moe_grouped_gemm, moe_input_jitter_eps, moe_layer_freq, moe_layer_recompute, moe_pad_expert_input_to_capacity, moe_per_layer_logging, moe_permute_fusion, moe_router_bias_update_rate, moe_router_dtype, moe_router_enable_expert_bias, moe_router_force_load_balancing, moe_router_fusion, moe_router_group_topk, moe_router_load_balancing_type, moe_router_num_groups, moe_router_padding_for_fp8, moe_router_pre_softmax, moe_router_score_function, moe_router_topk, moe_router_topk_limited_devices, moe_router_topk_scaling_factor, moe_shared_expert_intermediate_size, moe_shared_expert_overlap, moe_token_dispatcher_type, moe_token_drop_policy, moe_token_dropping, moe_use_legacy_grouped_gemm, moe_z_loss_coeff, mrope_section, mtp_loss_scaling_factor, mtp_num_layers, multi_latent_attention, no_rope_freq, normalization, num_attention_heads, num_layers, num_layers_at_end_in_bf16, num_layers_at_start_in_bf16, num_layers_in_first_pipeline_stage, num_layers_in_last_pipeline_stage, num_moe_experts, num_query_groups, output_layer_init_method, persist_layer_norm, pipeline_model_parallel_layout, qk_layernorm, quant_recipe, recompute_granularity, recompute_method, recompute_modules, recompute_num_layers, rotary_interleaved, softmax_scale, softmax_type, symmetric_ar_type, test_mode, tp_only_amax_red, transformer_impl, use_fused_weighted_squared_relu, use_kitchen, use_mamba_mem_eff_path, use_te_activation_func, use_te_rng_tracker, window_attn_skip_freq, window_size
- transformer.transformer_layer module
- transformer.utils module
attention_mask_func(), erf_gelu(), gelu_impl(), get_default_causal_mask(), get_linear_layer(), get_sliding_window_causal_mask(), init_cuda_graph_cache(), is_layer_window_attention(), make_sharded_object_for_checkpoint(), make_sharded_tensors_for_checkpoint(), openai_gelu(), set_model_to_sequence_parallel(), sharded_state_dict_default(), toggle_cuda_graphs()
- Module contents
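The transformer.utils module lists two GELU variants, erf_gelu() and openai_gelu(). The scalar math behind them can be sketched in pure Python; the listed utilities themselves operate on torch tensors, so these definitions are illustrative stand-ins, not the library's implementations:

```python
import math

# Exact GELU vs. OpenAI's tanh approximation, written for scalars.
# The approximation is typically within ~1e-3 of the exact form.

def erf_gelu(x):
    """Exact GELU: x * Phi(x), with the Gaussian CDF expressed via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def openai_gelu(x):
    """OpenAI/GPT-2 tanh approximation of GELU."""
    return 0.5 * x * (
        1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3))
    )
```

The tanh form avoids an erf evaluation, which historically made it cheaper to fuse on accelerators.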
- Mixture of Experts package
- dist_checkpointing package
- Safe Checkpoint Loading
- Subpackages
- Submodules
- dist_checkpointing.serialization module
- dist_checkpointing.mapping module
LocalNonpersistentObject, ShardedBase, ShardedObject, ShardedTensor, allow_shape_mismatch, axis_fragmentations, data, dtype, flattened_range, from_rank_offsets(), from_rank_offsets_flat(), global_coordinates(), global_offset, global_shape, global_slice(), has_regular_grid, init_data(), key, local_chunk_offset_in_global(), local_coordinates(), local_shape, max_allowed_chunks(), narrow(), prepend_axis_num, replica_id, validate_metadata_integrity(), without_data()
ShardedTensorFactory, apply_factories(), apply_factory_merges(), is_main_replica()
- dist_checkpointing.optimizer module
- dist_checkpointing.core module
- dist_checkpointing.dict_utils module
- dist_checkpointing.utils module
add_prefix_for_sharding(), apply_prefix_mapping(), debug_msg(), debug_time(), extract_nonpersistent(), extract_sharded_base(), extract_sharded_tensors(), extract_sharded_tensors_and_factories(), extract_sharded_tensors_or_nonpersistent(), force_all_tensors_to_non_fp8(), logger_stack(), replace_prefix_for_sharding(), zip_strict()
- Module contents
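A recurring pattern in dist_checkpointing.utils is key-prefix manipulation: helpers like add_prefix_for_sharding() and apply_prefix_mapping() rewrite the keys of a sharded state dict so shards from different submodules land under distinct names in the checkpoint. The sketch below shows only the conceptual key rewrite; the real helpers mutate ShardedTensor metadata in place rather than returning a plain dict, so treat this as an assumed simplification:

```python
# Conceptual sketch of prefixing sharded-state-dict keys, as done when a
# submodule's shards are embedded into a parent module's checkpoint
# namespace. Values here are placeholders for ShardedTensor objects.

def add_prefix_for_sharding(sharded_state_dict, prefix):
    """Return a copy of the dict with 'prefix' prepended to every key."""
    return {prefix + key: value for key, value in sharded_state_dict.items()}

sd = {"weight": "shard-A", "bias": "shard-B"}
prefixed = add_prefix_for_sharding(sd, "decoder.layers.0.")
```

Keeping checkpoint keys aligned with the module hierarchy is what lets a reshaped (e.g. different tensor-parallel size) run reload the same checkpoint.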
- Distributed Optimizer
- distributed package
- datasets package
- Data Pipeline
- Submodules
- datasets.blended_megatron_dataset_config module
- datasets.blended_megatron_dataset_builder module
- datasets.megatron_tokenizer module
- datasets.indexed_dataset module
- datasets.megatron_dataset module
- datasets.gpt_dataset module
- datasets.masked_dataset module
- datasets.bert_dataset module
- datasets.t5_dataset module
- datasets.blended_dataset module
- datasets.utils module
- Module contents
- Multi-Latent Attention
- Microbatches Calculator
- Module contents
ConstantNumMicroBatchesCalculator, NumMicroBatchesCalculator, RampupBatchsizeNumMicroBatchesCalculator, destroy_num_microbatches_calculator(), get_current_global_batch_size(), get_current_running_global_batch_size(), get_micro_batch_size(), get_num_microbatches(), init_num_microbatches_calculator(), reconfigure_num_microbatches_calculator(), unset_num_microbatches_calculator(), update_num_microbatches()
- Module contents
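The arithmetic behind ConstantNumMicroBatchesCalculator is simple: the global batch is split evenly across data-parallel replicas, and each replica steps through its share in micro-batches. A back-of-the-envelope sketch (the helper name and divisibility check are illustrative assumptions, not the class's API):

```python
# num_microbatches = global_batch / (micro_batch * data_parallel_size),
# which only makes sense when the division is exact.

def num_microbatches(global_batch_size, micro_batch_size, data_parallel_size):
    assert global_batch_size % (micro_batch_size * data_parallel_size) == 0, \
        "global batch must divide evenly across replicas and micro-batches"
    per_replica = global_batch_size // data_parallel_size
    return per_replica // micro_batch_size

# e.g. global batch 512, micro-batch 4, 8 data-parallel replicas
# -> each replica accumulates gradients over 16 micro-batches per step
```

RampupBatchsizeNumMicroBatchesCalculator applies the same formula while the global batch size grows over the first training iterations.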
- Optimizer Parameters Scheduler
- Optimizer CPU offload package
- Multi-Token Prediction (MTP)
- New Tokenizer System