API Guide
- models package
- tensor_parallel package
- Submodules
- tensor_parallel.cross_entropy module
- tensor_parallel.data module
- tensor_parallel.layers module
ColumnParallelLinear
LinearWithFrozenWeight
LinearWithGradAccumulationAndAsyncCommunication
RowParallelLinear
VocabParallelEmbedding
copy_tensor_model_parallel_attributes()
linear_with_frozen_weight()
linear_with_grad_accumulation_and_async_allreduce()
param_is_not_tensor_parallel_duplicate()
set_defaults_if_not_set_tensor_model_parallel_attributes()
set_tensor_model_parallel_attributes()
- tensor_parallel.mappings module
all_gather_last_dim_from_tensor_parallel_region()
all_to_all()
all_to_all_hp2sp()
all_to_all_sp2hp()
copy_to_tensor_model_parallel_region()
gather_from_sequence_parallel_region()
gather_from_tensor_model_parallel_region()
reduce_from_tensor_model_parallel_region()
reduce_scatter_last_dim_to_tensor_parallel_region()
reduce_scatter_to_sequence_parallel_region()
scatter_to_sequence_parallel_region()
scatter_to_tensor_model_parallel_region()
- tensor_parallel.random module
- tensor_parallel.utils module
- Module contents
CheckpointWithoutOutput
ColumnParallelLinear
RowParallelLinear
VocabParallelEmbedding
broadcast_data()
checkpoint()
copy_tensor_model_parallel_attributes()
copy_to_tensor_model_parallel_region()
gather_from_sequence_parallel_region()
gather_from_tensor_model_parallel_region()
gather_split_1d_tensor()
get_cuda_rng_tracker()
get_expert_parallel_rng_tracker_name()
linear_with_grad_accumulation_and_async_allreduce()
model_parallel_cuda_manual_seed()
param_is_not_tensor_parallel_duplicate()
reduce_from_tensor_model_parallel_region()
reduce_scatter_to_sequence_parallel_region()
scatter_to_sequence_parallel_region()
scatter_to_tensor_model_parallel_region()
set_defaults_if_not_set_tensor_model_parallel_attributes()
set_tensor_model_parallel_attributes()
split_tensor_along_last_dim()
split_tensor_into_1d_equal_chunks()
vocab_parallel_cross_entropy()
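The layers above compose into tensor-parallel MLP blocks. Below is a minimal sketch, assuming `torch.distributed` is initialized, `megatron.core.parallel_state.initialize_model_parallel()` has been called, and a CUDA device is available; the sizes and init method are illustrative placeholders.

```python
import torch
from megatron.core import ModelParallelConfig
from megatron.core.tensor_parallel import (
    ColumnParallelLinear,
    RowParallelLinear,
    model_parallel_cuda_manual_seed,
)

# Seed the tensor-model-parallel RNG tracker so GPU weight init can fork it.
model_parallel_cuda_manual_seed(1234)

config = ModelParallelConfig()  # defaults; real code passes the training config

# Column-parallel layer shards the output dimension across tensor-parallel ranks.
fc1 = ColumnParallelLinear(
    input_size=1024,
    output_size=4096,
    config=config,
    init_method=torch.nn.init.xavier_normal_,
    gather_output=False,        # keep the output sharded for the next layer
)
# Row-parallel layer shards the input dimension and all-reduces the partial results.
fc2 = RowParallelLinear(
    input_size=4096,
    output_size=1024,
    config=config,
    init_method=torch.nn.init.xavier_normal_,
    input_is_parallel=True,     # consumes fc1's sharded output directly
)

x = torch.randn(8, 1024, device="cuda")
hidden, _ = fc1(x)                                   # each layer returns (output, bias)
out, _ = fc2(torch.nn.functional.gelu(hidden))
```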
- context_parallel package
- pipeline_parallel package
- Submodules
- pipeline_parallel.p2p_communication module
- pipeline_parallel.schedules module
backward_step()
check_first_val_step()
clear_embedding_activation_buffer()
convert_schedule_table_to_order()
custom_backward()
deallocate_output_tensor()
finish_embedding_wgrad_compute()
forward_backward_no_pipelining()
forward_backward_pipelining_with_interleaving()
forward_backward_pipelining_without_interleaving()
forward_step()
forward_step_calc_loss()
get_forward_backward_func()
get_pp_rank_microbatches()
get_schedule_table()
get_tensor_shapes()
set_current_microbatch()
- Module contents
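The schedules in pipeline_parallel.schedules are selected and invoked through get_forward_backward_func(). A minimal sketch of one training iteration follows, assuming model parallelism is already initialized; `model`, `train_data_iterator`, the GPT-style (tokens, labels) batch layout, and the cross-entropy loss are illustrative assumptions, not part of the API above.

```python
import torch
from megatron.core.pipeline_parallel import get_forward_backward_func

def forward_step(data_iterator, model):
    """Run one microbatch forward pass and hand the schedule a loss callback."""
    tokens, labels = next(data_iterator)          # illustrative batch layout
    output_tensor = model(tokens)

    def loss_func(output_tensor):
        loss = torch.nn.functional.cross_entropy(
            output_tensor.view(-1, output_tensor.size(-1)), labels.view(-1)
        )
        return loss, {"lm loss": loss.detach()}

    return output_tensor, loss_func

# Returns the no-pipelining, interleaved, or non-interleaved schedule depending on
# the current pipeline-parallel configuration.
forward_backward_func = get_forward_backward_func()
losses_reduced = forward_backward_func(
    forward_step_func=forward_step,
    data_iterator=train_data_iterator,
    model=model,
    num_microbatches=8,
    seq_length=2048,
    micro_batch_size=1,
    forward_only=False,
)
```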
- MCore Custom Fully Sharded Data Parallel (FSDP)
- How to use?
- Key Features
- Configuration Recommendations
- Design of Custom FSDP
- References
- fusions package
- transformer package
- Submodules
- transformer.attention module
- transformer.dot_product_attention module
- transformer.enums module
- transformer.identity_op module
- transformer.mlp module
- transformer.module module
- transformer.transformer_block module
- transformer.transformer_config module
MLATransformerConfig
TransformerConfig
account_for_embedding_in_pipeline_split
account_for_loss_in_pipeline_split
activation_func
activation_func_clamp_value
activation_func_fp8_input_store
add_bias_linear
add_qkv_bias
apply_query_key_layer_scaling
apply_residual_connection_post_layernorm
apply_rope_fusion
attention_backend
attention_dropout
attention_softmax_in_fp32
bias_activation_fusion
bias_dropout_fusion
calculate_per_token_loss
clone_scatter_output_in_embedding
config_logger_dir
cp_comm_type
cuda_graph_retain_backward_graph
cuda_graph_scope
cuda_graph_use_single_mempool
cuda_graph_warmup_steps
disable_bf16_reduced_precision_matmul
disable_parameter_transpose_cache
distribute_saved_activations
embedding_init_method
embedding_init_method_std
enable_cuda_graph
external_cuda_graph
ffn_hidden_size
first_last_layers_bf16
flash_decode
fp32_residual_connection
fp4
fp4_param
fp4_recipe
fp8
fp8_amax_compute_algo
fp8_amax_history_len
fp8_dot_product_attention
fp8_interval
fp8_margin
fp8_multi_head_attention
fp8_param
fp8_recipe
fp8_wgrad
gated_linear_unit
glu_linear_offset
hetereogenous_dist_checkpoint
heterogeneous_block_specs
hidden_dropout
hidden_size
inference_rng_tracker
inference_sampling_seed
init_method
init_method_std
init_model_with_meta_device
is_hybrid_model
kv_channels
layernorm_epsilon
layernorm_zero_centered_gamma
mamba_head_dim
mamba_num_groups
mamba_num_heads
mamba_state_dim
masked_softmax_fusion
memory_efficient_layer_norm
mlp_chunks_for_prefill
moe_apply_probs_on_input
moe_aux_loss_coeff
moe_deepep_num_sms
moe_enable_deepep
moe_expert_capacity_factor
moe_ffn_hidden_size
moe_grouped_gemm
moe_input_jitter_eps
moe_layer_freq
moe_layer_recompute
moe_pad_expert_input_to_capacity
moe_per_layer_logging
moe_permute_fusion
moe_router_bias_update_rate
moe_router_dtype
moe_router_enable_expert_bias
moe_router_force_load_balancing
moe_router_fusion
moe_router_group_topk
moe_router_load_balancing_type
moe_router_num_groups
moe_router_padding_for_fp8
moe_router_pre_softmax
moe_router_score_function
moe_router_topk
moe_router_topk_limited_devices
moe_router_topk_scaling_factor
moe_shared_expert_intermediate_size
moe_shared_expert_overlap
moe_token_dispatcher_type
moe_token_drop_policy
moe_token_dropping
moe_use_legacy_grouped_gemm
moe_z_loss_coeff
mrope_section
mtp_loss_scaling_factor
mtp_num_layers
multi_latent_attention
no_rope_freq
normalization
num_attention_heads
num_layers
num_layers_at_end_in_bf16
num_layers_at_start_in_bf16
num_layers_in_first_pipeline_stage
num_layers_in_last_pipeline_stage
num_moe_experts
num_query_groups
output_layer_init_method
persist_layer_norm
pipeline_model_parallel_layout
qk_layernorm
quant_recipe
recompute_granularity
recompute_method
recompute_modules
recompute_num_layers
rotary_interleaved
softmax_scale
softmax_type
symmetric_ar_type
test_mode
tp_only_amax_red
transformer_impl
use_fused_weighted_squared_relu
use_kitchen
use_mamba_mem_eff_path
use_te_activation_func
use_te_rng_tracker
window_attn_skip_freq
window_size
- transformer.transformer_layer module
- transformer.utils module
attention_mask_func()
erf_gelu()
gelu_impl()
get_default_causal_mask()
get_linear_layer()
get_sliding_window_causal_mask()
init_cuda_graph_cache()
is_layer_window_attention()
make_sharded_object_for_checkpoint()
make_sharded_tensors_for_checkpoint()
openai_gelu()
set_model_to_sequence_parallel()
sharded_state_dict_default()
toggle_cuda_graphs()
- Module contents
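Most classes in this package are configured through the TransformerConfig dataclass documented above. A minimal sketch of constructing one (values are illustrative): in practice at least the layer count, hidden size, and attention-head count must be set, and derived fields such as ffn_hidden_size and kv_channels are filled in automatically when left unset.

```python
from megatron.core.transformer import TransformerConfig

config = TransformerConfig(
    num_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    # a few of the commonly tuned attributes listed above
    hidden_dropout=0.1,
    attention_dropout=0.1,
    normalization="LayerNorm",
)

print(config.ffn_hidden_size)  # defaults to 4 * hidden_size = 3072
print(config.kv_channels)      # defaults to hidden_size // num_attention_heads = 64
```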
- Mixture of Experts package
- dist_checkpointing package
- Safe Checkpoint Loading
- Subpackages
- Submodules
- dist_checkpointing.serialization module
- dist_checkpointing.mapping module
LocalNonpersistentObject
ShardedBase
ShardedObject
ShardedTensor
allow_shape_mismatch
axis_fragmentations
data
dtype
flattened_range
from_rank_offsets()
from_rank_offsets_flat()
global_coordinates()
global_offset
global_shape
global_slice()
has_regular_grid
init_data()
key
local_chunk_offset_in_global()
local_coordinates()
local_shape
max_allowed_chunks()
narrow()
prepend_axis_num
replica_id
validate_metadata_integrity()
without_data()
ShardedTensorFactory
apply_factories()
apply_factory_merges()
is_main_replica()
- dist_checkpointing.optimizer module
- dist_checkpointing.core module
- dist_checkpointing.dict_utils module
- dist_checkpointing.utils module
add_prefix_for_sharding()
apply_prefix_mapping()
debug_msg()
debug_time()
extract_nonpersistent()
extract_sharded_base()
extract_sharded_tensors()
extract_sharded_tensors_and_factories()
extract_sharded_tensors_or_nonpersistent()
force_all_tensors_to_non_fp8()
logger_stack()
replace_prefix_for_sharding()
zip_strict()
- Module contents
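A minimal sketch of how the mapping and serialization modules above fit together, assuming `torch.distributed` is already initialized; the parameter key, shard coordinates, and checkpoint path are illustrative placeholders.

```python
import torch
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing import ShardedTensor

tp_rank, tp_size = 0, 1                 # placeholders; real code queries parallel_state
local_weight = torch.randn(1024, 4096)  # this rank's shard of the full weight

# Describe how the local shard fits into the global tensor: sharded along dim 0,
# holding fragment `tp_rank` out of `tp_size`.
sharded_state_dict = {
    "decoder.linear.weight": ShardedTensor.from_rank_offsets(
        "decoder.linear.weight", local_weight, (0, tp_rank, tp_size)
    )
}

dist_checkpointing.save(sharded_state_dict, "/tmp/ckpt")           # example path
loaded = dist_checkpointing.load(sharded_state_dict, "/tmp/ckpt")  # plain tensors back
```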
- Distributed Optimizer
- distributed package
- datasets package
- Data Pipeline
- Submodules
- datasets.blended_megatron_dataset_config module
- datasets.blended_megatron_dataset_builder module
- datasets.megatron_tokenizer module
- datasets.indexed_dataset module
- datasets.megatron_dataset module
- datasets.gpt_dataset module
- datasets.masked_dataset module
- datasets.bert_dataset module
- datasets.t5_dataset module
- datasets.blended_dataset module
- datasets.utils module
- Module contents
- Multi-Latent Attention
- Microbatches Calculator
- Module contents
ConstantNumMicroBatchesCalculator
NumMicroBatchesCalculator
RampupBatchsizeNumMicroBatchesCalculator
destroy_num_microbatches_calculator()
get_current_global_batch_size()
get_current_running_global_batch_size()
get_micro_batch_size()
get_num_microbatches()
init_num_microbatches_calculator()
reconfigure_num_microbatches_calculator()
unset_num_microbatches_calculator()
update_num_microbatches()
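A minimal sketch of driving this calculator, with illustrative sizes: a global batch of 256 split into micro batches of 4 on 8 data-parallel ranks gives 256 / (4 × 8) = 8 microbatches per rank per step.

```python
from megatron.core.num_microbatches_calculator import (
    get_current_global_batch_size,
    get_num_microbatches,
    init_num_microbatches_calculator,
    update_num_microbatches,
)

init_num_microbatches_calculator(
    rank=0,
    rampup_batch_size=None,   # e.g. [64, 32, 1_000_000] ramps 64 -> 256 in steps of 32
    global_batch_size=256,
    micro_batch_size=4,
    data_parallel_size=8,
)

print(get_num_microbatches())           # 8
print(get_current_global_batch_size())  # 256

# With a ramp-up schedule, call this every iteration so the calculator can grow the
# global batch size as samples are consumed; with a constant schedule it is a no-op.
update_num_microbatches(consumed_samples=256)
```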
- Optimizer Parameters Scheduler
- Optimizer CPU offload package
- Multi-Token Prediction (MTP)
- New Tokenizer System