Core APIs

Low-level API reference for core Megatron components.

  • transformer package
  • tensor_parallel package
  • pipeline_parallel package
  • fusions package
  • distributed package
  • datasets package
  • Data Pipeline
    • Data pre-processing
    • Data loading: construction
    • Data loading: implementation
  • dist_checkpointing package
    • Safe Checkpoint Loading
    • Checkpointing Distributed Optimizer
    • Subpackages
  • dist_checkpointing.strategies package
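
As a rough orientation (not part of this reference page), the packages listed above are consumed as submodules of megatron.core. The sketch below shows one plausible way they fit together; it assumes a recent Megatron-LM checkout, that torch.distributed has already been initialized (for example via torchrun), and that module paths may shift between releases. The specific hyperparameter values are illustrative only.

    # Hedged sketch: wiring together a few of the core packages listed above.
    from megatron.core import parallel_state, tensor_parallel
    from megatron.core.transformer.transformer_config import TransformerConfig

    # parallel_state sets up the tensor/pipeline/data parallel process groups
    # that the tensor_parallel and pipeline_parallel layers rely on.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
    )

    # Seed the per-rank CUDA RNG state used by tensor-parallel layers.
    tensor_parallel.model_parallel_cuda_manual_seed(123)

    # TransformerConfig (transformer package) carries the architecture
    # hyperparameters consumed by the model classes under megatron.core.models.
    config = TransformerConfig(
        num_layers=12,
        hidden_size=768,
        num_attention_heads=12,
    )

The remaining packages follow the same pattern: dist_checkpointing handles saving and loading sharded state dicts, and datasets provides the blended/indexed dataset builders referenced in the Data Pipeline section above.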
