Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Migrate Trainer Configuration from NeMo 1.0 to NeMo 2.0

In NeMo 2.0, the trainer is configured with the nemo.lightning.Trainer class rather than a trainer section in a YAML file. This guide walks you through migrating your trainer setup.

NeMo 1.0 (Previous Release)

In NeMo 1.0, the trainer was configured in the YAML configuration file.

trainer:
  num_nodes: 16
  devices: 8
  accelerator: gpu
  precision: bf16
  logger: False # logger provided by exp_manager
  max_epochs: null
  max_steps: 75000 # consumed_samples = global_step * global_batch_size
  max_time: "05:23:30:00"
  log_every_n_steps: 10
  val_check_interval: 2000
  limit_val_batches: 50
  limit_test_batches: 50
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0

NeMo 2.0 (New Release)

In NeMo 2.0, the trainer is configured using the nemo.lightning.Trainer class.

from nemo import lightning as nl

trainer = nl.Trainer(
    num_nodes=16,
    devices=8,
    accelerator="gpu",
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    max_epochs=None,
    max_steps=75000,
    max_time="05:23:30:00",
    log_every_n_steps=10,
    val_check_interval=2000,
    limit_val_batches=50,
    limit_test_batches=50,
    accumulate_grad_batches=1,
    gradient_clip_val=1.0,
)

Migration Steps

  1. Remove the trainer section from your YAML config file.

  2. Add the following import to your Python script:

    from nemo import lightning as nl
    
  3. Create a Trainer object with the appropriate parameters:

    trainer = nl.Trainer(
        num_nodes=16,
        devices=8,
        accelerator="gpu",
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
        max_epochs=None,
        max_steps=75000,
        max_time="05:23:30:00",
        log_every_n_steps=10,
        val_check_interval=2000,
        limit_val_batches=50,
        limit_test_batches=50,
        accumulate_grad_batches=1,
        gradient_clip_val=1.0,
    )
    
  4. Adjust the parameters in the Trainer to match your previous YAML configuration.

  5. Use the trainer object in your training script as needed; a short usage sketch follows.
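
As a minimal sketch of step 5, the example below drives the trainer created in step 3 with the standard PyTorch Lightning fit call. The model and data module are placeholders here: substitute the NeMo 2.0 model and data module you construct elsewhere in your script.

# `model` and `data` are placeholder names for a NeMo 2.0 model and data
# module created earlier in your script; nemo.lightning.Trainer exposes the
# usual PyTorch Lightning entry points, so training is started with fit().
trainer.fit(model, datamodule=data)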

Note

  • For most purposes, the nemo.lightning.Trainer class can be used exactly like PyTorch Lightning’s Trainer.

  • NeMo adds integration with its serialization system, allowing for exact recreation of the trainer used in a particular training run.

  • The precision parameter is no longer passed to the Trainer directly; it is set using the MegatronMixedPrecision plugin. Use "bf16-mixed" for BF16 precision (see the first sketch after these notes).

  • The logger parameter is no longer needed in the trainer configuration, as logging is handled separately by the NeMoLogger (see the exp-manager migration guide and the second sketch after these notes).
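
To illustrate the precision note, the sketch below maps the NeMo 1.0 YAML setting to its plugin-based equivalent. The "bf16-mixed" string is taken from this guide; the "16-mixed" string for FP16 is an assumption based on PyTorch Lightning's precision identifiers, so verify it against your NeMo version.

from nemo import lightning as nl

# NeMo 1.0 YAML:  precision: bf16
# NeMo 2.0: precision is configured through a plugin rather than a Trainer argument.
bf16_plugin = nl.MegatronMixedPrecision(precision="bf16-mixed")

# Assumption: FP16 mixed precision presumably uses the "16-mixed" identifier,
# mirroring PyTorch Lightning's precision strings.
fp16_plugin = nl.MegatronMixedPrecision(precision="16-mixed")

trainer = nl.Trainer(devices=8, accelerator="gpu", plugins=bf16_plugin)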
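
For the logger note, the following is a minimal sketch of creating a NeMoLogger alongside the trainer. The constructor arguments shown (log_dir, name) are assumptions for illustration; refer to the exp-manager migration guide for the exact NeMoLogger parameters and how the logger is attached to a training run.

from nemo import lightning as nl

# Assumed arguments for illustration only: log_dir and name may differ in your
# NeMo version; see the exp-manager migration guide for the exact signature.
nemo_logger = nl.NeMoLogger(
    log_dir="/results/my_experiment",
    name="my_experiment",
)

# The logger is configured next to the trainer rather than inside it,
# replacing the `logger: False` / exp_manager pattern from NeMo 1.0.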