Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Upgrade guide to use lightning 2.0
Replace `trainer.strategy=null` with `trainer.strategy=auto`, since lightning 2.0 does not have a `None` strategy.

Remove `resume_from_checkpoint` if it is being used as a Trainer flag, and pass the checkpoint path to the `Trainer.fit(ckpt_path="...")` method instead.
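For example, a minimal sketch of both changes (the model object and checkpoint path are placeholders):

```python
import pytorch_lightning as pl

# lightning 2.0: use the "auto" strategy instead of a null/None strategy
trainer = pl.Trainer(max_epochs=1, strategy="auto")

# lightning 2.0: resume_from_checkpoint is no longer a Trainer flag;
# pass the checkpoint path to Trainer.fit() instead
trainer.fit(model, ckpt_path="/path/to/last.ckpt")  # `model` is your LightningModule
```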
Set `trainer.strategy = "ddp_find_unused_parameters_true"` if there are unused parameters in your model, since lightning 2.0 sets `find_unused_parameters` to False by default. Reference: NeMo PR 6433. More details about this change: lightning PR 16611.
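A minimal sketch of opting back in to unused-parameter detection:

```python
import pytorch_lightning as pl

# lightning 2.0 defaults to find_unused_parameters=False for DDP;
# use this strategy alias if some parameters receive no gradient in a step
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_find_unused_parameters_true")
```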
If you used the Trainer's `replace_sampler_ddp` flag, replace it with `use_distributed_sampler`.
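For example, assuming the flag was previously set explicitly:

```python
import pytorch_lightning as pl

# lightning <2.0
# trainer = pl.Trainer(replace_sampler_ddp=False)

# lightning 2.0: the flag is renamed
trainer = pl.Trainer(use_distributed_sampler=False)
```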
If using `CheckpointConnector`, replace it with `_CheckpointConnector`.
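A sketch of the import change, assuming the class still lives under `pytorch_lightning.trainer.connectors.checkpoint_connector` in your lightning version:

```python
# lightning <2.0
# from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector

# lightning 2.0: the connector class is protected
from pytorch_lightning.trainer.connectors.checkpoint_connector import _CheckpointConnector
```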
To set or get `ckpt_path`, use `trainer.ckpt_path` directly instead of calling the protected API via `trainer._checkpoint_connector._ckpt_path` or using `trainer._checkpoint_connector.resume_from_checkpoint_fit_path`.
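For example (the path is a placeholder and `trainer` is an existing `Trainer`):

```python
# lightning <2.0 (protected API, no longer recommended)
# ckpt_path = trainer._checkpoint_connector.resume_from_checkpoint_fit_path

# lightning 2.0: use the public attribute
trainer.ckpt_path = "/path/to/last.ckpt"  # set
ckpt_path = trainer.ckpt_path             # get
```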
Change the import of `load` from `pytorch_lightning.utilities.cloud_io` to `_load`.
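For example (the checkpoint path is a placeholder; `map_location` is shown only as a common usage):

```python
# lightning <2.0
# from pytorch_lightning.utilities.cloud_io import load

# lightning 2.0: the helper is protected
from pytorch_lightning.utilities.cloud_io import _load

checkpoint = _load("/path/to/last.ckpt", map_location="cpu")
```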
If you used `from pytorch_lightning.plugins.precision.native_amp import NativeMixedPrecisionPlugin`, replace it with `from pytorch_lightning.plugins.precision import MixedPrecisionPlugin`.
Lightning 2.0 adds `'16-mixed'` and `'bf16-mixed'` as the precision values for fp16 mixed precision and bf16 mixed precision, respectively. For backward compatibility, `16` or `'16'` and `'bf16'` also perform mixed precision and are equivalent to `'16-mixed'` and `'bf16-mixed'`, respectively. However, lightning recommends using `'16-mixed'` and `'bf16-mixed'` to make it less ambiguous. Because of this, `MegatronHalfPrecisionPlugin`'s parent class from lightning, the `MixedPrecisionPlugin` class, expects the precision arg to be `'16-mixed'` or `'bf16-mixed'`. As a result, it is required to pass `'16-mixed'` or `'bf16-mixed'` to `MixedPrecisionPlugin` whenever the precision passed is any of `[16, '16', '16-mixed']` or `['bf16', 'bf16-mixed']`. This can be taken care of as shown here: NeMo upgrade to lightning 2.0 PR, and here: MixedPrecisionPlugin. Also, `'32-true'` is added as a precision value for pure fp32, alongside the existing `32` and `'32'`. This can be taken into account as shown here in the NeMo upgrade to lightning 2.0 PR.
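A minimal sketch (not NeMo's actual code) of normalizing legacy precision values to the names lightning 2.0 expects; the helper name `normalize_precision` is hypothetical:

```python
from pytorch_lightning.plugins.precision import MixedPrecisionPlugin

def normalize_precision(precision):
    """Map legacy precision values to the lightning 2.0 names."""
    if precision in (16, "16", "16-mixed"):
        return "16-mixed"
    if precision in ("bf16", "bf16-mixed"):
        return "bf16-mixed"
    if precision in (32, "32", "32-true"):
        return "32-true"
    return precision

# e.g. always hand the normalized value to the precision plugin
plugin = MixedPrecisionPlugin(precision=normalize_precision(16), device="cuda")
```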
Lightning 2.0 renames the epoch-end hooks from `training_epoch_end`, `validation_epoch_end`, `test_epoch_end` to `on_train_epoch_end`, `on_validation_epoch_end`, `on_test_epoch_end`. The renamed hooks do not accept the `outputs` arg; instead, the outputs need to be stored in an instance variable of the model class, to which the outputs of each step are manually appended. More detailed examples implementing this can be found in the migration guide of lightning's PR 16520. An example from NeMo can be found here.
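A minimal sketch of the pattern for the validation hook (the module and loss below are placeholders for illustration):

```python
import torch
import pytorch_lightning as pl

class ExampleModel(pl.LightningModule):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.validation_step_outputs = []  # collect step outputs manually

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()  # placeholder loss
        self.validation_step_outputs.append(loss)
        return loss

    # lightning <2.0 hook (received `outputs`):
    # def validation_epoch_end(self, outputs):
    #     avg_loss = torch.stack(outputs).mean()

    # lightning 2.0 hook: renamed, no `outputs` argument
    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log("val_loss", avg_loss)
        self.validation_step_outputs.clear()  # free memory for the next epoch
```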
Lightning 2.0 does not currently support multiple dataloaders for validation and testing when `dataloader_iter` is used. Support for this will be added back in an upcoming release. If `dataloader_iter` is being used and your config passes multiple files to `validation_ds.file_names` or `test_ds.file_names`, please use just one file until this issue is fixed in pytorch lightning.
With lightning 2.0, `limit_val_batches` and `num_sanity_val_steps` must be a multiple of the number of microbatches when using `dataloader_iter` (this applies only to Megatron files that use `dataloader_iter`) for all pretraining files (not downstream tasks like finetuning). This is handled internally in NeMo and does not require anything to be done by the user. However, if you are a NeMo developer building a new pretraining model that uses `dataloader_iter` instead of `batch` in the `validation_step` method, please make sure to call `self._reconfigure_val_batches()` in the `build_train_valid_test_datasets` method of your model.
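A rough sketch of where that call sits, assuming a model that inherits from NeMo's Megatron base model (the class name and dataset-building details are placeholders):

```python
# import path assumed; adjust to your NeMo version
from nemo.collections.nlp.models.language_modeling.megatron_base_model import MegatronBaseModel

class MyMegatronPretrainingModel(MegatronBaseModel):  # hypothetical model
    def build_train_valid_test_datasets(self):
        # required when validation_step consumes dataloader_iter: keeps
        # limit_val_batches / num_sanity_val_steps a multiple of the
        # number of microbatches
        self._reconfigure_val_batches()
        # ... build and return the train/valid/test datasets as usual
```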
If the model is being wrapped with `LightningDistributedModule` in the `configure_ddp` method, replace it with `_LightningModuleWrapperBase`, as is done here: NeMo upgrade to lightning 2.0 PR.
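A rough sketch of the swap, assuming `_LightningModuleWrapperBase` is importable from `pytorch_lightning.overrides.base` in your lightning version and that your custom strategy subclasses `DDPStrategy` (not NeMo's exact implementation):

```python
from pytorch_lightning.overrides.base import _LightningModuleWrapperBase
from pytorch_lightning.strategies import DDPStrategy
from torch.nn.parallel import DistributedDataParallel

class MyDDPStrategy(DDPStrategy):  # hypothetical strategy for illustration
    def configure_ddp(self):
        # previously wrapped with LightningDistributedModule(self.model)
        self._model = DistributedDataParallel(
            _LightningModuleWrapperBase(self.model),
            **self._ddp_kwargs,
        )
```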
If using `pre_configure_ddp()` in your DDP strategy, remove it, as it is no longer required. NeMo upgrade to lightning 2.0 PR.
If any of the tests use CPU as the device, ensure that it is passed explicitly to the trainer, as in `trainer = pl.Trainer(max_epochs=1, accelerator='cpu')`, since the default value in PTL >= 2.0 is `auto`, which picks cuda when a GPU is available.
If using `from pytorch_lightning.loops import TrainingEpochLoop`, replace `TrainingEpochLoop` with `_TrainingEpochLoop`.
If using `trainer.fit_loop.max_steps`, replace it with `trainer.fit_loop.epoch_loop.max_steps`.
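A minimal sketch of both loop-related changes (`trainer` is an existing `Trainer` instance):

```python
# lightning 2.0: the loop class is protected
from pytorch_lightning.loops import _TrainingEpochLoop

# lightning 2.0: max_steps now lives on the epoch loop
max_steps = trainer.fit_loop.epoch_loop.max_steps
```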