bridge.training.tensor_inspect

Module Contents

Functions

initialize_tensor_inspect_pre_model_initialization

Initialize NVIDIA-DL-Framework-Inspect before model construction.

_maybe_attach_metric_loggers

Attach supported metric loggers (TensorBoard, W&B raw module).

finalize_tensor_inspect_post_model_initialization

Finalize setup after model creation: attach loggers, set names and groups.

tensor_inspect_step_if_enabled

Advance DLFw Inspect step if enabled.

tensor_inspect_end_if_enabled

Shut down DLFw Inspect if enabled.

Data

API

bridge.training.tensor_inspect.MISSING_NVINSPECT_MSG

‘nvdlfw_inspect is not available. Please install it with pip install nvdlfw-inspect.’

bridge.training.tensor_inspect.initialize_tensor_inspect_pre_model_initialization(
    tensor_inspect_config: megatron.bridge.training.config.TensorInspectConfig | None,
) -> None

Initialize NVIDIA-DL-Framework-Inspect before model construction.

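If tensor inspection is enabled but the nvdlfw_inspect API is unavailable or its initialization fails, an exception is raised so that training stops instead of continuing without inspection.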
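The "raise when enabled but unavailable" behavior described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the module's actual implementation: `SketchInspectConfig` is a hypothetical stand-in for `TensorInspectConfig` that models only an assumed `enabled` flag, `check_inspect_available` and its `have_nvinspect` parameter are hypothetical names, and `ImportError` is one plausible choice of exception type.

```python
import importlib.util
from dataclasses import dataclass

# Same text as the documented MISSING_NVINSPECT_MSG constant.
MISSING_NVINSPECT_MSG = (
    "nvdlfw_inspect is not available. "
    "Please install it with pip install nvdlfw-inspect."
)


@dataclass
class SketchInspectConfig:
    """Hypothetical stand-in for TensorInspectConfig; only `enabled` is modeled."""

    enabled: bool = False


def check_inspect_available(config, have_nvinspect=None):
    """Return False when inspection is off, True when it can proceed.

    Raises ImportError (an assumed exception type) when inspection is
    enabled but the nvdlfw_inspect package cannot be found.
    """
    # A missing or disabled config means there is nothing to set up.
    if config is None or not config.enabled:
        return False
    # Probe for the package without importing it, unless the caller
    # supplies the answer explicitly (useful for testing).
    if have_nvinspect is None:
        have_nvinspect = importlib.util.find_spec("nvdlfw_inspect") is not None
    if not have_nvinspect:
        raise ImportError(MISSING_NVINSPECT_MSG)
    return True
```

Failing fast here is the design point: a run that was asked to collect tensor statistics should not continue silently without them.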

bridge.training.tensor_inspect._maybe_attach_metric_loggers(
tensorboard_logger: Any | None,
wandb_logger: Any | None,
) -> None

Attach supported metric loggers (TensorBoard, W&B raw module).

bridge.training.tensor_inspect.finalize_tensor_inspect_post_model_initialization(
tensor_inspect_config: megatron.bridge.training.config.TensorInspectConfig | None,
model: list[megatron.core.transformer.MegatronModule],
tensorboard_logger: Any | None,
wandb_logger: Any | None,
current_training_step: int | None = None,
) -> None

Finalize setup after model creation: attach loggers, set names and groups.

bridge.training.tensor_inspect.tensor_inspect_step_if_enabled(
tensor_inspect_config: megatron.bridge.training.config.TensorInspectConfig | None,
) -> None

Advance DLFw Inspect step if enabled.

bridge.training.tensor_inspect.tensor_inspect_end_if_enabled(
tensor_inspect_config: megatron.bridge.training.config.TensorInspectConfig | None,
) -> None

Shut down DLFw Inspect if enabled.
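Taken together, the four public functions suggest a lifecycle: initialize before the model is built, finalize once it exists, advance once per training step, and shut down at the end. The sketch below shows one plausible call order using recording stand-ins so that it runs without Megatron Bridge installed. The names `train_with_inspect_hooks`, `hooks`, `build_model`, and `train_one_step` are hypothetical; calling the step hook after each iteration and the end hook in a `finally` block are assumptions about sensible usage, not documented behavior.

```python
from types import SimpleNamespace


def train_with_inspect_hooks(hooks, config, build_model, train_one_step, num_steps):
    """Run a toy training loop, calling the inspect hooks in lifecycle order."""
    # 1. Before model construction.
    hooks.pre_model_init(config)
    model = build_model()
    # 2. After model creation; the two None values stand for the
    #    TensorBoard and W&B loggers, which are optional.
    hooks.post_model_init(config, model, None, None, current_training_step=0)
    try:
        for step in range(num_steps):
            train_one_step(model, step)
            # 3. Advance the inspect step once per iteration (assumed placement).
            hooks.step(config)
    finally:
        # 4. Shut down even if training raised (a design choice of this sketch).
        hooks.end(config)
    return model


# Recording stand-ins. In real use, the four functions documented on this
# page would be passed in their place.
calls = []
recording_hooks = SimpleNamespace(
    pre_model_init=lambda cfg: calls.append("pre_model_init"),
    post_model_init=lambda cfg, model, tb, wandb, current_training_step=None: calls.append(
        "post_model_init"
    ),
    step=lambda cfg: calls.append("step"),
    end=lambda cfg: calls.append("end"),
)

train_with_inspect_hooks(
    recording_hooks,
    config=None,
    build_model=lambda: ["model-chunk"],
    train_one_step=lambda model, step: None,
    num_steps=2,
)
print(calls)  # ['pre_model_init', 'post_model_init', 'step', 'step', 'end']
```

Because every documented function accepts `None` for `tensor_inspect_config`, a training loop can call the hooks unconditionally and let each one decide whether inspection is enabled.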