core.dist_checkpointing.core#

Module for managing distributed checkpoints metadata.

Module Contents#

Classes#

CheckpointingConfig

Documents backends used in the checkpoint.

Functions#

check_is_distributed_checkpoint

Checks if metadata.json exists in the checkpoint and is a valid config.

maybe_load_config

Returns the checkpoint config if checkpoint_dir is a distributed checkpoint, and None otherwise.

save_config

Save given config to checkpoint directory.

Data#

API#

core.dist_checkpointing.core.CONFIG_FNAME#

'metadata.json'

exception core.dist_checkpointing.core.CheckpointingException#

Bases: Exception

Base class for checkpointing-related exceptions.

class core.dist_checkpointing.core.CheckpointingConfig#

Documents backends used in the checkpoint.

Checkpoint config keeps track of formats used for storing the sharded tensors (sharded_backend) and other objects (common_backend).

Note that versioning is not for the checkpoint content (which is application specific), but for the checkpoint format itself.

sharded_backend: str#

None

sharded_backend_version: int#

1

common_backend: str#

'torch'

common_backend_version: int#

1
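As a rough illustration of the fields listed above, the following sketch defines a local stand-in dataclass rather than importing the real class. The assumptions are that sharded_backend is a required field (it is documented without a default), that the config behaves like a plain dataclass, and that "zarr" is an acceptable backend name; it is used here only as an example value.

```python
from dataclasses import asdict, dataclass


@dataclass
class CheckpointingConfig:
    """Local stand-in mirroring the documented fields (not the library class)."""

    sharded_backend: str  # assumed required: documented without a default
    sharded_backend_version: int = 1
    common_backend: str = "torch"
    common_backend_version: int = 1


# "zarr" is only an example value chosen for illustration.
config = CheckpointingConfig(sharded_backend="zarr")

# The versions describe the checkpoint format, not the checkpoint content.
config_dict = asdict(config)
print(config_dict)
```

Only the backend for sharded tensors has to be chosen in this sketch; the remaining fields fall back to the documented defaults.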

core.dist_checkpointing.core.check_is_distributed_checkpoint(checkpoint_dir)#

Checks if metadata.json exists in the checkpoint and is a valid config.

Parameters:

checkpoint_dir – checkpoint directory

Returns:

True if metadata.json exists in the checkpoint and is a valid config.

Return type:

bool
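The check described above could look roughly like the sketch below. The function name looks_like_distributed_checkpoint is hypothetical, and the notion of a "valid config" used here (a JSON object containing a sharded_backend key) is an assumption, not the library's actual validation logic.

```python
import json
import tempfile
from pathlib import Path

CONFIG_FNAME = "metadata.json"  # value documented for CONFIG_FNAME


def looks_like_distributed_checkpoint(checkpoint_dir: str) -> bool:
    """Hypothetical stand-in: True if metadata.json exists and parses as a config."""
    config_path = Path(checkpoint_dir, CONFIG_FNAME)
    if not config_path.is_file():
        return False
    try:
        with config_path.open() as f:
            data = json.load(f)
    except json.JSONDecodeError:
        return False
    # Assumption: a valid config is a JSON object naming the sharded backend.
    return isinstance(data, dict) and "sharded_backend" in data


with tempfile.TemporaryDirectory() as tmp:
    empty_result = looks_like_distributed_checkpoint(tmp)  # no metadata.json yet
    Path(tmp, CONFIG_FNAME).write_text(json.dumps({"sharded_backend": "zarr"}))
    valid_result = looks_like_distributed_checkpoint(tmp)
```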

core.dist_checkpointing.core.maybe_load_config(
checkpoint_dir: str,
) -> Optional[core.dist_checkpointing.core.CheckpointingConfig]#

Returns the checkpoint config if checkpoint_dir is a distributed checkpoint, and None otherwise.

Parameters:

checkpoint_dir – checkpoint directory

Returns:

The checkpoint config if checkpoint_dir is a valid distributed checkpoint; None otherwise.

Return type:

Optional[CheckpointingConfig]
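Because the return value is optional, callers need to handle the None case. A minimal sketch of that pattern follows, using a hypothetical ConfigSketch dataclass and load_config_sketch function as stand-ins; it assumes the file is a JSON object whose keys match the config fields, which may differ from the library's behavior.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for CheckpointingConfig with the documented fields."""

    sharded_backend: str
    sharded_backend_version: int = 1
    common_backend: str = "torch"
    common_backend_version: int = 1


def load_config_sketch(checkpoint_dir: str) -> Optional[ConfigSketch]:
    """Return the parsed config, or None when metadata.json is absent."""
    config_path = Path(checkpoint_dir, "metadata.json")
    if not config_path.is_file():
        return None
    with config_path.open() as f:
        return ConfigSketch(**json.load(f))


with tempfile.TemporaryDirectory() as tmp:
    missing = load_config_sketch(tmp)  # directory has no metadata.json
    Path(tmp, "metadata.json").write_text('{"sharded_backend": "zarr"}')
    loaded = load_config_sketch(tmp)

# Branch on None before touching any config attribute.
backend = loaded.sharded_backend if loaded is not None else "unknown"
```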

core.dist_checkpointing.core.save_config(
config: core.dist_checkpointing.core.CheckpointingConfig,
checkpoint_dir: str,
)#

Save given config to checkpoint directory.

Parameters:
  • config – checkpoint config

  • checkpoint_dir – checkpoint directory

Returns:

None
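One plausible way to picture the save step is serializing the config fields to metadata.json inside the checkpoint directory. The sketch below uses the hypothetical names ConfigSketch and save_config_sketch, and assumes JSON serialization of a dataclass; it is not taken from the library source.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for CheckpointingConfig with the documented fields."""

    sharded_backend: str
    sharded_backend_version: int = 1
    common_backend: str = "torch"
    common_backend_version: int = 1


def save_config_sketch(config: ConfigSketch, checkpoint_dir: str) -> None:
    """Write the config fields as JSON to metadata.json (assumed format)."""
    config_path = Path(checkpoint_dir, "metadata.json")
    with config_path.open("w") as f:
        json.dump(asdict(config), f)


with tempfile.TemporaryDirectory() as tmp:
    save_config_sketch(ConfigSketch(sharded_backend="zarr"), tmp)
    # Read the raw file back to confirm what was written.
    written = json.loads(Path(tmp, "metadata.json").read_text())
```

The sketch assumes the checkpoint directory already exists and makes no attempt to create it.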