Checkpointing

Checkpointing is a feature in DALI which allows you to save the current state of the pipeline to a file. Then, you can restore the pipeline from a saved checkpoint and the new pipeline will produce exactly the same outputs as the old one would. It is particularly useful for long-running training jobs which are likely to be interrupted.

A checkpoint of DALI pipeline contains information about states of all random number generators used in the pipeline and about the progress of each reader.

Checkpointing API

Enabling checkpointing

To enable checkpointing, set enable_checkpointing=True when creating a pipeline. With this option enabled, DALI will track the state of each operator, allowing you to save it on demand. Enabling checkpointing shouldn’t have any impact on the performance.

@pipeline_def(..., enable_checkpointing=True)
def pipeline():
    ...

p = pipeline()
p.build()

Note

Readers with shuffle_after_epoch=True might shuffle samples differently if checkpointing is enabled.

Saving a checkpoint

To save a checkpoint, you need to call Pipeline.checkpoint() method, which will return a serialized checkpoint as a string. Optionally, you can pass filename as an argument and DALI will save the checkpoint there.

for _ in range(iters):
    output = p.run()

# Write the checkpoint to file:
checkpoint = p.checkpoint()
open('checkpoint_file.cpt', 'wb')

# Or simply:
checkpoint = p.checkpoint('checkpoint_file.cpt')

Note

Calling Pipeline.checkpoint() method may introduce an observable overhead. We recommend you not to call it too often.

Restoring from checkpoint

You can later restore pipeline state from a saved checkpoint. To do so, pass checkpoint argument to Pipeline on construction. Such a pipeline should then return exactly the same outputs as the original one.

checkpoint = open('checkpoint_file.cpt', 'rb').read()
p_restored = pipeline(checkpoint=checkpoint)
p_restored.build()

Warning

Make sure that the pipeline that you’re restoring is the same as the original one, i.e. contains the same operators with the same arguments. Restoring from a checkpoint created with a different pipeline will result in undefined behavior.

External source checkpointing

fn.external_source() operator only partially supports checkpointing.

Checkpointing is supported only if source is a single-argument callable accepting batch index, BatchInfo or SampleInfo. For such sources, the queries will continue from the point saved in the checkpoint.

Other kinds of source don’t support checkpointing. Their state won’t be saved in a checkpoint and after restoring from a checkpoint, they will start from the beginning. If you want to use checkpointing, we recommend you rewrite your source to be a supported callable.

Checkpointing in TensorFlow plugin

plugin.tf.DALIDataset is integrated with TensorFlow’s tf.train.checkpoint. Please refer to TensorFlow checkpointing documentation page for more details.

Warning

Checkpointing is currently not supported for plugin.tf.experimental.DALIDatasetWithInputs.

Warning

Checkpointing is currently not supported for GPU datasets.