Checkpointing
Checkpointing is a DALI feature that lets you save the current state of a pipeline to a file. You can then restore a pipeline from the saved checkpoint, and the new pipeline will produce exactly the same outputs as the old one would have. This is particularly useful for long-running training jobs that are likely to be interrupted.
A checkpoint of a DALI pipeline contains the state of every random number generator used in the pipeline and the progress of each reader.
Checkpointing API
Enabling checkpointing
To enable checkpointing, set enable_checkpointing=True when creating a pipeline.
With this option enabled, DALI tracks the state of each operator so that you can save it on demand.
Enabling checkpointing should not have a noticeable impact on performance.
from nvidia.dali import pipeline_def

@pipeline_def(..., enable_checkpointing=True)
def pipeline():
    ...

p = pipeline()
p.build()
Note
Readers with shuffle_after_epoch=True might shuffle samples differently if checkpointing is enabled.
Saving a checkpoint
To save a checkpoint, call the Pipeline.checkpoint() method, which returns a serialized checkpoint as a string.
Optionally, you can pass a filename as an argument and DALI will save the checkpoint there.
for _ in range(iters):
    output = p.run()

# Write the checkpoint to a file:
checkpoint = p.checkpoint()
with open('checkpoint_file.cpt', 'wb') as f:
    f.write(checkpoint)

# Or simply:
checkpoint = p.checkpoint('checkpoint_file.cpt')
Note
Calling the Pipeline.checkpoint() method may introduce noticeable overhead.
We recommend that you do not call it too often.
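One way to keep that overhead bounded is to save a checkpoint only every few iterations. The sketch below is illustrative rather than part of DALI: run_with_periodic_checkpoints and _atomic_write are hypothetical helper names, and pipe is assumed to be any object exposing run() and checkpoint() the way a pipeline built with enable_checkpointing=True does. Writing to a temporary file and renaming it is an extra precaution added here, so that an interruption during the write does not leave a truncated checkpoint behind.

```python
import os
import tempfile


def _atomic_write(path, data):
    """Write `data` to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def run_with_periodic_checkpoints(pipe, iters, every, path):
    """Run `pipe` for `iters` iterations, saving a checkpoint every `every` iterations.

    Hypothetical helper: `pipe` only needs run() and checkpoint().
    Returns the number of checkpoints written.
    """
    if every <= 0:
        raise ValueError("every must be a positive integer")
    saved = 0
    for i in range(1, iters + 1):
        pipe.run()
        if i % every == 0:
            # checkpoint() is the only call with noticeable cost, so it is
            # made once per `every` iterations instead of once per iteration.
            _atomic_write(path, pipe.checkpoint())
            saved += 1
    return saved
```

The right value of every depends on how much work you are willing to repeat after an interruption compared with the cost of each checkpoint() call.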
Restoring from checkpoint
You can later restore the pipeline state from a saved checkpoint.
To do so, pass the checkpoint argument to Pipeline on construction.
Such a pipeline should then return exactly the same outputs as the original one.
with open('checkpoint_file.cpt', 'rb') as f:
    checkpoint = f.read()

p_restored = pipeline(checkpoint=checkpoint)
p_restored.build()
Warning
Make sure that the pipeline that you’re restoring is the same as the original one, i.e. contains the same operators with the same arguments. Restoring from a checkpoint created with a different pipeline will result in undefined behavior.
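In a job that may be restarted, the same script usually has to handle both the first launch and every later resume. A minimal sketch of that pattern follows, assuming pipeline_factory is the function decorated with pipeline_def (such as pipeline above) and that it accepts the checkpoint keyword as described in this section. The helper name build_or_restore is hypothetical.

```python
import os


def build_or_restore(pipeline_factory, checkpoint_path):
    """Build a pipeline, restoring its state if `checkpoint_path` exists.

    Hypothetical helper: `pipeline_factory` is assumed to return an object
    with a build() method and to accept an optional `checkpoint` keyword.
    """
    kwargs = {}
    if os.path.isfile(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            # Pass the serialized state only when a checkpoint was found,
            # so a fresh start uses the factory's default arguments.
            kwargs["checkpoint"] = f.read()
    pipe = pipeline_factory(**kwargs)
    pipe.build()
    return pipe
```

The factory must define the same pipeline that produced the checkpoint. This helper does not and cannot verify that.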
External source checkpointing
The fn.external_source() operator only partially supports checkpointing.
Checkpointing is supported only if source is a single-argument callable accepting a batch index, BatchInfo, or SampleInfo. For such sources, the queries will continue from the point saved in the checkpoint.
Other kinds of source don’t support checkpointing. Their state won’t be saved in a checkpoint, and after restoring from a checkpoint they will start from the beginning.
If you want to use checkpointing, we recommend that you rewrite your source to be a supported callable.
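To illustrate what a supported callable can look like, the sketch below derives every batch purely from the batch index, so the source keeps no position of its own that a checkpoint would have to capture. The class name IndexedBatchSource and the load_sample parameter are hypothetical. load_sample stands for your own function that maps a sample index to data in a form external_source accepts, for example a NumPy array.

```python
class IndexedBatchSource:
    """Single-argument callable whose output depends only on the batch index."""

    def __init__(self, num_samples, batch_size, load_sample):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.load_sample = load_sample  # user-supplied: sample index -> data

    def __call__(self, batch_index):
        start = batch_index * self.batch_size
        if start >= self.num_samples:
            # No samples left for this index: signal the end of the data.
            raise StopIteration
        stop = min(start + self.batch_size, self.num_samples)
        # The last batch may be shorter than batch_size.
        return [self.load_sample(i) for i in range(start, stop)]
```

An instance would be passed as the source argument of fn.external_source(). Because calling it with index k always returns the same batch, resuming at the index stored in a checkpoint yields the same data as an uninterrupted run. A generator or iterator, by contrast, would restart from its first element.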
Checkpointing in TensorFlow plugin
plugin.tf.DALIDataset is integrated with TensorFlow’s tf.train.Checkpoint.
Please refer to the TensorFlow checkpointing documentation page for more details.
Warning
Checkpointing is currently not supported for plugin.tf.experimental.DALIDatasetWithInputs.
Warning
Checkpointing is currently not supported for GPU datasets.