nvidia.dali.fn.readers.tfrecord

nvidia.dali.fn.readers.tfrecord(*, device=None, name=None, bytes_per_sample_hint=[0], dont_use_mmap=False, features, index_path, initial_fill=1024, lazy_init=False, num_shards=1, pad_last_batch=False, path, prefetch_queue_depth=1, preserve=False, random_shuffle=False, read_ahead=False, seed=-1, shard_id=0, shuffle_after_epoch=False, shuffle_after_epoch_seed=None, skip_cached_images=False, stick_to_shard=False, tensor_init_bytes=1048576, use_o_direct=False)

Reads samples from a TensorFlow TFRecord file.

Supported backends

‘cpu’

Keyword Arguments:

bytes_per_sample_hint (int or list of int, optional, default = [0]) –
Output size hint, in bytes per sample.

If specified, the operator’s outputs residing in GPU or page-locked host memory will be preallocated to accommodate a batch of samples of this size.
dont_use_mmap (bool, optional, default = False) –
If set to True, the Loader will use plain file I/O instead of trying to map the file into memory.

Memory mapping provides a small performance benefit when accessing a local file system, but most network file systems do not perform optimally with it.
features (dict of (string, nvidia.dali.tfrecord.Feature)) –
A dictionary that maps the names of the TFRecord features to extract to their feature types.

Typically obtained by using the dali.tfrecord.FixedLenFeature and dali.tfrecord.VarLenFeature helper functions, which are equivalent to TensorFlow’s tf.FixedLenFeature and tf.VarLenFeature types, respectively. For additional flexibility, dali.tfrecord.VarLenFeature supports the partial_shape parameter. If provided, the data will be reshaped to match its value, and the first dimension will be inferred from the data size.

If the named feature does not exist in the processed TFRecord entry, an empty tensor is returned.
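The shape inference described for partial_shape can be sketched in plain Python. The helper name infer_shape is hypothetical and is not part of DALI; it only mirrors the rule stated above, where the trailing dimensions come from partial_shape and the first dimension is derived from the number of elements. Raising an error for a size that does not divide evenly is an assumption of this sketch, not documented reader behavior.

```python
from math import prod


def infer_shape(num_elements, partial_shape):
    """Hypothetical helper (not a DALI API): prepend an inferred first
    dimension to partial_shape so the result holds num_elements values."""
    inner = prod(partial_shape)  # elements per slice along the first dimension
    if inner <= 0 or num_elements % inner != 0:
        # Assumption of this sketch: sizes that do not divide evenly are rejected.
        raise ValueError("data size is not divisible by the partial shape volume")
    return (num_elements // inner, *partial_shape)


# 12 values reshaped with partial_shape=(2, 3) give a leading dimension of 2.
print(infer_shape(12, (2, 3)))  # prints (2, 2, 3)
```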
index_path (str or list of str) –
List of paths to index files. There should be one index file for every TFRecord file.

The index files can be obtained from TFRecord files by using the tfrecord2idx script that is distributed with DALI.
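To give an idea of what such an index may contain, the sketch below walks the standard TFRecord framing (an 8-byte little-endian payload length, a 4-byte length CRC, the payload, and a 4-byte payload CRC) and collects one entry per record. It assumes that each index entry is the byte offset of a record followed by its total size; build_index and fake_record are hypothetical helpers written for illustration. For real data, use the bundled tfrecord2idx script.

```python
import io
import struct


def build_index(stream):
    """Hypothetical helper: return an (offset, size) pair for every record."""
    entries = []
    while True:
        offset = stream.tell()
        header = stream.read(8)  # uint64 payload length
        if len(header) < 8:
            break  # end of file
        (length,) = struct.unpack("<Q", header)
        # Skip the length CRC, the payload, and the payload CRC.
        stream.seek(4 + length + 4, io.SEEK_CUR)
        entries.append((offset, stream.tell() - offset))
    return entries


def fake_record(payload):
    """Frame a payload like a TFRecord entry; the CRC fields are zero-filled
    because this sketch never validates them."""
    return struct.pack("<Q", len(payload)) + b"\0" * 4 + payload + b"\0" * 4


data = io.BytesIO(fake_record(b"abc") + fake_record(b"hello"))
for offset, size in build_index(data):
    print(offset, size)  # one "offset size" line per record
```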
initial_fill (int, optional, default = 1024) –
Size of the buffer that is used for shuffling.

If random_shuffle is False, this parameter is ignored.
lazy_init (bool, optional, default = False) – Parse and prepare the dataset metadata only during the first run instead of in the constructor.
num_shards (int, optional, default = 1) –
Partitions the data into the specified number of parts (shards).

This is typically used for multi-GPU or multi-node training.
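As a rough illustration of sharding, the following sketch assumes the dataset is cut into contiguous sample ranges with floor division, and that shard_id selects one of them. The helper shard_range is hypothetical, and the exact boundaries used by the reader may differ.

```python
def shard_range(dataset_size, num_shards, shard_id):
    """Hypothetical helper: half-open [start, end) sample range of one shard."""
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    return start, end


# Ten samples split across three shards; the last shard takes the remainder.
for shard_id in range(3):
    print(shard_id, shard_range(10, 3, shard_id))
# prints:
# 0 (0, 3)
# 1 (3, 6)
# 2 (6, 10)
```

In a multi-GPU job, each process would typically pass its own rank as shard_id and the world size as num_shards.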
pad_last_batch (bool, optional, default = False) –
If set to True, pads the shard by repeating the last sample.

Note

If the number of batches differs across shards, this option can cause an entire batch of repeated samples to be added to the dataset.
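The effect described in the note can be pictured with a small sketch. It assumes that every shard is padded up to the largest per-shard batch count; pad_shard is a hypothetical helper, and the reader's actual padding logic may differ in detail.

```python
import math


def pad_shard(samples, batch_size, num_batches):
    """Hypothetical helper: repeat the last sample until the shard fills
    num_batches batches of batch_size samples each."""
    missing = num_batches * batch_size - len(samples)
    return samples + [samples[-1]] * max(missing, 0)


shards = [[0, 1, 2, 3, 4], [5, 6, 7, 8]]
batch_size = 4
# Assumption: all shards are padded to the largest per-shard batch count.
num_batches = max(math.ceil(len(s) / batch_size) for s in shards)
for shard in shards:
    print(pad_shard(shard, batch_size, num_batches))
# prints:
# [0, 1, 2, 3, 4, 4, 4, 4]
# [5, 6, 7, 8, 8, 8, 8, 8]
```

The second shard already filled exactly one batch, so matching the two batches of the first shard adds a whole batch made of the repeated last sample.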
path (str or list of str) – List of paths to TFRecord files.
prefetch_queue_depth (int, optional, default = 1) –
Specifies the number of batches to be prefetched by the internal Loader.

This value should be increased when the pipeline is CPU-stage bound, trading memory consumption for better interleaving with the Loader thread.
preserve (bool, optional, default = False) – Prevents the operator from being removed from the graph even if its outputs are not used.
random_shuffle (bool, optional, default = False) –
Determines whether to randomly shuffle data.

A prefetch buffer with a size equal to initial_fill is used to read data sequentially, and then samples are selected randomly to form a batch.
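Buffer-based shuffling of this kind can be sketched as follows. This is a generic shuffle-buffer illustration, not the reader's implementation: buffered_shuffle is a hypothetical helper, buffer_size plays the role of initial_fill, and the replacement policy (swap the picked slot with the next sequential sample) is an assumption.

```python
import random


def buffered_shuffle(samples, buffer_size, seed=0):
    """Hypothetical helper: yield samples in buffer-shuffled order."""
    rng = random.Random(seed)
    source = iter(samples)
    buffer = []
    # Fill the buffer by reading sequentially from the source.
    for sample in source:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            break
    # Emit a random buffered sample and refill its slot with the next read.
    for sample in source:
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = sample
    # The source is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


order = list(buffered_shuffle(range(10), buffer_size=4))
print(order)  # a permutation of 0..9 whose first element comes from 0..3
```

A small buffer keeps reads sequential but limits how far a sample can move from its original position, which is why a larger initial_fill gives a more thorough shuffle.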
read_ahead (bool, optional, default = False) –
Determines whether the accessed data should be read ahead.

For large files such as LMDB, RecordIO, or TFRecord, this argument slows down the first access but speeds up all subsequent accesses.
seed (int, optional, default = -1) – Random seed; if not set, one will be assigned automatically.
shard_id (int, optional, default = 0) – Index of the shard to read.
shuffle_after_epoch (bool, optional, default = False) –
If set to True, the reader reshuffles the order of the source files after each epoch, while preserving sequential reads within each file.

This keeps I/O access patterns sequential: only the order in which whole files are visited changes between epochs. random_shuffle can be combined with this option to additionally shuffle samples within the pipeline’s prefetch buffer.

stick_to_shard cannot be used when this argument is set to True.
shuffle_after_epoch_seed (int, optional) –
Random seed for the file-order shuffling performed after each epoch.

If not provided, a fixed default seed is used, which results in the same shuffling pattern across different training runs. Providing a custom seed allows for different shuffle patterns across training runs, which may be desirable for better statistical properties.

Note

When using multiple DALI pipelines (e.g., for multi-GPU training), all pipeline instances should use the same shuffle_after_epoch_seed to ensure a consistent global file-order shuffle across all shards.

Note

This argument has no effect unless shuffle_after_epoch is set to True.
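The reason all pipelines should share the seed can be shown with a small sketch. It assumes the per-epoch file order is derived from the seed and the epoch number; epoch_file_order is a hypothetical helper, and how the reader actually derives its per-epoch randomness is not specified here.

```python
import random


def epoch_file_order(files, epoch, seed):
    """Hypothetical helper: order in which whole files are visited in an epoch."""
    order = list(files)
    # Assumption: the seed and the epoch number together fix the permutation.
    random.Random(seed + epoch).shuffle(order)
    return order


files = ["a.tfrecord", "b.tfrecord", "c.tfrecord", "d.tfrecord"]
# Two pipelines with the same seed agree on the global file order,
# so their shards stay disjoint and together cover every file.
gpu0 = epoch_file_order(files, epoch=1, seed=42)
gpu1 = epoch_file_order(files, epoch=1, seed=42)
print(gpu0 == gpu1)  # prints True
```

With different seeds, each pipeline would compute its own permutation, and slicing those permutations into shards could read some files twice and skip others.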
skip_cached_images (bool, optional, default = False) –
If set to True, loading of the data is skipped when the sample is already in the decoder cache.

In this case, the output of the loader will be empty.
stick_to_shard (bool, optional, default = False) –
Determines whether the reader should stick to a data shard instead of going through the entire dataset.

If decoder caching is used, this option significantly reduces the amount of data to be cached, but it might affect the accuracy of the training.
tensor_init_bytes (int, optional, default = 1048576) – Hint for how much memory to allocate per image.
use_o_direct (bool, optional, default = False) –
If set to True, the data will be read directly from storage, bypassing the system cache.

Mutually exclusive with dont_use_mmap=False; set dont_use_mmap=True when enabling this option.