nvidia.dali.fn.readers.webdataset

nvidia.dali.fn.readers.webdataset(*, ext, paths, bytes_per_sample_hint=[0], case_sensitive_extensions=True, dont_use_mmap=False, dtypes=None, index_paths=[], initial_fill=1024, lazy_init=False, missing_component_behavior='', num_shards=1, pad_last_batch=False, prefetch_queue_depth=1, preserve=False, random_shuffle=False, read_ahead=False, seed=-1, shard_id=0, skip_cached_images=False, stick_to_shard=False, tensor_init_bytes=1048576, device=None, name=None)

A reader for the webdataset format.

The webdataset format is a way of providing efficient access to datasets stored in tar archives.

Storing data in POSIX tar archives greatly speeds up I/O operations on mechanical storage devices and on network file systems because it allows the operating system to reduce the number of I/O operations and to read the data ahead.

WebDataset fulfills a similar function to TensorFlow's TFRecord/tf.Example classes, but is much easier to adopt because it does not require any data conversion. The data is stored in exactly the same format inside tar files as it is on disk, and all preprocessing and data augmentation code remains unchanged.

The dataset consists of one or more tar archives, each of which is further split into samples. A sample contains one or more components that correspond to the actual files contained within the archive. The components that belong to a specific sample are aggregated by filename without extension (for the specifics of the extensions, see the description of the ext parameter below). Note that samples whose filenames start with a dot will not be loaded, nor will entries that are not regular files.
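
For example, an archive with the following entries (the file names are illustrative) contains two samples, each aggregating two components with the extensions left.png and right.jpg:

sample_0.left.png
sample_0.right.jpg
sample_1.left.png
sample_1.right.jpg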

In addition to the tar archive with data, each archive should come with a corresponding index file. The index file can be generated using a dedicated script:

<path_to_dali>/tools/wds2idx.py <path_to_archive> <path_to_index_file>

If the index file is not provided, it will be automatically inferred from the tar file. Keep in mind, though, that inferring the index adds considerable startup time for big datasets.
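
For example, for the hypothetical archive above:

<path_to_dali>/tools/wds2idx.py data/train.tar data/train.idx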

The format of the index file is:

v1.2 <num_samples>
<component1_ext> <component1_data_offset> <component1_size> <component2_ext> <component2_data_offset> <component2_size> ...
...
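
For example, an index describing the two-sample archive above could look as follows (the offsets and sizes are illustrative):

v1.2 2
left.png 1536 51200 right.jpg 53248 38912
left.png 93696 51200 right.jpg 145408 38912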

Based on https://github.com/webdataset/webdataset

Supported backends
  • ‘cpu’
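
Example usage

The following is a minimal sketch of a pipeline using this reader; the archive and index paths, the extensions, and the decoding step are assumptions to be adapted to the actual dataset:

from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def wds_pipeline():
    # One output per extension set in ext: here, encoded images and labels.
    jpegs, labels = fn.readers.webdataset(
        paths=["data/train.tar"],        # hypothetical archive path
        index_paths=["data/train.idx"],  # hypothetical index path
        ext=["jpg", "cls"],
        random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    return images, labels

pipe = wds_pipeline()
pipe.build()
images, labels = pipe.run()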

Keyword Arguments:
  • ext (str or list of str) –

    The extension sets for each of the outputs produced.

    The number of extension sets determines the number of outputs of the reader. A component's extension is the text after the first dot in its filename (samples whose names start with a dot are excluded). The alternative extensions within a set should be separated with a semicolon (‘;’) and may themselves contain dots.

    Example: “left.png;right.jpg”
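
    For instance, ext=["left.png;right.jpg", "cls"] (the extensions are illustrative) defines two outputs: the first is filled from a component with either the left.png or the right.jpg extension, and the second from the cls component.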

  • paths (str or list of str) –

    The list of (one or more) paths to the webdataset archives.

    Has to be the same length as the index_paths argument, if that argument is provided.

  • bytes_per_sample_hint (int or list of int, optional, default = [0]) –

    Output size hint, in bytes per sample.

    If specified, the operator’s outputs residing in GPU or page-locked host memory will be preallocated to accommodate a batch of samples of this size.

  • case_sensitive_extensions (bool, optional, default = True) –

    Determines whether the extensions provided via the ext argument are case-sensitive.

    When turned off, this allows mixing cases in the ext argument as well as in the webdataset container; for example, jpg, JPG, and jPG will all match.

    If the extension characters cannot be represented in ASCII, the behavior when this option is turned off is undefined.

  • dont_use_mmap (bool, optional, default = False) –

    If set to True, the Loader will use plain file I/O instead of trying to map the file in memory.

    Mapping provides a small performance benefit when accessing a local file system, but most network file systems do not provide optimum performance with it.

  • dtypes (DALIDataType or list of DALIDataType, optional) –

    Data types of the respective outputs.

    The default output data type is UINT8 for all outputs. However, if set, a data type must be specified for each output. Moreover, the tar file should be constructed so that the byte size of each component is divisible by the size of its data type.
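
    A sketch, assuming a hypothetical archive whose samples contain a raw float32 component stored under the extension bin:

      import nvidia.dali.types as types
      from nvidia.dali import pipeline_def, fn

      @pipeline_def(batch_size=8, num_threads=2, device_id=None)
      def raw_pipeline():
          # The byte size of each bin component must be divisible by 4 (float32).
          return fn.readers.webdataset(paths=["data/train.tar"],
                                       ext="bin", dtypes=types.FLOAT)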

  • index_paths (str or list of str, optional, default = []) –

    The list of the index files corresponding to the respective webdataset archives.

    Has to be the same length as the paths argument. If not provided, the indices will be inferred automatically from the webdataset archives.

  • initial_fill (int, optional, default = 1024) –

    Size of the buffer that is used for shuffling.

    If random_shuffle is False, this parameter is ignored.

  • lazy_init (bool, optional, default = False) – Parse and prepare the dataset metadata only during the first run instead of in the constructor.

  • missing_component_behavior (str, optional, default = ‘’) –

    Specifies what to do when a sample lacks a component corresponding to one of the outputs.

    Possible behaviors:
    • "empty" (default) - the output that was not matched will contain an empty tensor

    • "skip" - the entire sample will be skipped (no performance penalty except for reduced caching of the archive)

    • "error" - an exception will be raised and the execution stops
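
    For example, with ext=["jpg", "cls"] and missing_component_behavior="skip", a sample that contains only a jpg component is dropped entirely instead of producing an empty cls output.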

  • num_shards (int, optional, default = 1) –

    Partitions the data into the specified number of parts (shards).

    This is typically used for multi-GPU or multi-node training.
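
    A multi-process sharding sketch; rank and world_size are assumptions that would normally come from the training launcher (e.g. torch.distributed or MPI):

      from nvidia.dali import pipeline_def, fn

      rank, world_size = 0, 2  # hypothetical values; one pipeline per process

      @pipeline_def(batch_size=16, num_threads=4, device_id=None)
      def sharded_pipeline():
          # Each process reads a disjoint 1/world_size portion of the dataset.
          return fn.readers.webdataset(paths=["data/train.tar"], ext="jpg",
                                       shard_id=rank, num_shards=world_size)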

  • pad_last_batch (bool, optional, default = False) –

    If set to True, pads the shard by repeating the last sample.

    Note

    If the number of batches differs across shards, this option can cause an entire batch of repeated samples to be added to the dataset.

  • prefetch_queue_depth (int, optional, default = 1) –

    Specifies the number of batches to be prefetched by the internal Loader.

    This value should be increased when the pipeline is CPU-stage bound, trading memory consumption for better interleaving with the Loader thread.

  • preserve (bool, optional, default = False) – Prevents the operator from being removed from the graph even if its outputs are not used.

  • random_shuffle (bool, optional, default = False) –

    Determines whether to randomly shuffle data.

    A prefetch buffer with a size equal to initial_fill is used to read data sequentially, and then samples are selected randomly to form a batch.
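
    For example, with random_shuffle=True and initial_fill=4096 (an illustrative value), the reader maintains a 4096-sample buffer that is filled sequentially from the archives and from which the samples of each batch are drawn at random.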

  • read_ahead (bool, optional, default = False) –

    Determines whether the accessed data should be read ahead.

    For large files such as LMDB, RecordIO, or TFRecord, this argument slows down the first access but decreases the time of all of the following accesses.

  • seed (int, optional, default = -1) –

    Random seed.

    If not provided, it will be populated based on the global seed of the pipeline.

  • shard_id (int, optional, default = 0) – Index of the shard to read.

  • skip_cached_images (bool, optional, default = False) –

    If set to True, loading data will be skipped when the sample is present in the decoder cache.

    In this case, the output of the loader will be empty.

  • stick_to_shard (bool, optional, default = False) –

    Determines whether the reader should stick to a data shard instead of going through the entire dataset.

    If decoder caching is used, it significantly reduces the amount of data to be cached, but it might affect the accuracy of the training.

  • tensor_init_bytes (int, optional, default = 1048576) – Hint for how much memory to allocate per image.