nvidia.dali.fn.readers.nemo_asr(*inputs, **kwargs)

Reads automatic speech recognition (ASR) data (audio, text) from an NVIDIA NeMo compatible manifest.

Example manifest file:

{"audio_filepath": "path/to/audio1.wav", "duration": 3.45, "text": "this is a nemo tutorial"}
{"audio_filepath": "path/to/audio1.wav", "offset": 3.45, "duration": 1.45, "text": "same audio file but using offset"}
{"audio_filepath": "path/to/audio2.wav", "duration": 3.45, "text": "third transcript in this example"}


Note

Only the audio_filepath field is mandatory. If duration is not specified, the whole audio file will be used. A missing text field will produce an empty string as the text.

Warning

Handling of duration and offset fields is not yet implemented. The current implementation always reads the whole audio file.
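A manifest in this layout is a plain text file with one JSON object per line. The following is a minimal sketch of writing and reading such a file with the Python standard library only. The helper names (write_manifest, read_manifest, demo) are hypothetical and are not part of DALI or NeMo; the entries are the ones from the example above.

```python
import json
import os
import tempfile

# Entries copied from the example manifest above.
EXAMPLE_ENTRIES = [
    {"audio_filepath": "path/to/audio1.wav", "duration": 3.45,
     "text": "this is a nemo tutorial"},
    {"audio_filepath": "path/to/audio1.wav", "offset": 3.45, "duration": 1.45,
     "text": "same audio file but using offset"},
    {"audio_filepath": "path/to/audio2.wav", "duration": 3.45,
     "text": "third transcript in this example"},
]


def write_manifest(path, entries):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            # audio_filepath is the only mandatory field.
            if "audio_filepath" not in entry:
                raise ValueError("every entry needs an 'audio_filepath' field")
            f.write(json.dumps(entry) + "\n")


def read_manifest(path):
    """Parse a manifest back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo():
    """Round-trip the example entries through a temporary manifest file."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        manifest_path = os.path.join(tmp_dir, "manifest.json")
        write_manifest(manifest_path, EXAMPLE_ENTRIES)
        return read_manifest(manifest_path)
```

The resulting file path is what would be passed to the reader through manifest_filepaths.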

This reader produces between 1 and 4 outputs:

• Decoded audio data: float, shape=(audio_length,)

• (optional, if read_sample_rate=True) Audio sample rate: float, shape=(1,)

• (optional, if read_text=True) Transcript text as a null-terminated string: uint8, shape=(text_len + 1,)

• (optional, if read_idxs=True) Index of the manifest entry: int64, shape=(1,)
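Because the transcript arrives as a null-terminated uint8 buffer, it usually has to be converted back to a string on the Python side. The sketch below shows one way to do that, together with a small helper that lists which outputs to expect for a given combination of flags, in the order they are listed above. Both function names are hypothetical, and the UTF-8 encoding is an assumption that the text above does not state.

```python
def expected_outputs(read_sample_rate=True, read_text=True, read_idxs=False):
    """Names of the reader outputs, in the order listed above (hypothetical helper)."""
    names = ["audio"]
    if read_sample_rate:
        names.append("sample_rate")
    if read_text:
        names.append("text")
    if read_idxs:
        names.append("index")
    return names


def decode_transcript(text_output):
    """Turn a null-terminated sequence of uint8 values into a Python str.

    Accepts any iterable of small integers, for example a NumPy uint8 array
    or a plain list. UTF-8 is assumed here, not confirmed by the reference.
    """
    raw = bytes(bytearray(int(b) for b in text_output))
    # Keep only the bytes before the first null terminator.
    return raw.split(b"\x00", 1)[0].decode("utf-8")
```

For example, a buffer holding the bytes of "hello" followed by a single 0 has shape (6,) and decodes to the five-character string.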

Supported backends
• ‘cpu’

Keyword Arguments
• manifest_filepaths (str or list of str) – List of paths to NeMo-compatible manifest files.

• bytes_per_sample_hint (int or list of int, optional, default = [0]) –

Output size hint, in bytes per sample.

If specified, the operator’s outputs residing in GPU or page-locked host memory will be preallocated to accommodate a batch of samples of this size.

• dont_use_mmap (bool, optional, default = False) –

If set to True, the Loader will use plain file I/O instead of trying to map the file in memory.

Mapping provides a small performance benefit when accessing a local file system, but most network file systems do not provide optimum performance with it.

• downmix (bool, optional, default = True) – If True, downmixes all input channels to mono. If downmixing is turned on, the decoder always produces 1-D output.

• dtype (nvidia.dali.types.DALIDataType, optional, default = DALIDataType.FLOAT) –

Output data type.

Supported types: INT16, INT32, and FLOAT.

• initial_fill (int, optional, default = 1024) –

Size of the buffer that is used for shuffling.

If random_shuffle is False, this parameter is ignored.

• lazy_init (bool, optional, default = False) – Parse and prepare the dataset metadata only during the first run instead of in the constructor.

• max_duration (float, optional, default = 0.0) –

If a value greater than 0 is provided, it specifies the maximum allowed duration, in seconds, of the audio samples.

Samples with a duration longer than this value will be ignored.

• min_duration (float, optional, default = 0.0) –

If a value greater than 0 is provided, it specifies the minimum allowed duration, in seconds, of the audio samples.

Samples with a duration shorter than this value will be ignored.
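The effect of max_duration and min_duration can be previewed on a parsed manifest before a pipeline is built. This is a minimal sketch of the filtering rule as described above, not the reader's actual implementation. The function name is hypothetical, and keeping entries that have no duration field is an assumption.

```python
def filter_by_duration(entries, min_duration=0.0, max_duration=0.0):
    """Drop manifest entries outside the allowed duration range.

    A bound of 0 (the default) disables that side of the check.
    Entries without a 'duration' field are kept (assumption).
    """
    kept = []
    for entry in entries:
        duration = entry.get("duration")
        if duration is not None:
            if max_duration > 0 and duration > max_duration:
                continue  # longer than the maximum allowed duration
            if min_duration > 0 and duration < min_duration:
                continue  # shorter than the minimum allowed duration
        kept.append(entry)
    return kept


# Durations taken from the example manifest: 3.45 s, 1.45 s, 3.45 s.
entries = [
    {"audio_filepath": "path/to/audio1.wav", "duration": 3.45},
    {"audio_filepath": "path/to/audio1.wav", "offset": 3.45, "duration": 1.45},
    {"audio_filepath": "path/to/audio2.wav", "duration": 3.45},
]
long_enough = filter_by_duration(entries, min_duration=2.0)
short_enough = filter_by_duration(entries, max_duration=2.0)
```

With these entries, a 2-second minimum keeps the two 3.45-second samples, and a 2-second maximum keeps only the 1.45-second one.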

• normalize_text (bool) –

Warning

The argument normalize_text is no longer used and will be removed in a future release.

• num_shards (int, optional, default = 1) –

Partitions the data into the specified number of parts (shards).

This is typically used for multi-GPU or multi-node training.

• pad_last_batch (bool, optional, default = False) –

If set to True, pads the shard by repeating the last sample.

Note

If the number of batches differs across shards, this option can cause an entire batch of repeated samples to be added to the dataset.
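The note on pad_last_batch is easier to follow with concrete numbers. The sketch below splits a dataset into contiguous shards and counts the batches each shard needs. The boundary formula used here (floor of size * shard_id / num_shards) is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def shard_range(dataset_size, shard_id, num_shards):
    """Half-open [start, end) sample range of one shard (assumed formula)."""
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    return start, end


def batches_per_shard(dataset_size, num_shards, batch_size):
    """Number of batches needed to cover each shard, last batch possibly partial."""
    counts = []
    for shard_id in range(num_shards):
        start, end = shard_range(dataset_size, shard_id, num_shards)
        counts.append(math.ceil((end - start) / batch_size))
    return counts


# 10 samples, 3 shards, batch size 3: shard sizes are 3, 3 and 4.
counts = batches_per_shard(dataset_size=10, num_shards=3, batch_size=3)
```

Under these assumptions the third shard needs two batches while the other two need one. For every shard to run the same number of iterations, the first two shards would each have to produce one additional batch made only of repeated samples, which is the situation the note describes.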

• prefetch_queue_depth (int, optional, default = 1) –

Specifies the number of batches to be prefetched by the internal Loader.

This value should be increased when the pipeline is CPU-stage bound, trading memory consumption for better interleaving with the Loader thread.

• preserve (bool, optional, default = False) – Prevents the operator from being removed from the graph even if its outputs are not used.

• quality (float, optional, default = 50.0) –

Resampling quality, 0 is lowest, 100 is highest.

0 corresponds to 3 lobes of the sinc filter; 50 gives 16 lobes and 100 gives 64 lobes.

• random_shuffle (bool, optional, default = False) –

Determines whether to randomly shuffle data.

A prefetch buffer with a size equal to initial_fill is used to read data sequentially, and then samples are selected randomly to form a batch.

For large files such as LMDB, RecordIO, or TFRecord, this argument slows down the first access but decreases the time of all of the following accesses.
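Shuffling through a fixed-size buffer can be sketched in a few lines. This is a generic illustration of the technique, assuming that each randomly picked sample is replaced by the next sequentially read one; it is not DALI's actual loader code, and the function name is hypothetical.

```python
import random
from itertools import islice


def buffered_shuffle(samples, initial_fill, seed=0):
    """Yield samples in a partially shuffled order using a fixed-size buffer."""
    if initial_fill < 1:
        raise ValueError("initial_fill must be at least 1")
    rng = random.Random(seed)
    source = iter(samples)
    # Fill the buffer by reading sequentially.
    buffer = list(islice(source, initial_fill))
    while buffer:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        try:
            # Refill the freed slot with the next sequential sample.
            buffer[idx] = next(source)
        except StopIteration:
            # Source exhausted: shrink the buffer instead.
            buffer[idx] = buffer[-1]
            buffer.pop()


shuffled = list(buffered_shuffle(range(10), initial_fill=4, seed=123))
```

Every sample is produced exactly once, but a sample can only move forward in the order by a limited amount: the first output always comes from the first initial_fill samples. A buffer of size 1 leaves the order unchanged, and a larger buffer gives a more thorough shuffle at the cost of memory.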

• read_idxs (bool, optional, default = False) –

Whether to output the indices of samples as they occur in the manifest file as a separate output.

• read_sample_rate (bool, optional, default = True) – Whether to output the sample rate for each sample as a separate output.

• read_text (bool, optional, default = True) – Whether to output the transcript text for each sample as a separate output.

• sample_rate (float, optional, default = -1.0) – If specified, the target sample rate, in Hz, to which the audio is resampled.

• seed (int, optional, default = -1) –

Random seed.

If not provided, it will be populated based on the global seed of the pipeline.

• shard_id (int, optional, default = 0) – Index of the shard to read.

• shuffle_after_epoch (bool, optional, default = False) – If True, the reader shuffles the whole dataset after each epoch.

• skip_cached_images (bool, optional, default = False) –

If set to True, loading the data is skipped when the sample is in the decoder cache.

In this case, the output of the loader will be empty.

• stick_to_shard (bool, optional, default = False) –

Determines whether the reader should stick to a data shard instead of going through the entire dataset.

If decoder caching is used, it significantly reduces the amount of data to be cached, but might affect accuracy of the training.

• tensor_init_bytes (int, optional, default = 1048576) – Hint for how much memory to allocate per image.
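Putting several of the keyword arguments together, a CPU-only pipeline around this reader might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the helper names, the 16 kHz target rate, the 16.7-second limit, the thread count and the manifest path are all hypothetical, and the DALI import is deferred so that the argument helper can be used without DALI installed.

```python
def reader_kwargs(manifest_path, shard_id=0, num_shards=1):
    """Keyword arguments for fn.readers.nemo_asr, using only names documented above."""
    return dict(
        manifest_filepaths=[manifest_path],
        read_sample_rate=True,   # second output
        read_text=True,          # third output
        read_idxs=False,
        sample_rate=16000.0,     # hypothetical resampling target, in Hz
        downmix=True,            # always produce 1-D (mono) audio
        max_duration=16.7,       # hypothetical limit, in seconds
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=True,
        initial_fill=1024,
        pad_last_batch=True,
    )


def build_pipeline(manifest_path, batch_size=8):
    """Build a CPU-only pipeline that returns (audio, sample_rate, text)."""
    # Deferred import: this function requires NVIDIA DALI to be installed.
    from nvidia.dali import fn, pipeline_def

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=None)
    def asr_pipeline():
        audio, sample_rate, text = fn.readers.nemo_asr(
            **reader_kwargs(manifest_path)
        )
        return audio, sample_rate, text

    pipe = asr_pipeline()
    # Explicit build, kept for releases that require it before run().
    pipe.build()
    return pipe
```

A call such as build_pipeline("path/to/manifest.json").run() would then return one batch of each of the three outputs, provided the manifest and the audio files it references exist.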