The spec file for TTS using HiFiGAN includes the trainer, model, training_dataset, validation_dataset, and prior_folder.

The following is a shortened example of a spec file for training HiFiGAN on the LJSpeech dataset.

    train_dataset: ???
    validation_dataset: ???

    training_ds:
      dataset:
        _target_: "nemo.collections.tts.data.datalayers.AudioDataset"
        manifest_filepath: ${train_dataset}
        max_duration: null
        min_duration: 0.75
        n_segments: 8192
        trim: false
      dataloader_params:
        drop_last: false
        shuffle: true
        batch_size: 16
        num_workers: 4

    validation_ds:
      dataset:
        _target_: "nemo.collections.tts.data.datalayers.AudioDataset"
        manifest_filepath: ${validation_dataset}
        max_duration: null
        min_duration: null
        n_segments: -1
        trim: false
      dataloader_params:
        drop_last: false
        shuffle: false
        batch_size: 16
        num_workers: 1

    model:
      preprocessor:
        _target_: nemo.collections.asr.parts.preprocessing.features.FilterbankFeatures
        dither: 0.0
        frame_splicing: 1
        nfilt: 80
        highfreq: 8000
        log: true
        log_zero_guard_type: clamp
        log_zero_guard_value: 1e-05
        lowfreq: 0
        mag_power: 1.0
        n_fft: 1024
        n_window_size: 1024
        n_window_stride: 256
        normalize: null
        pad_to: 0
        pad_value: -11.52
        preemph: null
        sample_rate: 22050
        window: hann
        use_grads: false
        exact_pad: true

      generator:
        _target_: nemo.collections.tts.modules.hifigan_modules.Generator
        resblock: 1
        upsample_rates: [8,8,2,2]
        upsample_kernel_sizes: [16,16,4,4]
        upsample_initial_channel: 512
        resblock_kernel_sizes: [3,7,11]
        resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]

      optim:
        _target_: torch.optim.AdamW
        lr: 0.0002
        betas: [0.8, 0.99]

        sched:
          name: CosineAnnealing
          min_lr: 1e-5
          warmup_ratio: 0.02

      max_steps: ${trainer.max_steps}
      l1_loss_factor: 45
      denoise_strength: 0.0025

    trainer:
      max_steps: 25000

The specification can be roughly grouped into three categories:

Parameters to configure the trainer

Parameters that describe the model

Pointers to the training and validation dataset

This specification can be used with the tao vocoder train command.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, if you would like to change the validation batch size, you can add validation_ds.dataloader_params.batch_size=1 to your command, which would override the batch size of 16 in the configuration shown above. An example of this is shown in the training instructions below.
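As a rough illustration of what such an override does, the sketch below applies a dotted key=value string to a nested dictionary that mirrors part of the spec above. The helper apply_override is hypothetical and written only for this example; it is not the TAO implementation, and value parsing is simplified to JSON-style literals.

```python
import json


def apply_override(config: dict, override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place (hypothetical helper)."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node[name]  # raises KeyError if the path is not in the spec
    try:
        value = json.loads(raw_value)  # parses ints, floats, true/false, null
    except json.JSONDecodeError:
        value = raw_value  # anything else is kept as a plain string
    node[leaf] = value


# A fragment of the example spec, written as a Python dict.
spec = {
    "validation_ds": {
        "dataloader_params": {
            "drop_last": False,
            "shuffle": False,
            "batch_size": 16,
            "num_workers": 1,
        }
    }
}

apply_override(spec, "validation_ds.dataloader_params.batch_size=1")
print(spec["validation_ds"]["dataloader_params"]["batch_size"])
```

Only the addressed leaf changes; the other dataloader settings keep the values from the spec file.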

The following parameter is used to configure the trainer element of the HiFiGAN vocoder:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| max_steps | int | The maximum number of steps to train the model. This is a field for the trainer parameter. Unlike the FastPitch spectrogram generator, the HiFiGAN trainer uses max_steps to specify the duration of training. | >0 |

Note: NVIDIA suggests setting trainer.max_steps to at least 10000 to train a good model.





The parameters that configure the HiFiGAN model are included in the model element. This includes global parameters for the model object and optional parameters to configure the following subcomponents:

preprocessor

generator

optimizer

scheduler

The global parameters include the following:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| max_steps | int | Specifies the maximum number of steps to train the model. This is a field for the trainer parameter. Unlike the FastPitch spectrogram generator, the HiFiGAN trainer uses max_steps to set the duration of training. | Derived from trainer.max_steps |
| l1_loss_factor | float | The multiplicative factor for the L1 loss used in training | |
| denoise_strength | float | The small denoising factor, currently only used in validation | |

Preprocessor

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| dither | float | The amount of white-noise dithering | |
| frame_splicing | int | The number of spectrogram frames per model step | |
| highfreq | int | The upper bound on the Mel basis in Hz | |
| log | bool | Specifies whether to take the logarithm of the spectrogram | |
| log_zero_guard_type | | The guard against taking the log of zero. There are two options: add and clamp. | |
| log_zero_guard_value | float/str | The guard types require a number to add or clamp to. The guard value can be a float, “tiny”, or “eps”. torch.finfo is used if “tiny” or “eps” is passed. | |
| lowfreq | int | The lower bound on the Mel basis in Hz | |
| mag_power | int | The power that the linear spectrogram is raised to prior to multiplication with the Mel basis | |
| n_fft | int | The size of the FFT in samples | |
| n_window_size | int | The size of the window for the FFT in samples | |
| n_window_stride | int | The stride of the window for the FFT | |
| normalize | | Normalization can be ‘per_feature’ or ‘all_features’; all other options disable feature normalization. all_features normalizes the entire spectrogram to be mean 0 with std 1. per_feature normalizes per channel/freq instead. | |
| pad_to | int | Ensures that the output size of the time dimension is a multiple of pad_to | |
| pad_value | float | The value that the shorter Mels are padded with | |
| preemph | | The amount of pre-emphasis to add to the audio. This can be disabled by setting the value to None. | |
| sample_rate | int | The target sample rate to load the audio, in Hz | |
| window | string | The windowing function for the FFT, which can be one of the following: ‘hann’, ‘hamming’, ‘blackman’, ‘bartlett’ | |
| use_grads | bool | Specifies whether to allow gradients to pass through this module | |
| exact_pad | bool | Specifies whether to pad the input signal such that the output length is exactly the input length // 4 | |
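The two log_zero_guard_type options can be pictured with a small sketch. The function guarded_log below is a hypothetical stand-in written for this example and operates on a single float; it is not the preprocessor's own code, which works on tensors. It assumes that add means adding the guard value before the log and clamp means raising any smaller value up to the guard value.

```python
import math


def guarded_log(value: float, guard_type: str = "clamp", guard_value: float = 1e-5) -> float:
    """Take a log that cannot hit log(0) (illustrative only)."""
    if guard_type == "add":
        # Shift every value up by the guard before taking the log.
        return math.log(value + guard_value)
    if guard_type == "clamp":
        # Leave large values untouched; raise small ones to the guard.
        return math.log(max(value, guard_value))
    raise ValueError(f"unknown guard type: {guard_type!r}")


# A silent bin (energy 0.0) under each guard type.
print(guarded_log(0.0, "clamp"))
print(guarded_log(0.0, "add"))
# A bin with energy 1.0: clamp leaves it exact, add nudges it slightly.
print(guarded_log(1.0, "clamp"))
print(guarded_log(1.0, "add"))
```

With a guard value of 1e-05, a silent bin maps to log(1e-05), roughly -11.51, which is close to the pad_value of -11.52 used in the example spec.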

Generator

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| resblock | int | The type of residual block to use. See the HiFiGAN paper for details. | 1, 2 |
| upsample_rates | array of 4 integers | How much each layer upsamples the input. The product of all numbers in the array must be equal to n_window_stride. | |
| upsample_kernel_sizes | array of 4 integers | The kernel size for each upsampling layer | |
| upsample_initial_channel | int | The first hidden dimension of the layer | |
| resblock_kernel_sizes | array of 3 integers | The kernel sizes of the residual blocks | |
| resblock_dilation_sizes | array of 3 arrays of 3 integers | The dilation sizes of the residual blocks | |
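The constraint between upsample_rates and n_window_stride can be checked before launching a run. The following is a minimal sketch; check_upsampling is a hypothetical helper for this example and not part of TAO or NeMo.

```python
import math


def check_upsampling(upsample_rates, upsample_kernel_sizes, n_window_stride):
    """Validate the generator upsampling settings against the preprocessor stride."""
    if len(upsample_rates) != len(upsample_kernel_sizes):
        raise ValueError("expected one kernel size per upsampling layer")
    total = math.prod(upsample_rates)
    if total != n_window_stride:
        raise ValueError(
            f"product of upsample_rates is {total}, "
            f"but n_window_stride is {n_window_stride}"
        )
    return total


# Values from the example spec: 8 * 8 * 2 * 2 = 256 = n_window_stride.
total = check_upsampling([8, 8, 2, 2], [16, 16, 4, 4], n_window_stride=256)
print(total)
```

The check reflects that the generator has to turn each spectrogram frame back into n_window_stride audio samples, so the per-layer upsampling factors must multiply to that stride.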

The datasets that you use should be specified by <xyz>_ds parameters, depending on the use case:

For training using tao vocoder train, you should have training_ds to describe your training dataset and validation_ds to describe your validation dataset.

Each <xyz>_ds config contains two main groups of configuration:

dataset : The configuration component describing the dataset

dataloader_params : The configuration component describing the dataloader

The configurable fields for the dataset field are described in the following table:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| manifest_filepath | string | The filepath to the manifest (.json file) that describes the audio data | Valid filepaths |
| n_segments | int | The length of the audio to load, in samples. For example, given a sampling rate of 16 kHz and n_segments=16000, a random 1 second of audio from the clip will be loaded. The section is sampled randomly every time the audio is batched. This can be set to -1 to load the entire audio. | > 0, or -1 |
| max_duration | float | If the audio exceeds this length in seconds, it is filtered from the dataset. | |
| min_duration | float | If the audio is less than this length in seconds, it is filtered from the dataset. | |
| trim | bool | Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). The default value is False. | True/False |
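The effect of n_segments, min_duration, and max_duration can be sketched in a few lines. The helpers keep_clip and pick_segment are hypothetical and written for this example; they are not the AudioDataset implementation. The sketch assumes that a clip shorter than n_segments is returned whole, which the table above does not specify.

```python
import random


def keep_clip(duration, min_duration=None, max_duration=None):
    """Return True if a clip of the given length in seconds passes the filters."""
    if min_duration is not None and duration < min_duration:
        return False
    if max_duration is not None and duration > max_duration:
        return False
    return True


def pick_segment(num_samples, n_segments, rng):
    """Return (start, end) sample indices of the section to load."""
    # -1 loads the entire audio; short clips are returned whole (assumption).
    if n_segments == -1 or num_samples <= n_segments:
        return 0, num_samples
    start = rng.randint(0, num_samples - n_segments)  # inclusive bounds
    return start, start + n_segments


rng = random.Random(0)
# A 3-second clip at 16 kHz; n_segments=16000 selects a random 1-second section.
start, end = pick_segment(num_samples=48000, n_segments=16000, rng=rng)
print(end - start)

# With min_duration: 0.75, as in training_ds, a 0.5-second clip is dropped.
print(keep_clip(0.5, min_duration=0.75))
```

Because the start index is drawn again on every call, the same clip contributes a different section each time it is batched.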

Dataloader