Vocoder - NVIDIA Docs

A vocoder is a model that generates audio from a Mel spectrogram. HiFiGAN is a generative adversarial network (GAN) vocoder whose generator uses transposed convolutions to upsample Mel spectrograms to audio.

The following tasks have been implemented for HiFiGAN in the TAO Toolkit:

download_specs
dataset_convert
train
infer
export

Downloading Sample Spec Files

The example specification files for all the tasks associated with the vocoder component of TTS can be downloaded using the following command:

tao vocoder download_specs \
    -o <target_path> \
    -r <results_path>

Required Arguments

-o: The target path where the spec files will be stored
-r: The results and output log directory

Preparing the Dataset

The vocoder for the TAO Toolkit implements the dataset_convert task to convert and prepare datasets that follow the LJSpeech dataset format.

The data format and the instructions for consuming the data in TAO Toolkit are identical to those for the dataset_convert task under Spectrogram Generator.

Creating an Experiment Spec File

The spec file for TTS using HiFiGAN includes the trainer, model, training_ds, and validation_ds sections, along with the train_dataset and validation_dataset manifest paths.

The following is a shortened example of a spec file for training HiFiGAN on the LJSpeech dataset.

train_dataset: ???
validation_dataset: ???

training_ds:
  dataset:
    _target_: "nemo.collections.tts.data.datalayers.AudioDataset"
    manifest_filepath: ${train_dataset}
    max_duration: null
    min_duration: 0.75
    n_segments: 8192
    trim: false
  dataloader_params:
    drop_last: false
    shuffle: true
    batch_size: 16
    num_workers: 4

validation_ds:
  dataset:
    _target_: "nemo.collections.tts.data.datalayers.AudioDataset"
    manifest_filepath: ${validation_dataset}
    max_duration: null
    min_duration: null
    n_segments: -1
    trim: false
  dataloader_params:
    drop_last: false
    shuffle: false
    batch_size: 16
    num_workers: 1

model:
  preprocessor:
    _target_: nemo.collections.asr.parts.preprocessing.features.FilterbankFeatures
    dither: 0.0
    frame_splicing: 1
    nfilt: 80
    highfreq: 8000
    log: true
    log_zero_guard_type: clamp
    log_zero_guard_value: 1e-05
    lowfreq: 0
    mag_power: 1.0
    n_fft: 1024
    n_window_size: 1024
    n_window_stride: 256
    normalize: null
    pad_to: 0
    pad_value: -11.52
    preemph: null
    sample_rate: 22050
    window: hann
    use_grads: false
    exact_pad: true

  generator:
    _target_: nemo.collections.tts.modules.hifigan_modules.Generator
    resblock: 1
    upsample_rates: [8,8,2,2]
    upsample_kernel_sizes: [16,16,4,4]
    upsample_initial_channel: 512
    resblock_kernel_sizes: [3,7,11]
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]

  optim:
    _target_: torch.optim.AdamW
    lr: 0.0002
    betas: [0.8, 0.99]

  sched:
    name: CosineAnnealing
    min_lr: 1e-5
    warmup_ratio: 0.02

  max_steps: ${trainer.max_steps}
  l1_loss_factor: 45
  denoise_strength: 0.0025

trainer:
  max_steps: 25000

The specification can be roughly grouped into three categories:

Parameters to configure the trainer
Parameters that describe the model
Pointers to the training and validation dataset

This specification can be used with the tao vocoder train command.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, if you would like to change the validation batch size, you can add validation_ds.dataloader_params.batch_size=1 to your command, which would override the batch size of 16 in the configuration shown above. An example of this is shown in the training instructions below.
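
A command-line override can be pictured as setting one leaf of a nested dictionary by its dotted path. The following is a minimal plain-Python sketch of that idea, not the TAO Toolkit implementation (the inference log later on this page indicates that the toolkit builds its configuration with Hydra). The helper name apply_override and its value-parsing rules are hypothetical.

```python
import copy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one `dotted.key=value` override applied.

    Hypothetical helper for illustration only; the real toolkit handles
    overrides through its own configuration system.
    """
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")

    updated = copy.deepcopy(config)
    node = updated
    for name in parents:
        # Create intermediate sections that the spec file left out.
        node = node.setdefault(name, {})

    # Interpret the value as int or float when possible, else keep the string.
    for cast in (int, float):
        try:
            node[leaf] = cast(raw_value)
            break
        except ValueError:
            continue
    else:
        node[leaf] = raw_value
    return updated


spec = {"validation_ds": {"dataloader_params": {"batch_size": 16, "shuffle": False}}}
overridden = apply_override(spec, "validation_ds.dataloader_params.batch_size=1")

print(overridden["validation_ds"]["dataloader_params"]["batch_size"])  # 1
print(spec["validation_ds"]["dataloader_params"]["batch_size"])        # 16, the spec itself is untouched
```

The spec on disk stays unchanged; only the configuration used for that run differs.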

Configuring the Trainer

The following parameter is used to configure the trainer element of the vocoder:

Parameter	Datatype	Description	Supported Values
`max_steps`	int	The maximum number of steps to train the model. This is a field for the `trainer` parameter. Unlike the FastPitch spectrogram generator, the HiFiGAN trainer uses max_steps to specify the duration of training.	>0

Note

NVIDIA suggests setting trainer.max_steps to at least 10000 to train a good model.

Configuring the Model

The parameters to help configure the HiFiGAN model are included in the model element. This includes global parameters for the model object and optional parameters to configure the following subcomponents:

preprocessor
generator
optimizer
scheduler

The global parameters include the following:

Parameter	Datatype	Description	Supported Values
`max_steps`	int	Specifies the maximum number of steps to train the model. In the example spec, this field is set by interpolation from the `trainer` parameter (${trainer.max_steps}). Unlike the FastPitch spectrogram generator, the HiFiGAN trainer uses max_steps to set the duration of training.	Derived from trainer.max_steps
`l1_loss_factor`	float	The multiplicative factor for L1 loss used in training
`denoise_strength`	float	The small denoising factor, currently only used in validation

Preprocessor

Parameter	Datatype	Description
`dither`	float	The amount of white-noise dithering
`frame_splicing`	int	The number of spectrogram frames per model step
`highfreq`	int	The upper bound on the Mel basis in Hz
`log`	bool	Specifies whether to log the spectrogram
`log_zero_guard_type`	string	The guard against taking the log of zero. There are two options: `add` and `clamp`.
`log_zero_guard_value`	float/str	The value that the guard type adds or clamps to. The guard value can be a float, “tiny”, or “eps”. torch.finfo is used if “tiny” or “eps” is passed.
`lowfreq`	int	The lower bound on the Mel basis in Hz
`mag_power`	float	The power that the linear spectrogram is raised to prior to multiplication with the Mel basis
`n_fft`	int	The length of the FFT in samples.
`n_window_size`	int	The size of the window for the FFT in samples.
`n_window_stride`	int	The stride of the window for the FFT in samples.
`normalize`	string	Normalization can be ‘per_feature’ or ‘all_features’; all other options disable feature normalization. `all_features` normalizes the entire spectrogram to be mean 0 with std 1. `per_feature` normalizes per channel/freq instead.
`pad_to`	int	Ensures that the output size of the time dimension is a multiple of pad_to.
`pad_value`	float	The value that the shorter Mels are padded with.
`preemph`	float	The amount of pre-emphasis to add to the audio. This can be disabled by setting the value to `None` (`null` in the spec file).
`sample_rate`	int	The target sample rate to load the audio, in Hz.
`window`	string	The windowing function for the FFT, which can be one of the following: ‘hann’, ‘hamming’, ‘blackman’, ‘bartlett’.
`use_grads`	bool	Specifies whether to allow gradients to pass through this module
`exact_pad`	bool	Specifies whether to pad the input signal such that the number of output frames is exactly the input length // n_window_stride
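
To make the preprocessor values more concrete, the following sketch converts the window and stride settings of the example spec from samples to milliseconds and checks that highfreq stays below the Nyquist frequency. It is plain arithmetic on the values shown earlier on this page and does not call any TAO Toolkit or NeMo code.

```python
# Values copied from the model.preprocessor section of the example spec.
sample_rate = 22050      # Hz
n_window_size = 1024     # samples per analysis window
n_window_stride = 256    # samples between consecutive windows (hop)
highfreq = 8000          # Hz, upper edge of the Mel filterbank

window_ms = 1000 * n_window_size / sample_rate
hop_ms = 1000 * n_window_stride / sample_rate
frames_per_second = sample_rate / n_window_stride
nyquist_hz = sample_rate / 2

print(f"window length: {window_ms:.1f} ms")                  # 46.4 ms
print(f"hop length: {hop_ms:.1f} ms")                        # 11.6 ms
print(f"frames per second: {frames_per_second:.1f}")         # 86.1
print(f"highfreq below Nyquist: {highfreq <= nyquist_hz}")   # True
```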

Generator

Parameter	Datatype	Description	Supported Values
`resblock`	int	The type of residual block to use. See the HiFi-GAN paper for details.	1, 2
`upsample_rates`	array of 4 integers	How much each layer upsamples the input. The product of all numbers in the array must be equal to n_window_stride.
`upsample_kernel_sizes`	array of 4 integers	The kernel size for each upsampling layer
`upsample_initial_channel`	int	The first hidden dimension of the layer
`resblock_kernel_sizes`	array of 3 integers	The kernel sizes of the residual blocks
`resblock_dilation_sizes`	array of 3 arrays of 3 integers	The dilation sizes of the residual blocks
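
The constraint on upsample_rates can be checked before launching a run. The sketch below uses the values from the example spec; check_upsampling is a hypothetical helper written for this page, not part of the TAO Toolkit.

```python
import math

upsample_rates = [8, 8, 2, 2]            # model.generator.upsample_rates
upsample_kernel_sizes = [16, 16, 4, 4]   # model.generator.upsample_kernel_sizes
n_window_stride = 256                    # model.preprocessor.n_window_stride


def check_upsampling(rates, kernel_sizes, hop_length):
    """Return the total upsampling factor, or raise ValueError if it is inconsistent."""
    if len(rates) != len(kernel_sizes):
        raise ValueError("upsample_rates and upsample_kernel_sizes must have the same length")
    total = math.prod(rates)
    if total != hop_length:
        raise ValueError(f"product of upsample_rates is {total}, expected {hop_length}")
    return total


samples_per_frame = check_upsampling(upsample_rates, upsample_kernel_sizes, n_window_stride)
print(samples_per_frame)  # 256
```

With these settings, each Mel frame is expanded to 256 audio samples, which matches the stride used to compute the spectrogram.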

Configuring the Dataset

The datasets that you use should be specified by <xyz>_ds parameters, depending on the use case:

For training using tao vocoder train, you should have training_ds to describe your training dataset, and validation_ds to describe your validation dataset.

Each <xyz>_ds config contains two main groups of configuration:

dataset: The configuration component describing the dataset
dataloader_params: The configuration component describing the dataloader

The configurable fields for the dataset field are described in the following table:

Parameter	Datatype	Description	Supported Values
`manifest_filepath`	string	The filepath to the manifest (.json file) that describes the audio data	Valid filepaths.
`n_segments`	int	The length of the audio to load, in samples. For example, given a sampling rate of 16 kHz and n_segments=16000, a random 1 second of audio from the clip will be loaded. The section is sampled randomly every time the audio is batched. This can be set to -1 to load the entire audio.	> 0, or -1
`max_duration`	float	If audio exceeds this length in seconds, it is filtered from the dataset.
`min_duration`	float	If the audio is less than this length in seconds, it is filtered from the dataset.
`trim`	bool	Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). The default value is False.	True/False
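
The behavior of n_segments can be sketched as a random crop. The function below is an illustration of the description in the table, not the dataset class used by the toolkit; load_segment is a hypothetical name, and clips shorter than the window are simply returned unchanged here.

```python
import random


def load_segment(audio, n_segments, rng=random):
    """Return a random window of `n_segments` samples, or everything if n_segments is -1.

    `audio` is any sequence of samples. Clips shorter than the window are
    returned unchanged; how the real dataset handles them is not covered here.
    """
    if n_segments == -1 or len(audio) <= n_segments:
        return audio
    start = rng.randint(0, len(audio) - n_segments)  # inclusive on both ends
    return audio[start:start + n_segments]


sample_rate = 22050
clip = list(range(2 * sample_rate))        # stand-in for a 2-second clip

train_window = load_segment(clip, 8192)    # training_ds.dataset.n_segments
full_clip = load_segment(clip, -1)         # validation_ds.dataset.n_segments

print(len(train_window), len(full_clip))   # 8192 44100
print(f"{8192 / sample_rate:.3f} s per training window")
```

In the example spec, a training window of 8192 samples is about 0.37 seconds at 22050 Hz, which is shorter than the min_duration of 0.75 seconds, so every clip that passes the duration filter is long enough to supply a full window.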

Dataloader

Parameter	Datatype	Description	Supported Values
`num_workers`	integer	The number of worker threads for loading the dataset	2
`shuffle`	bool	Whether to shuffle the data. We recommend True for training data and False for validation.	True/False
`batch_size`	integer	The number of samples in each batch
`drop_last`	bool	Specifies whether to drop the last batch if there aren’t enough samples to fill the batch.	True/False

Training the Model

To train a model from scratch, use the following command:

tao vocoder train \
    -e <experiment_spec> \
    -g <num_gpus> \
    -r /path/to/the/results/directory \
    -k <encryption_key>

As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command line arguments at runtime.

For example, the following command can be used to override the training manifest and validation manifest, the number of steps to train, and the place to save the model checkpoint:

tao vocoder train \
  -e $SPECS_DIR/vocoder/train.yaml \
  -g 1 \
  -k $KEY \
  -r $RESULTS_DIR/vocoder/train \
  train_dataset=$DATA_DIR/ljspeech/ljspeech_train.json \
  validation_dataset=$DATA_DIR/ljspeech/ljspeech_val.json \
  trainer.max_steps=10000

Required Arguments

-e: The experiment specification file to set up training, as in the example given above
-r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here
-k: The key to encrypt the model
Other arguments to override fields in the specification file.

Optional Arguments

-g: The number of GPUs to be used in the training in a multi-GPU scenario. The default value is 1.

Training Procedure

At the start of each training experiment, TAO Toolkit will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, how many hours are in each loaded dataset, and how much of each dataset has been filtered.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
[NeMo W 2021-11-02 23:46:59 exp_manager:332] There was no checkpoint folder at checkpoint_dir :/results/vocoder/train_v2/checkpoints. Training from scratch.
[NeMo I 2021-11-02 23:46:59 exp_manager:220] Experiments will be logged at /results/vocoder/train_v2
[NeMo W 2021-11-02 23:46:59 exp_manager:823] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 10000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2021-11-02 23:46:59 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:240: LightningDeprecationWarning: `ModelCheckpoint(every_n_val_epochs)` is deprecated in v1.4 and will be removed in v1.6. Please use `every_n_epochs` instead.
      rank_zero_deprecation(

[NeMo I 2021-11-02 23:46:59 features:252] PADDING: 0
[NeMo I 2021-11-02 23:46:59 features:269] STFT using torch
[NeMo I 2021-11-02 23:46:59 features:271] STFT using exact pad
[NeMo I 2021-11-02 23:46:59 features:252] PADDING: 0
[NeMo I 2021-11-02 23:46:59 features:269] STFT using torch
[NeMo I 2021-11-02 23:46:59 features:271] STFT using exact pad
[NeMo I 2021-11-02 23:47:01 collections:173] Dataset loaded with 12500 files totalling 22.84 hours
[NeMo I 2021-11-02 23:47:01 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-11-02 23:47:01 collections:173] Dataset loaded with 100 files totalling 0.18 hours
[NeMo I 2021-11-02 23:47:01 collections:174] 0 files were filtered totalling 0.00 hours
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

You should next see a full printout of the number of parameters in each module and submodule, as well as the total number of trainable and non-trainable parameters in the model.

In the following table, the generator module contains 13.9 million parameters and its submodule generator.conv_pre contains 287,000 parameters. The audio_to_melspec_precessor module is listed with no parameters.

    | Name                             | Type                     | Params
--------------------------------------------------------------------------------
0   | audio_to_melspec_precessor       | FilterbankFeatures       | 0
1   | trg_melspec_fn                   | FilterbankFeatures       | 0
2   | generator                        | Generator                | 13.9 M
3   | generator.conv_pre               | Conv1d                   | 287 K
4   | generator.ups                    | ModuleList               | 2.7 M
5   | generator.ups.0                  | ConvTranspose1d          | 2.1 M
6   | generator.ups.1                  | ConvTranspose1d          | 524 K
7   | generator.ups.2                  | ConvTranspose1d          | 33.0 K
8   | generator.ups.3                  | ConvTranspose1d          | 8.3 K
9   | generator.resblocks              | ModuleList               | 11.0 M
10  | generator.resblocks.0            | ModuleList               | 8.3 M
11  | generator.resblocks.0.0          | ResBlock1                | 1.2 M
12  | generator.resblocks.0.0.convs1   | ModuleList               | 591 K
13  | generator.resblocks.0.0.convs1.0 | Conv1d                   | 197 K
14  | generator.resblocks.0.0.convs1.1 | Conv1d                   | 197 K
15  | generator.resblocks.0.0.convs1.2 | Conv1d                   | 197 K
16  | generator.resblocks.0.0.convs2   | ModuleList               | 591 K
17  | generator.resblocks.0.0.convs2.0 | Conv1d                   | 197 K
18  | generator.resblocks.0.0.convs2.1 | Conv1d                   | 197 K
19  | generator.resblocks.0.0.convs2.2 | Conv1d                   | 197 K
20  | generator.resblocks.0.1          | ResBlock1                | 2.8 M
21  | generator.resblocks.0.1.convs1   | ModuleList               | 1.4 M
22  | generator.resblocks.0.1.convs1.0 | Conv1d                   | 459 K
23  | generator.resblocks.0.1.convs1.1 | Conv1d                   | 459 K
24  | generator.resblocks.0.1.convs1.2 | Conv1d                   | 459 K
25  | generator.resblocks.0.1.convs2   | ModuleList               | 1.4 M
26  | generator.resblocks.0.1.convs2.0 | Conv1d                   | 459 K
27  | generator.resblocks.0.1.convs2.1 | Conv1d                   | 459 K
28  | generator.resblocks.0.1.convs2.2 | Conv1d                   | 459 K
29  | generator.resblocks.0.2          | ResBlock1                | 4.3 M
30  | generator.resblocks.0.2.convs1   | ModuleList               | 2.2 M
31  | generator.resblocks.0.2.convs1.0 | Conv1d                   | 721 K
32  | generator.resblocks.0.2.convs1.1 | Conv1d                   | 721 K
33  | generator.resblocks.0.2.convs1.2 | Conv1d                   | 721 K
34  | generator.resblocks.0.2.convs2   | ModuleList               | 2.2 M
35  | generator.resblocks.0.2.convs2.0 | Conv1d                   | 721 K
36  | generator.resblocks.0.2.convs2.1 | Conv1d                   | 721 K
37  | generator.resblocks.0.2.convs2.2 | Conv1d                   | 721 K
38  | generator.resblocks.1            | ModuleList               | 2.1 M
39  | generator.resblocks.1.0          | ResBlock1                | 296 K
40  | generator.resblocks.1.0.convs1   | ModuleList               | 148 K
41  | generator.resblocks.1.0.convs1.0 | Conv1d                   | 49.4 K
42  | generator.resblocks.1.0.convs1.1 | Conv1d                   | 49.4 K
43  | generator.resblocks.1.0.convs1.2 | Conv1d                   | 49.4 K
44  | generator.resblocks.1.0.convs2   | ModuleList               | 148 K
45  | generator.resblocks.1.0.convs2.0 | Conv1d                   | 49.4 K
46  | generator.resblocks.1.0.convs2.1 | Conv1d                   | 49.4 K
47  | generator.resblocks.1.0.convs2.2 | Conv1d                   | 49.4 K
48  | generator.resblocks.1.1          | ResBlock1                | 689 K
49  | generator.resblocks.1.1.convs1   | ModuleList               | 344 K
50  | generator.resblocks.1.1.convs1.0 | Conv1d                   | 114 K
51  | generator.resblocks.1.1.convs1.1 | Conv1d                   | 114 K
52  | generator.resblocks.1.1.convs1.2 | Conv1d                   | 114 K
53  | generator.resblocks.1.1.convs2   | ModuleList               | 344 K
54  | generator.resblocks.1.1.convs2.0 | Conv1d                   | 114 K
55  | generator.resblocks.1.1.convs2.1 | Conv1d                   | 114 K
56  | generator.resblocks.1.1.convs2.2 | Conv1d                   | 114 K
57  | generator.resblocks.1.2          | ResBlock1                | 1.1 M
58  | generator.resblocks.1.2.convs1   | ModuleList               | 541 K
59  | generator.resblocks.1.2.convs1.0 | Conv1d                   | 180 K
60  | generator.resblocks.1.2.convs1.1 | Conv1d                   | 180 K
61  | generator.resblocks.1.2.convs1.2 | Conv1d                   | 180 K
62  | generator.resblocks.1.2.convs2   | ModuleList               | 541 K
63  | generator.resblocks.1.2.convs2.0 | Conv1d                   | 180 K
64  | generator.resblocks.1.2.convs2.1 | Conv1d                   | 180 K
65  | generator.resblocks.1.2.convs2.2 | Conv1d                   | 180 K
66  | generator.resblocks.2            | ModuleList               | 518 K
67  | generator.resblocks.2.0          | ResBlock1                | 74.5 K
68  | generator.resblocks.2.0.convs1   | ModuleList               | 37.2 K
69  | generator.resblocks.2.0.convs1.0 | Conv1d                   | 12.4 K
70  | generator.resblocks.2.0.convs1.1 | Conv1d                   | 12.4 K
71  | generator.resblocks.2.0.convs1.2 | Conv1d                   | 12.4 K
72  | generator.resblocks.2.0.convs2   | ModuleList               | 37.2 K
73  | generator.resblocks.2.0.convs2.0 | Conv1d                   | 12.4 K
74  | generator.resblocks.2.0.convs2.1 | Conv1d                   | 12.4 K
75  | generator.resblocks.2.0.convs2.2 | Conv1d                   | 12.4 K
76  | generator.resblocks.2.1          | ResBlock1                | 172 K
77  | generator.resblocks.2.1.convs1   | ModuleList               | 86.4 K
78  | generator.resblocks.2.1.convs1.0 | Conv1d                   | 28.8 K
79  | generator.resblocks.2.1.convs1.1 | Conv1d                   | 28.8 K
80  | generator.resblocks.2.1.convs1.2 | Conv1d                   | 28.8 K
81  | generator.resblocks.2.1.convs2   | ModuleList               | 86.4 K
82  | generator.resblocks.2.1.convs2.0 | Conv1d                   | 28.8 K
83  | generator.resblocks.2.1.convs2.1 | Conv1d                   | 28.8 K
84  | generator.resblocks.2.1.convs2.2 | Conv1d                   | 28.8 K
85  | generator.resblocks.2.2          | ResBlock1                | 271 K
86  | generator.resblocks.2.2.convs1   | ModuleList               | 135 K
87  | generator.resblocks.2.2.convs1.0 | Conv1d                   | 45.2 K
88  | generator.resblocks.2.2.convs1.1 | Conv1d                   | 45.2 K
89  | generator.resblocks.2.2.convs1.2 | Conv1d                   | 45.2 K
90  | generator.resblocks.2.2.convs2   | ModuleList               | 135 K
91  | generator.resblocks.2.2.convs2.0 | Conv1d                   | 45.2 K
92  | generator.resblocks.2.2.convs2.1 | Conv1d                   | 45.2 K
93  | generator.resblocks.2.2.convs2.2 | Conv1d                   | 45.2 K
94  | generator.resblocks.3            | ModuleList               | 130 K
95  | generator.resblocks.3.0          | ResBlock1                | 18.8 K
96  | generator.resblocks.3.0.convs1   | ModuleList               | 9.4 K
97  | generator.resblocks.3.0.convs1.0 | Conv1d                   | 3.1 K
98  | generator.resblocks.3.0.convs1.1 | Conv1d                   | 3.1 K
99  | generator.resblocks.3.0.convs1.2 | Conv1d                   | 3.1 K
100 | generator.resblocks.3.0.convs2   | ModuleList               | 9.4 K
101 | generator.resblocks.3.0.convs2.0 | Conv1d                   | 3.1 K
102 | generator.resblocks.3.0.convs2.1 | Conv1d                   | 3.1 K
103 | generator.resblocks.3.0.convs2.2 | Conv1d                   | 3.1 K
104 | generator.resblocks.3.1          | ResBlock1                | 43.4 K
105 | generator.resblocks.3.1.convs1   | ModuleList               | 21.7 K
106 | generator.resblocks.3.1.convs1.0 | Conv1d                   | 7.2 K
107 | generator.resblocks.3.1.convs1.1 | Conv1d                   | 7.2 K
108 | generator.resblocks.3.1.convs1.2 | Conv1d                   | 7.2 K
109 | generator.resblocks.3.1.convs2   | ModuleList               | 21.7 K
110 | generator.resblocks.3.1.convs2.0 | Conv1d                   | 7.2 K
111 | generator.resblocks.3.1.convs2.1 | Conv1d                   | 7.2 K
112 | generator.resblocks.3.1.convs2.2 | Conv1d                   | 7.2 K
113 | generator.resblocks.3.2          | ResBlock1                | 68.0 K
114 | generator.resblocks.3.2.convs1   | ModuleList               | 34.0 K
115 | generator.resblocks.3.2.convs1.0 | Conv1d                   | 11.3 K
116 | generator.resblocks.3.2.convs1.1 | Conv1d                   | 11.3 K
117 | generator.resblocks.3.2.convs1.2 | Conv1d                   | 11.3 K
118 | generator.resblocks.3.2.convs2   | ModuleList               | 34.0 K
119 | generator.resblocks.3.2.convs2.0 | Conv1d                   | 11.3 K
120 | generator.resblocks.3.2.convs2.1 | Conv1d                   | 11.3 K
121 | generator.resblocks.3.2.convs2.2 | Conv1d                   | 11.3 K
122 | generator.conv_post              | Conv1d                   | 226
123 | mpd                              | MultiPeriodDiscriminator | 41.1 M
124 | mpd.discriminators               | ModuleList               | 41.1 M
125 | mpd.discriminators.0             | DiscriminatorP           | 8.2 M
126 | mpd.discriminators.0.convs       | ModuleList               | 8.2 M
127 | mpd.discriminators.0.convs.0     | Conv2d                   | 224
128 | mpd.discriminators.0.convs.1     | Conv2d                   | 20.7 K
129 | mpd.discriminators.0.convs.2     | Conv2d                   | 328 K
130 | mpd.discriminators.0.convs.3     | Conv2d                   | 2.6 M
131 | mpd.discriminators.0.convs.4     | Conv2d                   | 5.2 M
132 | mpd.discriminators.0.conv_post   | Conv2d                   | 3.1 K
133 | mpd.discriminators.1             | DiscriminatorP           | 8.2 M
134 | mpd.discriminators.1.convs       | ModuleList               | 8.2 M
135 | mpd.discriminators.1.convs.0     | Conv2d                   | 224
136 | mpd.discriminators.1.convs.1     | Conv2d                   | 20.7 K
137 | mpd.discriminators.1.convs.2     | Conv2d                   | 328 K
138 | mpd.discriminators.1.convs.3     | Conv2d                   | 2.6 M
139 | mpd.discriminators.1.convs.4     | Conv2d                   | 5.2 M
140 | mpd.discriminators.1.conv_post   | Conv2d                   | 3.1 K
141 | mpd.discriminators.2             | DiscriminatorP           | 8.2 M
142 | mpd.discriminators.2.convs       | ModuleList               | 8.2 M
143 | mpd.discriminators.2.convs.0     | Conv2d                   | 224
144 | mpd.discriminators.2.convs.1     | Conv2d                   | 20.7 K
145 | mpd.discriminators.2.convs.2     | Conv2d                   | 328 K
146 | mpd.discriminators.2.convs.3     | Conv2d                   | 2.6 M
147 | mpd.discriminators.2.convs.4     | Conv2d                   | 5.2 M
148 | mpd.discriminators.2.conv_post   | Conv2d                   | 3.1 K
149 | mpd.discriminators.3             | DiscriminatorP           | 8.2 M
150 | mpd.discriminators.3.convs       | ModuleList               | 8.2 M
151 | mpd.discriminators.3.convs.0     | Conv2d                   | 224
152 | mpd.discriminators.3.convs.1     | Conv2d                   | 20.7 K
153 | mpd.discriminators.3.convs.2     | Conv2d                   | 328 K
154 | mpd.discriminators.3.convs.3     | Conv2d                   | 2.6 M
155 | mpd.discriminators.3.convs.4     | Conv2d                   | 5.2 M
156 | mpd.discriminators.3.conv_post   | Conv2d                   | 3.1 K
157 | mpd.discriminators.4             | DiscriminatorP           | 8.2 M
158 | mpd.discriminators.4.convs       | ModuleList               | 8.2 M
159 | mpd.discriminators.4.convs.0     | Conv2d                   | 224
160 | mpd.discriminators.4.convs.1     | Conv2d                   | 20.7 K
161 | mpd.discriminators.4.convs.2     | Conv2d                   | 328 K
162 | mpd.discriminators.4.convs.3     | Conv2d                   | 2.6 M
163 | mpd.discriminators.4.convs.4     | Conv2d                   | 5.2 M
164 | mpd.discriminators.4.conv_post   | Conv2d                   | 3.1 K
165 | msd                              | MultiScaleDiscriminator  | 29.6 M
166 | msd.discriminators               | ModuleList               | 29.6 M
167 | msd.discriminators.0             | DiscriminatorS           | 9.9 M
168 | msd.discriminators.0.convs       | ModuleList               | 9.9 M
169 | msd.discriminators.0.convs.0     | Conv1d                   | 2.0 K
170 | msd.discriminators.0.convs.1     | Conv1d                   | 168 K
171 | msd.discriminators.0.convs.2     | Conv1d                   | 84.2 K
172 | msd.discriminators.0.convs.3     | Conv1d                   | 336 K
173 | msd.discriminators.0.convs.4     | Conv1d                   | 1.3 M
174 | msd.discriminators.0.convs.5     | Conv1d                   | 2.7 M
175 | msd.discriminators.0.convs.6     | Conv1d                   | 5.2 M
176 | msd.discriminators.0.conv_post   | Conv1d                   | 3.1 K
177 | msd.discriminators.1             | DiscriminatorS           | 9.9 M
178 | msd.discriminators.1.convs       | ModuleList               | 9.9 M
179 | msd.discriminators.1.convs.0     | Conv1d                   | 2.2 K
180 | msd.discriminators.1.convs.1     | Conv1d                   | 168 K
181 | msd.discriminators.1.convs.2     | Conv1d                   | 84.5 K
182 | msd.discriminators.1.convs.3     | Conv1d                   | 336 K
183 | msd.discriminators.1.convs.4     | Conv1d                   | 1.3 M
184 | msd.discriminators.1.convs.5     | Conv1d                   | 2.7 M
185 | msd.discriminators.1.convs.6     | Conv1d                   | 5.2 M
186 | msd.discriminators.1.conv_post   | Conv1d                   | 3.1 K
187 | msd.discriminators.2             | DiscriminatorS           | 9.9 M
188 | msd.discriminators.2.convs       | ModuleList               | 9.9 M
189 | msd.discriminators.2.convs.0     | Conv1d                   | 2.2 K
190 | msd.discriminators.2.convs.1     | Conv1d                   | 168 K
191 | msd.discriminators.2.convs.2     | Conv1d                   | 84.5 K
192 | msd.discriminators.2.convs.3     | Conv1d                   | 336 K
193 | msd.discriminators.2.convs.4     | Conv1d                   | 1.3 M
194 | msd.discriminators.2.convs.5     | Conv1d                   | 2.7 M
195 | msd.discriminators.2.convs.6     | Conv1d                   | 5.2 M
196 | msd.discriminators.2.conv_post   | Conv1d                   | 3.1 K
197 | msd.meanpools                    | ModuleList               | 0
198 | msd.meanpools.0                  | AvgPool1d                | 0
199 | msd.meanpools.1                  | AvgPool1d                | 0
200 | feature_loss                     | FeatureMatchingLoss      | 0
201 | discriminator_loss               | DiscriminatorLoss        | 0
202 | generator_loss                   | GeneratorLoss            | 0
--------------------------------------------------------------------------------
84.7 M    Trainable params
0         Non-trainable params
84.7 M    Total params
338.643   Total estimated model params size (MB)
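
The totals at the bottom of the summary can be cross-checked with simple arithmetic. The sketch below assumes that each parameter is stored as a 32-bit float (4 bytes), which is an assumption about this run rather than something the log states. All other numbers are copied from the summary.

```python
# Numbers copied from the parameter summary above (values in millions are rounded in the log).
top_level_millions = {"generator": 13.9, "mpd": 41.1, "msd": 29.6}
total_params_millions = 84.7
reported_size_mb = 338.643
bytes_per_param = 4  # assumption: weights stored as 32-bit floats

summed_millions = sum(top_level_millions.values())
estimated_size_mb = total_params_millions * bytes_per_param
implied_params_millions = reported_size_mb / bytes_per_param

print(f"sum of top-level modules: {summed_millions:.1f} M")        # 84.6 M
print(f"size from rounded total:  {estimated_size_mb:.1f} MB")      # 338.8 MB
print(f"params implied by size:   {implied_params_millions:.2f} M")  # 84.66 M
```

The small differences between these figures and the logged totals come from the rounding of each entry in the summary.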

As the model starts training, you should see a progress bar per epoch.

Epoch 0:   4%|▋               | 35/789 [00:37<13:05,  1.04s/it, g_l1_loss=1.240]
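
The total of 789 iterations in the progress bar is consistent with the dataset sizes in the training log and the batch sizes in the example spec. The sketch below assumes that the progress bar counts validation batches as well as training batches, and that training runs on a single GPU. The helper num_batches is hypothetical.

```python
import math


def num_batches(num_files, batch_size, drop_last):
    """Number of batches a dataloader yields for one pass over `num_files` items."""
    if drop_last:
        return num_files // batch_size
    return math.ceil(num_files / batch_size)


# File counts come from the "Dataset loaded with ..." lines in the training log;
# batch sizes and drop_last come from the example spec.
train_batches = num_batches(12500, batch_size=16, drop_last=False)
val_batches = num_batches(100, batch_size=16, drop_last=False)

print(train_batches, val_batches, train_batches + val_batches)  # 782 7 789
```

Setting drop_last to true would discard the final partial batch, giving 781 training batches instead of 782.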

At the end of training, TAO Toolkit will save the last checkpoint at the path specified by the experiment spec file before finishing.

[NeMo I 2021-01-20 22:38:48 train:120] Experiment logs saved to '$RESULTS_DIR/vocoder/train'
[NeMo I 2021-01-20 22:38:48 train:123] Trained model saved to '$RESULTS_DIR/vocoder/train/checkpoints/trained-model.tlt'
INFO: Internal process exited

Current Limitations

Currently, only .wav audio files are supported.
The vocoder can only be trained from scratch.

Running Inference on a Model

To perform inference on Mel spectrograms, such as those produced by the spectrogram generator, use the following command:

tao vocoder infer -e <experiment_spec> \
    -m <model_checkpoint.tlt> \
    -g <num_gpus> \
    -k $KEY \
    -r </path/to/results/directory/for/logs> \
    output_path=</path/to/result/directory/for/vocoder/inference> \
    input_path=</path/to/the/input/spectrogram/from/spectro_gen/infer>

Required Arguments

-e: The experiment specification file to set up inference. The logged configuration in the Inference Procedure section shows the fields used for this task: input_path, output_path, and sample_rate.
-m: The path to the model checkpoint, which should be a .tlt file.
-k: The encryption key used to load the encrypted model

Optional Arguments

-g: The number of GPUs to use for inference in a multi-GPU scenario. The default value is 1.
-r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.
Other arguments to override fields in the specification file.

Inference Procedure

At the start of inference, TAO Toolkit will print out the experiment specification, including the input and output paths for inference.

When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. This will show the vocabulary that the model was trained on.

[NeMo W 2021-11-02 23:53:51 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/torchaudio-0.7.0a0+42d447d-py3.8-linux-x86_64.egg/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
      warnings.warn(

[NeMo W 2021-11-02 23:53:51 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-02 23:53:54 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/torchaudio-0.7.0a0+42d447d-py3.8-linux-x86_64.egg/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
      warnings.warn(

[NeMo W 2021-11-02 23:53:55 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-02 23:53:55 nemo_logging:349] /home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tts/vocoder/scripts/infer.py:90: UserWarning:
    'infer.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.

[NeMo I 2021-11-02 23:53:55 tlt_logging:20] Experiment configuration:
    restore_from: /results/vocoder/train/checkpoints/trained-model.tlt
    exp_manager:
      task_name: infer
      explicit_log_dir: /results/vocoder/infer
    input_path: /results/spectro_gen/infer/spectro
    output_path: /results/vocoder/infer/wav
    sample_rate: 22050
    encryption_key: '***'

[NeMo W 2021-11-02 23:53:55 exp_manager:26] Exp_manager is logging to `/results/vocoder/infer``, but it already exists.
[NeMo I 2021-11-02 23:54:06 features:252] PADDING: 0
[NeMo I 2021-11-02 23:54:06 features:269] STFT using torch
[NeMo I 2021-11-02 23:54:06 features:271] STFT using exact pad
[NeMo I 2021-11-02 23:54:06 features:252] PADDING: 0
[NeMo I 2021-11-02 23:54:06 features:269] STFT using torch
[NeMo I 2021-11-02 23:54:06 features:271] STFT using exact pad
[NeMo I 2021-11-02 23:54:13 infer:73] The prediction results:
[NeMo I 2021-11-02 23:54:14 infer:83] Predicted audio: /results/vocoder/infer/wav/0.wav
[NeMo I 2021-11-02 23:54:15 infer:83] Predicted audio: /results/vocoder/infer/wav/1.wav
[NeMo I 2021-11-02 23:54:15 infer:83] Predicted audio: /results/vocoder/infer/wav/2.wav
[NeMo I 2021-11-02 23:54:15 infer:86] Experiment logs saved to '/results/vocoder/infer'
2021-11-02 16:54:17,240 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The paths to the audio files generated by the infer task are shown in the last lines of the log.
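As a quick sanity check on the generated audio, the following minimal sketch lists each .wav file in the output directory with its sample rate and duration. It is not part of TAO Toolkit: the function name summarize_wavs is hypothetical, the expected rate of 22050 Hz is taken from the sample_rate field in the log above, and the files are assumed to be PCM-encoded (the Python standard-library wave module cannot read floating-point WAV data).

```python
import wave
from pathlib import Path


def summarize_wavs(wav_dir, expected_rate=22050):
    """Return (file name, sample rate, duration in seconds) for each .wav file.

    Hypothetical helper, not a TAO Toolkit API. Raises ValueError when a
    file's sample rate differs from expected_rate.
    """
    summary = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        # The stdlib wave module only reads PCM-encoded WAV files.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            seconds = wav.getnframes() / rate
        if rate != expected_rate:
            raise ValueError(
                f"{path.name}: sample rate {rate}, expected {expected_rate}"
            )
        summary.append((path.name, rate, seconds))
    return summary


if __name__ == "__main__":
    # Directory taken from the sample log; replace with your own output_path.
    for name, rate, seconds in summarize_wavs("/results/vocoder/infer/wav"):
        print(f"{name}: {rate} Hz, {seconds:.2f} s")
```

If the files turn out not to be PCM-encoded, a third-party reader would be needed in place of the wave module.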

Current Limitations

Currently, only .wav audio files are generated.

Fine-Tuning the Model

To fine-tune a model from a checkpoint, use the following command:


            
            tao vocoder finetune \
            -e <experiment_spec> \
            -g <num_gpus> \
            -m <model_checkpoint> \
            train_dataset=<train.json> \
            validation_dataset=<val.json> \
            trainer.max_steps=1000

Required Arguments

-e: The experiment specification file to set up fine-tuning.
-m: Path to the model checkpoint from which to fine-tune. Should be a .tlt file.
train_dataset: The path to the training manifest. See the section below on fine-tuning data.
validation_dataset: The path to the validation manifest.
trainer.max_steps: The number of steps used to fine-tune the model. We recommend adding 500 steps for each minute of audio in the fine-tuning data.

Optional Arguments

-g: The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).
-r: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.
Other arguments to override fields in the specification file.
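The recommendation of 500 steps per minute of fine-tuning data can be turned into a value for trainer.max_steps with a short script. The sketch below is illustrative only: recommended_max_steps is a hypothetical helper, and it assumes the manifest is a JSON-lines file in which every entry has a "duration" field in seconds. Rounding up to a whole step is a choice made here, not something the recommendation specifies.

```python
import json
import math


def recommended_max_steps(manifest_path, steps_per_minute=500):
    """Estimate trainer.max_steps from the total audio duration in a manifest.

    Hypothetical helper. Assumes one JSON object per line, each with a
    "duration" field in seconds.
    """
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            if line.strip():  # ignore blank lines
                total_seconds += float(json.loads(line)["duration"])
    # Convert seconds to minutes, scale, and round up to a whole step.
    return math.ceil(total_seconds / 60.0 * steps_per_minute)
```

The result can then be passed on the command line as trainer.max_steps=<value>.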

Fine-Tuning Dataset

For best results with a FastPitch and HiFiGAN combination, fine-tune HiFiGAN on the outputs of a fine-tuned FastPitch model. To do this, you need a fine-tuned FastPitch model: run inference with it, then update the training .json file so that each .wav entry has a mel_filepath attribute.

First, run inference with FastPitch:


            
            tao spectro_gen infer \
            -e <experiment_spec> \
            -g <num_gpus> \
            -m <model_checkpoint> \
            output_path=<An empty directory where the specs are saved> \
            speaker=1 \
            mode="infer_hifigan_ft" \
            input_json=<train.json>

The important arguments are:

output_path: The directory where the spectrograms are saved
input_json: The .json file that contains the finetuning data
mode="infer_hifigan_ft": Must be specified
speaker: The FastPitch speaker ID. This should be 1 in most cases.

After the command finishes, output_path should contain files such as 1.npy, 2.npy, and so on. For each line in input_json, add a mel_filepath attribute that points to the corresponding saved spectrogram. For example, line 1 in input_json should have "mel_filepath": "<PATH_TO_OUTPUT_PATH>/1.npy".
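The manifest update can be scripted. The following is a minimal sketch under stated assumptions, not a TAO Toolkit utility: add_mel_filepaths is a hypothetical helper, the manifest is assumed to hold one JSON object per line, and entry i is assumed to pair with <i>.npy counting from 1, as in the example above. Check the file names in your own output_path and adjust first_index if they start from a different number.

```python
import json
from pathlib import Path


def add_mel_filepaths(input_json, mel_dir, output_json, first_index=1):
    """Copy a JSON-lines manifest, adding "mel_filepath" to every entry.

    Hypothetical helper. Entry number i (starting at first_index) is paired
    with <mel_dir>/<i>.npy. Returns the number of entries written.
    """
    mel_dir = Path(mel_dir)
    index = first_index
    with open(input_json, encoding="utf-8") as src, \
            open(output_json, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # assumption: blank lines are not manifest entries
            mel_path = mel_dir / f"{index}.npy"
            if not mel_path.is_file():
                raise FileNotFoundError(
                    f"missing spectrogram for entry {index}: {mel_path}"
                )
            entry = json.loads(line)
            entry["mel_filepath"] = str(mel_path)
            dst.write(json.dumps(entry, ensure_ascii=False) + "\n")
            index += 1
    return index - first_index
```

Writing to a separate output file keeps the original manifest intact, so the step can be repeated if the spectrograms are regenerated.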

Now you can run HiFiGAN fine-tuning using your updated input_json as train_dataset.

Model Export

You can export a trained HiFiGAN model to Riva format, which contains all the model artifacts necessary for deployment to Riva Services. For more details about Riva, see this page.

To export a HiFiGAN model to the Riva format, use the following command:


            
            tao vocoder export -e <experiment_spec> \
                       -m <model_checkpoint> \
                       -r <results_dir> \
                       -k <encryption_key> \
                       export_format=RIVA \
                       export_to=<filename.riva>

Required Arguments

-e: The experiment specification file for export. See the Export Spec File section below for more details.
-m: The path to the model checkpoint to be exported, which should be a .tlt file
-k: The encryption key

Optional Arguments

-r: The path to the directory where results will be stored.
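When the export step is driven from a script, the command can be assembled as an argument list rather than a shell string. The sketch below is an illustration only: build_export_command is a hypothetical helper that uses just the flags and overrides documented above, and it only builds the command without running it.

```python
import shlex


def build_export_command(spec, checkpoint, key, export_to, results_dir=None):
    """Build the argument list for a `tao vocoder export` call.

    Hypothetical helper that mirrors the documented arguments; it does not
    execute anything.
    """
    if not checkpoint.endswith(".tlt"):
        raise ValueError(f"checkpoint should be a .tlt file: {checkpoint}")
    cmd = ["tao", "vocoder", "export", "-e", spec, "-m", checkpoint]
    if results_dir is not None:
        cmd += ["-r", results_dir]  # -r is optional
    cmd += ["-k", key, "export_format=RIVA", f"export_to={export_to}"]
    return cmd


if __name__ == "__main__":
    command = build_export_command(
        "export.yaml", "trained-model.tlt", "<encryption_key>", "vocoder.riva"
    )
    # Print a shell-quoted form for review before running it.
    print(shlex.join(command))
```

The list can be passed to subprocess.run(command, check=True) once the placeholder values are replaced. Passing a list avoids shell quoting problems with the key or with paths that contain spaces.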

Export Spec File

The following is an example spec file for model export:


            
            # Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format: RIVA
export_format: RIVA

# Output EFF archive containing model checkpoint and artifacts required for Riva Services
export_to: exported-model.riva

Parameter	Datatype	Description	Default
`restore_from`	string	The path to the pre-trained model to be exported	`trained-model.tlt`
`export_format`	string	The export format	N/A
`export_to`	string	The target path for the export model	`exported-model.riva`

A successful run of the model export generates the following log:


            
            [NeMo W 2021-11-02 23:56:28 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/torchaudio-0.7.0a0+42d447d-py3.8-linux-x86_64.egg/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
      warnings.warn(

[NeMo W 2021-11-02 23:56:29 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-02 23:56:32 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/torchaudio-0.7.0a0+42d447d-py3.8-linux-x86_64.egg/torchaudio/backend/utils.py:53: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
      warnings.warn(

[NeMo W 2021-11-02 23:56:32 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-02 23:56:33 nemo_logging:349] /home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tts/vocoder/scripts/export.py:85: UserWarning:
    'export.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.

[NeMo I 2021-11-02 23:56:33 tlt_logging:20] Experiment configuration:
    restore_from: /results/vocoder/train/checkpoints/trained-model.tlt
    export_to: vocoder.riva
    export_format: RIVA
    exp_manager:
      task_name: export
      explicit_log_dir: /results/vocoder/export
    encryption_key: '**********'

[NeMo W 2021-11-02 23:56:33 exp_manager:26] Exp_manager is logging to `/results/vocoder/export``, but it already exists.
[NeMo I 2021-11-02 23:56:43 features:252] PADDING: 0
[NeMo I 2021-11-02 23:56:43 features:269] STFT using torch
[NeMo I 2021-11-02 23:56:43 features:271] STFT using exact pad
[NeMo I 2021-11-02 23:56:43 features:252] PADDING: 0
[NeMo I 2021-11-02 23:56:43 features:269] STFT using torch
[NeMo I 2021-11-02 23:56:43 features:271] STFT using exact pad
[NeMo I 2021-11-02 23:56:50 export:57] Model restored from '/results/vocoder/train/checkpoints/trained-model.tlt'
Removing weight norm...
[NeMo I 2021-11-02 23:57:03 export:72] Experiment logs saved to '/results/vocoder/export'
[NeMo I 2021-11-02 23:57:03 export:73] Exported model to '/results/vocoder/export/vocoder.riva'
[NeMo I 2021-11-02 23:57:04 export:80] Exported model is compliant with Riva