Voice Command Training

Voice command training involves training a keyword detection model for a fixed set of keywords defined by the user.

This process has three phases: Dataset Collection, Data Pre-processing, and Model Training.


A microphone is required to record the audio clips. Supported microphones are:

  • Any headphones with built-in microphones

  • Built-in computer microphones

  • Multichannel microphone arrays. These must be able to provide post-processed, mixed, or mono-channel data.

Place the microphone so that it captures good-quality recordings in a clean environment, and follow established best practices for recording audio data.


Any audio recording software can be used to generate the audio clips, such as Audacity.

To record an audio clip using Audacity:

  1. Before opening Audacity, plug in the microphone.

  2. Open the Preferences Dialog Box: Edit > Preferences (or Ctrl + P)

  3. On the Devices tab, under Recording, select the microphone from the pull-down menu, and on the Channels menu select 1 (Mono).

  4. On the Quality tab, check that the Default sample rate is 16000 Hz, and change Default sample format to 16-bit.

  5. Press the Record button to start recording.

  6. Press the Stop button to stop recording after speaking the keyword.

  7. Export the audio clip as WAV (Microsoft) signed 16-bit PCM.

Training and Validation Dataset Generation

Set up your microphone as explained in the sections above, and finalize the list of keywords to be trained. Record one keyword at a time as separate audio clips, with at least ten clips per keyword per speaker. Record at least 20 speakers.

Generate audio clips for the Unknown keyword class. Refer to the Best Practices for details. Split the generated clips into Training (90%) and Validation (10%) datasets. Both datasets should follow the directory structure <dataset_root>/<keyword>/<audio_clip>.wav.

Audio clips of a specific keyword from all speakers should be placed in a directory named after the keyword in the respective dataset directory. The directory for the Unknown keyword class should always be named unknownkeywords.

All audio clips must be 16 kHz, 16-bit mono WAV files. The names of keyword directories and audio clips must not contain spaces or special characters other than hyphens and underscores.
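The format and naming rules above can be checked before training. The following is a minimal sketch that uses only the Python standard library; `check_clip` is a hypothetical helper, not part of the training application, and it assumes that "no special characters" means names limited to ASCII letters, digits, hyphens, and underscores.

```python
import re
import wave
from pathlib import Path

# Assumption: allowed names consist of ASCII letters, digits, hyphen, underscore.
NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def check_clip(path):
    """Return a list of problems found for one audio clip (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix != ".wav":
        problems.append("extension is not .wav")
    if not NAME_RE.match(path.stem):
        problems.append("file name contains disallowed characters")
    if not NAME_RE.match(path.parent.name):
        problems.append("keyword directory name contains disallowed characters")
    try:
        with wave.open(str(path), "rb") as wav:
            if wav.getframerate() != 16000:
                problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
            if wav.getsampwidth() != 2:
                problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected 16")
            if wav.getnchannels() != 1:
                problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
    except wave.Error:
        problems.append("not a readable PCM WAV file")
    return problems
```

Running `check_clip` over every file under <dataset_root>/<keyword>/ and printing any non-empty result is one way to catch mis-exported clips early.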

Noise Profile Generation

You must provide noise profiles for the targeted deployment environments. The profiles should include audio recordings of the different background noises present in the target deployment environment, for example, fan noise, machine sounds, background chatter, and music.

Some pre-recorded noise profiles can be found in the UrbanSound dataset.

Record the targeted noises as WAV files using Audacity or any other recording software. Each noise clip should be at least one minute long. All noise clips must be 16 kHz, 16-bit mono WAV files. The names of noise clips must not contain spaces or special characters other than hyphens and underscores.
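The one-minute minimum is easy to verify from the WAV header. This is a small sketch using the standard `wave` module; the function names are hypothetical and are not part of the training application.

```python
import wave


def noise_clip_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, minimum_seconds=60.0):
    """True if the noise clip meets the one-minute minimum."""
    return noise_clip_seconds(path) >= minimum_seconds
```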

Best Practices

The dataset should contain recordings from speakers with diverse accents in both training and validation datasets. A larger dataset in terms of both number of speakers and number of audio clips per speaker improves the accuracy of the trained model.

Training and validation datasets should come from different speakers to get a reliable validation accuracy. Do not repeat the same audio clips, because duplicates do not improve the model. For better accuracy, each audio clip should be longer than the configured keyword duration. Remove silence and any other speech from the beginning of each audio clip.
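One way to satisfy both the 90%/10% split and the different-speakers rule is to split on speakers rather than on individual clips. The sketch below assumes you already have a list of speaker identifiers; how those identifiers are derived from your clips is up to you, and `split_speakers` is a hypothetical helper. The resulting clip ratio is only approximately 90/10, and only if speakers contribute similar numbers of clips.

```python
import random


def split_speakers(speakers, validation_fraction=0.1, seed=0):
    """Split speaker IDs into disjoint (training, validation) lists."""
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)  # fixed seed keeps the split reproducible
    if len(unique) > 1:
        # Keep at least one validation speaker.
        n_val = max(1, round(len(unique) * validation_fraction))
    else:
        n_val = 0
    return unique[n_val:], unique[:n_val]
```

With the minimum of 20 speakers, this yields 18 training speakers and 2 validation speakers.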

The keyword clips should not contain any other word for more than 30% of the configured keyword duration. Audio clips provided for the Unknown keyword class can include:

  • Words that are not part of the target detection keyword set.

  • Random speech clips which do not contain the target keywords.

  • Any other sounds, which are expected in the target deployment environment.

Ideally, the Unknown keyword class in the training dataset should be at least as large as all other keyword classes combined. If n noise profiles/clips are provided, the Unknown keyword class in the validation dataset should be at least \((2n+30)\) times larger than all other keyword classes combined.
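The sizing guidance can be turned into a quick calculation. The sketch below assumes that "size" is measured in number of clips, which the text does not state explicitly; the function names are hypothetical.

```python
def min_unknown_training(keyword_clip_counts):
    """Minimum Unknown-class clips for training: the sum of all other classes."""
    return sum(keyword_clip_counts.values())


def min_unknown_validation(keyword_clip_counts, n_noise_profiles):
    """Minimum Unknown-class clips for validation: (2n + 30) times the other classes."""
    return (2 * n_noise_profiles + 30) * sum(keyword_clip_counts.values())
```

For example, with three keywords of 20 validation clips each and 4 noise profiles, the multiplier is 2 * 4 + 30 = 38, so the validation set would need at least 38 * 60 = 2280 Unknown clips.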

Data pre-processing involves data augmentation and feature extraction. The input dataset is first augmented with several transformations, such as:

  • Mixing input noise profiles at varying intensities

  • Time stretching

  • Pitch shifting

  • Dynamic range compression (DRC)
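To illustrate the first augmentation, the sketch below mixes a noise clip into a speech clip at a randomly chosen gain. It is not the application's implementation: the uniform choice of gain, the looping of short noise clips, and the clipping step are assumptions, and samples are assumed to be floats in [-1.0, 1.0]. The default gain bounds mirror the documented defaults of --minimum_noise_gain and --maximum_noise_gain.

```python
import random


def mix_noise(speech, noise, gain):
    """Add gain-scaled noise to speech; samples are floats in [-1.0, 1.0]."""
    mixed = []
    for i, sample in enumerate(speech):
        # Loop the noise clip if it is shorter than the speech clip.
        value = sample + gain * noise[i % len(noise)]
        # Clip so the result stays in the valid sample range.
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed


def augment_with_noise(speech, noise, min_gain=0.1, max_gain=0.4, rng=random):
    """Mix noise at a gain drawn uniformly from [min_gain, max_gain]."""
    gain = rng.uniform(min_gain, max_gain)
    return mix_noise(speech, noise, gain)
```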

These augmentations help in generalizing the dataset for different environments and speaking styles. The spectral features of this augmented data are then extracted. The extracted feature set includes:

  • Mel-Frequency Spectrogram

  • First order Delta of Mel spectrogram

  • Second order Delta of Mel spectrogram

This process outputs extracted features.
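For readers unfamiliar with delta features, the sketch below computes a first-order delta with the common regression formula and obtains the second-order delta by applying it twice. The exact formula and window width used by the application are not documented here, so treat `delta` and its `width` parameter as assumptions for illustration.

```python
def delta(frames, width=2):
    """Regression-based delta of time-major features.

    frames: list of frames, each a list of feature values (e.g. Mel bands).
    Edge frames are repeated where the window extends past the clip.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def delta_delta(frames, width=2):
    """Second-order delta: the delta of the first-order delta."""
    return delta(delta(frames, width), width)
```

The three feature sets (Mel spectrogram, `delta(mel)`, and `delta_delta(mel)`) have the same shape, so they can be stacked as channels of a single input.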

A deep learning (DL) architecture maps the extracted features to a probability for each keyword in the dataset. The training phase trains this keyword detection network on the features extracted in the data pre-processing stage. The network is trained to converge on the best validation dataset accuracy.

The Voice Command Training application outputs the Keyword Detection Model and Metadata files upon successful execution. These two files are used by the Voice Command Detection feature for recognizing the commands.

Model training has the following limitations:

  • The DL architecture used can detect up to 20 keywords with high accuracy. Accuracy might decrease as the number of keywords increases.

  • For better performance, the keywords should be of approximately equal length. Large variations in keyword length degrade performance.

  • Minimum keyword duration is 100 ms and maximum is 1000 ms for reliable detection. This duration should be larger than a single audio packet duration.

  • A microphone which captures audio data at high SNR is required for reliable detection.

Start the training application by running the following command from the Isaac SDK root directory:


bob@desktop:~/isaac$ bazel run apps/samples/voice_command_detection:training -- <training_options>

The application supports the following training options:


Absolute path to the training dataset.

--validation_dataset_path VALIDATION_DATASET_PATH

Absolute path to the validation dataset.

-n, --augment-noise

Enable noise augmentation for training. Default: disabled

--noise_profile_path NOISE_PROFILE_PATH

Absolute path to the noise profiles (wav files).

--tmpdir TMPDIR

Path to a directory where the processed data and checkpoints are temporarily stored. Default: /tmp

--logdir LOGDIR

Path to directory where training logs are stored for Tensorboard usage. Default: <tmpdir>/logs


Path to directory where the trained model and metadata are stored.


List of keywords to be detected. Keywords can be separated by commas or passed with repeated options. For example: -k carter,look,stop or -k carter look -k stop

--keyword_duration KEYWORD_DURATION

Duration of keywords in seconds in the range [0.1, 1]. Default: 0.5

--training_epochs TRAINING_EPOCHS

Number of epochs to run the training. Default: 100

--batch_size BATCH_SIZE

Batch size used for training. Default: 32

--minimum_noise_gain MINIMUM_NOISE_GAIN

Minimum noise gain applied during noise augmentation. Default: 0.1

--maximum_noise_gain MAXIMUM_NOISE_GAIN

Maximum noise gain applied during noise augmentation. Default: 0.4

--learning_rate LEARNING_RATE, --lr LEARNING_RATE

Learning rate used for Adamax optimizer. Default: 1e-5

--dropout DROPOUT

Dropout value used for training the network. Default: 0.3

--checkpoint CHECKPOINT

Keras checkpoint to be loaded to continue training. This assumes that the extracted features are available in <tmpdir>/features/. By default, no checkpoint is loaded and training starts fresh.

-e EPOCH_NUMBER, --epoch_number EPOCH_NUMBER

Epoch at which to start training when resuming from checkpoint. Default: 0

--gpu_memory_usage GPU_MEMORY_USAGE

Limit on GPU memory usage, as a fraction in the range [0, 1]. Default: 0 (no limit)

--config_filename CONFIG_FILENAME

Path to a JSON file containing the configuration parameters. Command-line arguments take priority over values in this file.

-h, --help

Show help message and exit.
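The documented ranges can be checked before launching a long training run. The sketch below is a hypothetical wrapper, not part of the SDK: `check_options` validates a few values against the ranges listed above, and `training_args` assembles an argument list using only flags that appear in this list. The rule that the minimum gain must not exceed the maximum, and the pairing of -n with --noise_profile_path, are assumptions. Options whose flag names are not shown above can be passed through `extra_args` unchanged.

```python
def check_options(keyword_duration=0.5, gpu_memory_usage=0.0,
                  minimum_noise_gain=0.1, maximum_noise_gain=0.4):
    """Return a list of problems with the given option values (empty if none)."""
    problems = []
    if not 0.1 <= keyword_duration <= 1.0:
        problems.append("keyword_duration must be in [0.1, 1]")
    if not 0.0 <= gpu_memory_usage <= 1.0:
        problems.append("gpu_memory_usage must be in [0, 1]")
    # Assumption: the minimum gain should not exceed the maximum gain.
    if minimum_noise_gain > maximum_noise_gain:
        problems.append("minimum_noise_gain exceeds maximum_noise_gain")
    return problems


def training_args(validation_dataset_path, keywords, noise_profile_path=None,
                  keyword_duration=0.5, extra_args=()):
    """Build the <training_options> part of the command as a list of strings."""
    args = ["--validation_dataset_path", str(validation_dataset_path),
            "-k", ",".join(keywords),
            "--keyword_duration", str(keyword_duration)]
    if noise_profile_path is not None:
        # Assumption: noise augmentation is enabled whenever profiles are given.
        args += ["-n", "--noise_profile_path", str(noise_profile_path)]
    return args + list(extra_args)
```

The returned list can be appended after the `--` separator of the bazel command shown earlier, for example through `subprocess.run`.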

The training application generates the model and its corresponding metadata file in the specified output folder. To use these for Voice Command Detection, update the configuration of the application to point to this model and use the metadata file as a secondary configuration file.

The metadata file provides placeholders for the node names of each of the three codelets: Voice Command Feature Extraction, Tensorflow Inference, and Voice Command Construction. Update these placeholders with the corresponding node names.

Note that if two or all three of these codelets share the same node, merge them under a single node name. Providing the same node name separately for each of these codelets causes a mismatch in configuration.

© Copyright 2018-2020, NVIDIA Corporation. Last updated on Oct 31, 2023.