Voice Command Training

Voice command training involves training a keyword detection model for a fixed set of keywords defined by the user.

This process has three phases: Dataset Collection, Data Pre-processing, and Model Training.


A microphone is required to record the audio clips. Supported microphones are:

  • Any headphones with built-in microphones

  • Built-in computer microphones

  • Multichannel microphone arrays. These should have the capability to provide post-processed, mixed, or mono channel data.

Place the microphone so that it captures good-quality recordings in a quiet environment, and follow established best practices for recording audio data.


Any audio recording software can be used to generate the audio clips, such as Audacity.

To record an audio clip using Audacity:

  1. Before opening Audacity, plug in the microphone.

  2. Open the Preferences Dialog Box: Edit > Preferences (or Ctrl + P)

  3. On the Devices tab, under Recording, select the microphone from the pull-down menu, and on the Channels menu select 1 (Mono).

  4. On the Quality tab, check that the Default sample rate is 16000 Hz, and change Default sample format to 16-bit.

  5. Press the Record button to start recording.

  6. Press the Stop button to stop recording after speaking out the keyword.

  7. Export the audio clip as WAV (Microsoft) signed 16-bit PCM.

Training and Validation Dataset Generation

Set up your microphone as explained in the sections above, and finalize the list of keywords to be trained. Record each keyword as a separate audio clip, with at least ten clips per keyword per speaker. Record at least 20 speakers.

Generate audio clips for the Unknown keyword class; refer to Best Practices for details. Split the generated clips into Training (90%) and Validation (10%) datasets. Both training and validation datasets should follow the directory structure / / .wav .

Audio clips of a specific keyword from all speakers should be placed in a directory named after the keyword in the respective dataset directory. The directory for the Unknown keyword class must always be named unknownkeywords.

All audio clips must be 16 kHz, 16-bit mono WAV files. Spaces and special characters (except hyphen and underscore) are not allowed in the names of keyword directories or audio clips.
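The format and naming rules above can be checked before training. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the reading of "no special characters" as ASCII letters, digits, hyphen, and underscore is an assumption.

```python
import re
import wave


def is_valid_name(name: str) -> bool:
    """Return True if a keyword directory or clip name (without extension) is allowed.

    Assumption: "no special characters" means ASCII letters and digits only,
    plus the hyphen and underscore that the text explicitly permits.
    """
    return re.fullmatch(r"[A-Za-z0-9_-]+", name) is not None


def check_wav_format(path: str) -> list:
    """Return a list of format problems; an empty list means the clip conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000 Hz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected 16")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
    return problems
```

Running `check_wav_format` over every clip in both dataset directories is one way to catch export mistakes, such as stereo files or a wrong sample rate, early.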

Noise Profile Generation

You must provide noise profiles for the targeted deployment environments. The profiles should include audio recordings of the different background noises present in the target deployment environment, for example, fan noise, sounds from different machines, background chatter, and music.

Some pre-recorded noise profiles can be found in the UrbanSound dataset.

Record the targeted noises as WAV files using Audacity or any other recording software. Each noise clip should be at least one minute long. All noise clips must be 16 kHz, 16-bit mono WAV files. Spaces and special characters (except hyphen and underscore) are not allowed in the names of noise clips.

Best Practices

The dataset should contain recordings from speakers with diverse accents in both training and validation datasets. A larger dataset in terms of both number of speakers and number of audio clips per speaker improves the accuracy of the trained model.

Training and validation datasets should come from different speakers to get a reliable validation accuracy. Do not repeat the same audio clips; duplicates do not improve the model. For better accuracy, audio clips should be longer than the configured keyword duration. Remove silence and any other speech from the beginning of audio clips.
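One way to keep speakers separate between the two datasets is to split by speaker rather than by clip. The sketch below assumes you already have a mapping from a speaker ID to that speaker's clips; how speakers are tracked is not specified here, so the mapping and the function name are hypothetical.

```python
import random


def split_by_speaker(clips_by_speaker, validation_fraction=0.1, seed=0):
    """Split clips so that no speaker appears in both training and validation.

    clips_by_speaker maps a speaker ID to that speaker's list of clip paths
    (a hypothetical structure, used here only for illustration).
    """
    speakers = sorted(clips_by_speaker)
    random.Random(seed).shuffle(speakers)
    # Hold out at least one speaker for validation.
    n_val = max(1, round(len(speakers) * validation_fraction))
    val_speakers = set(speakers[:n_val])
    train, val = [], []
    for speaker, clips in clips_by_speaker.items():
        (val if speaker in val_speakers else train).extend(clips)
    return train, val
```

With 20 speakers and ten clips each, this holds out two speakers, which gives the 90%/10% split. If speakers contribute different numbers of clips, the ratio is only approximate.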

The keyword clips should not contain any other word for more than 30% of the configured keyword duration. Audio clips provided for the Unknown keyword class can include:

  • Words that are not part of the target detection keyword set.

  • Random speech clips which do not contain the target keywords.

  • Any other sounds, which are expected in the target deployment environment.

Ideally, the Unknown keyword class in the training dataset should be at least as large as the rest of the keyword classes combined. If n noise profiles/clips are provided, the Unknown keyword class in the validation dataset should be at least \((2n + 30)\) times larger than the rest of the keyword classes combined.
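As a worked example of the validation guideline, the helper below computes the minimum size of the Unknown keyword class. It assumes that class size is measured in number of clips, which the text does not state explicitly; the function name is hypothetical.

```python
def min_unknown_validation_clips(noise_profiles: int, keyword_clips: int) -> int:
    """Minimum Unknown-class clips for the validation set under the (2n + 30) guideline.

    noise_profiles: number of noise profiles/clips provided (n).
    keyword_clips: total validation clips across all other keyword classes.
    Assumption: "size" is counted in clips.
    """
    return (2 * noise_profiles + 30) * keyword_clips


# Example: 5 noise profiles and 40 keyword clips in the validation set.
print(min_unknown_validation_clips(5, 40))  # (2*5 + 30) * 40 = 1600
```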

Data pre-processing involves data augmentation and feature extraction. The input dataset is first modified with multiple augmentations, such as:

  • Mixing input noise profiles at varying intensities

  • Time stretching

  • Pitch shifting

  • Dynamic range compression (DRC)
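To illustrate the first augmentation in the list, the sketch below mixes a noise recording into a clip at a random gain. It is not the application's implementation: the function name and the exact mixing scheme are assumptions, and the gain bounds simply reuse the documented defaults of the --minimum_noise_gain and --maximum_noise_gain options.

```python
import random


def mix_noise(clip, noise, min_gain=0.1, max_gain=0.4, rng=None):
    """Add a randomly scaled noise segment to a clip of 16-bit PCM samples.

    clip and noise are sequences of ints in [-32768, 32767].
    The mixing scheme is an illustrative assumption.
    """
    rng = rng or random.Random()
    gain = rng.uniform(min_gain, max_gain)
    # Start at a random offset in the noise and wrap around if the clip is longer.
    start = rng.randrange(len(noise))
    mixed = []
    for i, sample in enumerate(clip):
        value = sample + gain * noise[(start + i) % len(noise)]
        # Clamp to the 16-bit range to avoid overflow.
        mixed.append(max(-32768, min(32767, int(round(value)))))
    return mixed
```

Calling `mix_noise` several times on the same clip with different noise recordings yields multiple augmented variants of one recording.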

These augmentations help in generalizing the dataset for different environments and speaking styles. The spectral features of this augmented data are then extracted. The extracted feature set includes:

  • Mel-Frequency Spectrogram

  • First order Delta of Mel spectrogram

  • Second order Delta of Mel spectrogram

This process outputs extracted features.
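The first- and second-order deltas describe how the Mel spectrogram changes over time. The sketch below uses a simple central difference between neighbouring frames. The application's actual delta computation (for example, its window width) is not specified here, so treat this as an illustration of the idea only; the function names are hypothetical.

```python
def delta(frames):
    """First-order time difference of a spectrogram given as a list of frames.

    Each frame is a list of per-band values. Uses a central difference, with
    the edge frames repeated at the boundaries (an illustrative choice).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def mel_with_deltas(mel):
    """Return the Mel spectrogram with its first- and second-order deltas."""
    first = delta(mel)
    second = delta(first)  # the delta of the delta
    return mel, first, second
```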

A deep learning (DL) architecture is designed to map the extracted features to a probability for each keyword in the dataset. The training phase trains the keyword detection network on the features extracted in the data pre-processing stage. The network is trained until it converges to the best validation accuracy.

The Voice Command Training application outputs the Keyword Detection Model and Metadata files upon successful execution. These two files are used by the Voice Command Detection feature for recognizing the commands.

Model training has the following limitations:

  • The DL architecture used can detect up to 20 keywords with high accuracy. Accuracy might decrease as the number of keywords increases.

  • For better performance, the keywords should be of approximately equal length. Large variations in keyword length degrade performance.

  • Minimum keyword duration is 100 ms and maximum is 1000 ms for reliable detection. This duration should be larger than a single audio packet duration.

  • A microphone that captures audio data at a high signal-to-noise ratio (SNR) is required for reliable detection.
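The numeric limits above can be checked before starting a long training run. This is a minimal sketch with a hypothetical function name; the audio packet duration depends on your audio pipeline and is passed in as a parameter rather than assumed.

```python
def check_training_limits(keywords, keyword_duration, packet_duration):
    """Return a list of warnings for settings outside the documented limits.

    keyword_duration and packet_duration are in seconds. packet_duration is
    supplied by the caller because its value is not given in this document.
    """
    warnings = []
    if len(keywords) > 20:
        warnings.append(f"{len(keywords)} keywords: accuracy may drop above 20")
    if not 0.1 <= keyword_duration <= 1.0:
        warnings.append("keyword duration should be between 0.1 s and 1.0 s")
    if keyword_duration <= packet_duration:
        warnings.append("keyword duration should exceed one audio packet duration")
    return warnings
```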

To start the training application, run the following command from the Isaac SDK root directory:


bob@desktop:~/isaac$ bazel run apps/samples/voice_command_detection:training -- <training_options>

The application supports the following training options:

-t <var>TRAIN_DATASET_PATH</var>, --train_dataset_path <var>TRAIN_DATASET_PATH</var>

Absolute path to the training dataset.

--validation_dataset_path <var>VALIDATION_DATASET_PATH</var>

Absolute path to the validation dataset.

-n, --augment-noise

Enable noise augmentation for training. Default: disabled

--noise_profile_path <var>NOISE_PROFILE_PATH</var>

Absolute path to the noise profiles (wav files).

--tmpdir <var>TMPDIR</var>

Path to a directory where the processed data and checkpoints are temporarily stored. Default: /tmp

--logdir <var>LOGDIR</var>

Path to directory where training logs are stored for Tensorboard usage. Default: /logs

-o <var>MODEL_OUTPUT_PATH</var>, --model_output_path <var>MODEL_OUTPUT_PATH</var>

Path to directory where the trained model and metadata are stored.

-k <var>KEYWORDS_LIST</var>, --keywords_list <var>KEYWORDS_LIST</var>

List of keywords to be detected. Keywords in the list can be separated by commas. Examples: -k carter,look,stop or -k carter look -k stop

--keyword_duration <var>KEYWORD_DURATION</var>

Duration of keywords in seconds in the range [0.1, 1]. Default: 0.5

--training_epochs <var>TRAINING_EPOCHS</var>

Number of epochs to run the training. Default: 100

--batch_size <var>BATCH_SIZE</var>

Batch size used for training. Default: 32

--minimum_noise_gain <var>MINIMUM_NOISE_GAIN</var>

Minimum noise gain applied during noise augmentation. Default: 0.1

--maximum_noise_gain <var>MAXIMUM_NOISE_GAIN</var>

Maximum noise gain applied during noise augmentation. Default: 0.4

--learning_rate <var>LEARNING_RATE</var>, --lr <var>LEARNING_RATE</var>

Learning rate used for the Adamax optimizer. Default: 1e-5

--dropout <var>DROPOUT</var>

Dropout value used for training the network. Default: 0.3

--checkpoint <var>CHECKPOINT</var>

Keras checkpoint to be loaded to continue training. This assumes that the extracted features are available /features/ . Defaults to not loading checkpoints and starting fresh.

-e <var>EPOCH_NUMBER</var>, --epoch_number <var>EPOCH_NUMBER</var>

Epoch at which to start training when resuming from checkpoint. Default: 0

--gpu_memory_usage <var>GPU_MEMORY_USAGE</var>

Specifies a limit for the usage of GPU memory in the range [0, 1]. Default: 0 (no limit).

--config_filename <var>CONFIG_FILENAME</var>

Path to a JSON file containing all the configuration parameters. Command-line arguments take priority over values in this file.

-h, --help

Show help message and exit.
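The options above can be assembled into a full command line. The sketch below builds one from Python using only flags listed in this section; the function name and all paths are hypothetical placeholders, and passing --augment-noise together with --noise_profile_path is an assumption about how the two options are meant to be combined.

```python
import shlex


def build_training_command(train_dir, validation_dir, output_dir, keywords,
                           noise_dir=None, keyword_duration=0.5):
    """Build the bazel command line for the training application as a list."""
    args = [
        "bazel", "run", "apps/samples/voice_command_detection:training", "--",
        "--train_dataset_path", train_dir,
        "--validation_dataset_path", validation_dir,
        "--model_output_path", output_dir,
        "--keywords_list", ",".join(keywords),
        "--keyword_duration", str(keyword_duration),
    ]
    if noise_dir is not None:
        # Assumption: noise augmentation is enabled alongside the profile path.
        args += ["--augment-noise", "--noise_profile_path", noise_dir]
    return args


# Hypothetical paths, for illustration only.
command = build_training_command(
    "/data/voice/train", "/data/voice/validation", "/data/voice/model",
    ["carter", "look", "stop"], noise_dir="/data/voice/noise")
print(shlex.join(command))
```

The printed line can be pasted into a shell at the Isaac SDK root directory, or the list can be passed to `subprocess.run` with that directory as `cwd`.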

The training application generates the model and its corresponding metadata file in the specified output folder. To use these for Voice Command Detection, update the configuration of the application to point to this model and use the metadata file as a secondary configuration file.

The metadata file provides placeholders for the node names of each of the three codelets: Voice Command Feature Extraction, Tensorflow Inference, and Voice Command Construction. Update these placeholders with the corresponding node names.

Note that if two or all three of these codelets share the same node, merge them under a single node name. Providing the same node name separately for each of these codelets causes a mismatch in configuration.

© Copyright 2018-2020, NVIDIA Corporation. Last updated on Feb 1, 2023.