Voice Command Detection

The Voice Command Detection feature identifies short commands in a user's speech. Each command is a sequence of keywords and can be used to trigger a corresponding action. For example, the robot can detect the voice command “Carter, get popcorn” and trigger the action of fetching popcorn. This is a lightweight system that runs natively on Jetson platforms and recognizes short commands constructed from a limited set of keywords. It differs from a typical automatic speech recognition (ASR) system, which can recognize a large vocabulary but generally requires significant system resources.

This feature comprises a node with three codelets: Voice Command Feature Extraction, Tensorflow Inference, and Voice Command Construction.

This feature uses an NVIDIA deep-learning architecture to detect keywords. The model for the required keywords must be trained using the Voice Command Training application.

Voice Command Feature Extraction

The Voice Command Feature Extraction codelet receives audio packets as input (for example, from the Audio Capture codelet). It extracts spectral features from the audio packets using DSP algorithms.

The extracted features are:

  • Mel-frequency Cepstral Coefficients (MFCC)
  • First order Delta of the MFCC
  • Second order Delta of the MFCC

The computed features are normalized and stacked to form the output of this codelet.
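To make the stacking and normalization step concrete, here is a minimal Python sketch. It is not the SDK implementation: the function names and the simple central-difference delta are assumptions for illustration only.

```python
# Illustrative sketch (not the SDK code): computing first/second-order
# deltas of MFCC frames and normalizing with training-time statistics.

def deltas(frames):
    """First-order delta of per-frame coefficient vectors, using a
    simple central difference (edges padded by repetition)."""
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[i], padded[i + 2])]
        for i in range(len(frames))
    ]

def normalize(frames, mean, sigma):
    """Normalize each coefficient with the mean/sigma from the
    training metadata file."""
    return [
        [(c - m) / s for c, m, s in zip(frame, mean, sigma)]
        for frame in frames
    ]

# Toy example: 3 frames of 2 MFCC coefficients each.
mfcc = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
d1 = deltas(mfcc)   # first-order delta
d2 = deltas(d1)     # second-order delta
# Stack MFCC + delta + delta-delta into one feature map per frame.
stacked = [m + a + b for m, a, b in zip(mfcc, d1, d2)]
```

In the real codelet the normalization statistics come from the `mean` and `sigma` entries of the generated metadata file rather than being computed at runtime.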

Configuration Parameters

| Parameter | Description | Default |
|---|---|---|
| audio_channel_index | Input audio packets can be multi-channeled. This parameter specifies the channel index to use for voice command detection. | 0 |
| minimum_time_between_inferences | Minimum duration (in seconds) between two consecutive keyword-detection inferences. This value defines how frequently keywords are detected. Range: [0.1, 1.0] | 0.1 |

The parameters below are generated as a metadata file along with the model after training. They should not deviate from the trained configuration.

| Parameter | Description | Type |
|---|---|---|
| sample_rate | Supported sample rate of the audio packets. | Int |
| fft_length | Length of the Fourier transform window (in number of samples). | Int |
| num_mels | Number of mel bins to be extracted. | Int |
| num_mfcc | Number of Mel-frequency cepstral coefficients to be computed. | Int |
| start_coefficient | Index of the starting cepstral coefficient to be computed. | Int |
| hop_size | Stride between consecutive Fourier transform windows. | Int |
| window_length | Length of the window of audio packets used for keyword detection. This is the number of time frames after computing the STFT with the above parameters. | Int |
| mean | Mean feature map constructed from the training dataset. | List of Floats |
| sigma | Standard deviation of the feature map. | List of Floats |
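For illustration, such a metadata file could contain entries like the following. All values here are hypothetical; the actual file is generated by the Voice Command Training application, and the real `mean` and `sigma` lists contain one entry per feature coefficient:

```json
{
  "sample_rate": 16000,
  "fft_length": 512,
  "num_mels": 40,
  "num_mfcc": 13,
  "start_coefficient": 0,
  "hop_size": 160,
  "window_length": 98,
  "mean": [12.3, -4.5, 0.7],
  "sigma": [3.1, 2.2, 1.4]
}
```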


| Message | Proto Type | Name |
|---|---|---|
| Input | AudioDataProto | audio_packets |
| Output | TensorListProto | feature_tensors |

Voice Command Construction

The Voice Command Construction codelet calls into the Command Constructor algorithm. The algorithm takes a list of keyword probabilities (typically the inference output) at each tick and identifies a command over a period of time, as shown in the following diagram:


Configuration Parameters

| Parameter | Description | Type |
|---|---|---|
| command_list | The list of commands to be detected. Each command is a string of keywords separated by spaces. All commands must start with the same keyword, and only trained keywords may be used in the commands. | List of Strings |
| command_ids | The IDs associated with the commands in command_list. This is a 1:1 mapping with command_list, so this parameter must contain as many IDs as there are commands. Each command's ID is included in the output message of the Voice Command Construction codelet when that command is detected, and can be used by the module receiving the message to trigger an action. The IDs need not be unique: for example, the commands ‘carter bring popcorn’ and ‘carter get popcorn’ could represent the same action and share a command ID. | List of Ints |
| max_frames_allowed_after_keyword_detected | Maximum number of audio windows to wait for a defined command after the trigger keyword is detected. | Int |
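Putting the algorithm and the parameters above together, here is a minimal Python sketch of how command construction could work. This is an assumed illustration, not the SDK's implementation: the class and function names, the tie-breaking rule, and the prefix-matching behavior are all simplifications.

```python
# Illustrative sketch of the command-construction logic: consume one
# vector of keyword probabilities per tick, apply per-keyword
# thresholds, and match a command starting with the trigger keyword.

def detect_keyword(probs, classes, thresholds):
    """Return the most probable keyword above its threshold, if any."""
    best = None
    for name, p, t in zip(classes, probs, thresholds):
        if p >= t and (best is None or p > best[1]):
            best = (name, p)
    return best[0] if best else None

class CommandConstructor:
    def __init__(self, classes, thresholds, command_list, command_ids,
                 max_frames_after_trigger):
        self.classes = classes
        self.thresholds = thresholds
        self.commands = dict(zip(command_list, command_ids))
        self.trigger = command_list[0].split()[0]  # shared first keyword
        self.max_frames = max_frames_after_trigger
        self.heard = []       # keywords collected since the trigger
        self.frames_left = 0  # countdown of windows after the trigger

    def tick(self, probs):
        """Process one inference output; return a command ID or None."""
        keyword = detect_keyword(probs, self.classes, self.thresholds)
        if not self.heard:
            if keyword == self.trigger:
                self.heard = [keyword]
                self.frames_left = self.max_frames
            return None
        self.frames_left -= 1
        if keyword and keyword != self.heard[-1]:
            self.heard.append(keyword)
        candidate = " ".join(self.heard)
        if candidate in self.commands:
            cid = self.commands[candidate]
            self.heard = []
            return cid
        if self.frames_left <= 0:
            self.heard = []  # timed out waiting for a full command
        return None
```

For example, with `command_list=["marvin stop"]` and `command_ids=[7]`, feeding a tick where “marvin” is confident and then a tick where “stop” is confident would yield command ID 7.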

The parameters below are generated as a metadata file along with the model after training. They should not deviate from the trained configuration.

| Parameter | Description | Type |
|---|---|---|
| num_classes | The number of keywords. | Int |
| classes | The list of classes/keywords, in the same order as in the output of model inference. | List of Strings |
| thresholds | The probability thresholds per class/keyword. | List of Floats |


| Message | Proto Type | Name |
|---|---|---|
| Input | TensorListProto | feature_tensors |
| Output | VoiceCommandDetectionProto | detected_command |

Sight Variables

The keyword probabilities received as input are normalized after applying thresholds. These normalized probabilities are available in Sight as p_<keyword>. For example, probability of the keyword Carter will be available as p_carter.
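A minimal sketch of this threshold-then-normalize step (assumed for illustration, not the SDK code) could look like this:

```python
# Illustrative sketch: zero out keyword probabilities below their
# thresholds, then renormalize so the surviving values sum to 1.
# The p_<keyword> naming mirrors the Sight variable names.

def sight_probabilities(probs, classes, thresholds):
    kept = [p if p >= t else 0.0 for p, t in zip(probs, thresholds)]
    total = sum(kept)
    if total == 0.0:
        return {f"p_{c}": 0.0 for c in classes}
    return {f"p_{c}": p / total for c, p in zip(classes, kept)}
```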

The detected command ID is available in Sight as voice_command_id.

Sample Application

The voice command detection sample application demonstrates the voice command detection feature with a sample pre-trained model. The model is trained on a subset of the Google Speech Commands dataset and supports the following keywords:

| | | | |
|---|---|---|---|
| Marvin | One | Six | Left |
| Sheila | Two | Seven | Right |
| On | Three | Eight | Up |
| Off | Four | Nine | Down |
| Stop | Five | Zero | |

These keywords can be combined in any order to form a list of commands, but note that all commands must start with the same keyword. This application is configured to detect the following commands:

  • Marvin stop sheila
  • Marvin stop on zero
  • Marvin right sheila
  • Marvin up four
  • Marvin down nine
  • Marvin left down
  • Marvin off zero

The application connects the Audio capture, Voice command feature extraction, Tensorflow inference, and Voice command construction components in that order. The audio capture component is configured for a 6-channel microphone array capturing at 16 kHz. The voice command feature extraction component is configured to use the first channel (index 0) of the audio packets for command detection. The pre-trained model has been trained on 16 kHz audio, so the sample_rate of the audio capture component must match it for accurate detection.

To use the application, connect a microphone to the host/device and set it as the default audio capture device in the system settings. Set the capture volume of the microphone to 100%. Configure the audio capture component (num_channels) and the voice command feature extraction component (audio_channel_index) to match the specifications of the connected microphone. Run the application and wait until all components have initialized. The log message “Listening for command” is printed to the console once the application is ready to detect commands. Whenever a command is detected, the command ID and the command itself are printed to the console. For reliable detection, speak slowly and clearly, with brief pauses between keywords.

The application also plots the detected command ID. This plot is accessible in the Sight UI at http://localhost:3000 for the desktop or http://ROBOTIP:3000 for Jetson.

Platforms: Desktop, Jetson TX2, Jetson Xavier, Jetson Nano

Hardware: Any microphone

Creating Your Own Application

To use the voice command detection feature in your own application, train a model with the required keywords using the Voice Command Training application. The model and metadata file generated by the training application should then be linked into your application configuration as outlined below:

  1. Create 3 nodes for the 3 components: Voice Command Feature Extraction, Tensorflow Inference and Voice Command Construction.

  2. Connect these 3 components in order. The input to this feature is connected to the voice command feature extraction component, and the output is obtained from the voice command construction component.

  3. Configure the audio_channel_index and minimum_time_between_inferences parameters in the voice command feature extraction component.

  4. Configure the model_file_path and config_file_path parameters in the TensorFlow inference component.

  5. Configure the command_list, command_ids and max_frames_allowed_after_keyword_detected parameters in the voice command construction component.

  6. The metadata file provides placeholders for the node names of each of the 3 codelets - Voice Command Feature Extraction, Tensorflow Inference and Voice Command Construction. Update these placeholders with the corresponding node names.

    See Sample Application for information on using a single node instead of 3 different nodes.

  7. Add the metadata file as a secondary configuration file to your application by using the config_files parameter in the application’s JSON file (see Sample Application) or by passing it on the command line as explained in Running an Application.

  8. Note that the audio sample rate used in the Voice Command Training should match the sample rate of the incoming audio packets to this feature.
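As a sketch of steps 1-5, an application configuration could contain fragments like the following. The node and component names, file paths, and parameter values here are placeholders chosen for illustration; substitute the names used in your own application and the files produced by training:

```json
{
  "graph": {
    "edges": [
      {
        "source": "feature_extraction/extractor/feature_tensors",
        "target": "inference/model/input_tensors"
      },
      {
        "source": "inference/model/output_tensors",
        "target": "command_construction/constructor/feature_tensors"
      }
    ]
  },
  "config": {
    "feature_extraction": {
      "extractor": {
        "audio_channel_index": 0,
        "minimum_time_between_inferences": 0.1
      }
    },
    "inference": {
      "model": {
        "model_file_path": "path/to/model",
        "config_file_path": "path/to/config"
      }
    },
    "command_construction": {
      "constructor": {
        "command_list": ["carter get popcorn", "carter bring popcorn"],
        "command_ids": [0, 0],
        "max_frames_allowed_after_keyword_detected": 25
      }
    }
  }
}
```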