The Voice Command Detection feature identifies short commands in a user's speech. These commands are sequences of keywords and can be used to trigger corresponding actions. For example, the robot can detect the voice command "Carter, get popcorn" from the user and trigger the action of getting popcorn. This is a lightweight system that runs natively on Jetson platforms and recognizes short commands constructed from a limited set of keywords. This is different from a typical automatic speech recognition (ASR) system, which can recognize a large vocabulary and generally requires significant system resources.
This feature consists of a node with three codelets: Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction.
This feature uses an NVIDIA deep-learning architecture to detect keywords. The model for the required keywords must be trained using the Voice Command Training application.
The Voice Command Feature Extraction codelet receives audio packets as input (for example, from the Audio Capture codelet). It extracts spectral features from the audio packets using DSP algorithms.
The extracted features are:
- Mel spectrogram
- First-order delta of the Mel spectrogram
- Second-order delta of the Mel spectrogram
The computed features are normalized and stacked to form the output of this codelet.
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| audio_channel_index | Input audio packets can be multi-channel. This parameter specifies the channel index to use for voice command detection. | 0 |
| minimum_time_between_inferences | Minimum duration (in seconds) between two consecutive keyword-detection inferences. This value defines how frequently keyword detection runs. Range: [0.1, 1.0]. | 0.1 |
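As a reference, the snippet below sketches how these parameters might be set in an application JSON file. The node name voice_detection and the component name VoiceCommandFeatureExtraction are placeholders; use the names from your own application graph.

```json
{
  "config": {
    "voice_detection": {
      "VoiceCommandFeatureExtraction": {
        "audio_channel_index": 0,
        "minimum_time_between_inferences": 0.1
      }
    }
  }
}
```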
The parameters below are generated as a metadata file along with the model, after training. These should not vary from the trained configuration.
| Parameter | Description | Type |
|---|---|---|
| sample_rate | Supported sample rate of the audio packets. | Int |
| fft_length | Length of the Fourier transform window, as a number of samples. | Int |
| num_mels | Number of mel bins to be extracted. | Int |
| hop_size | Stride between consecutive Fourier transform windows. | Int |
| window_length | Length of the window of audio packets used for keyword detection, expressed as the number of time frames after computing the STFT with the above parameters. | Int |
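These values are written into the metadata file by the training application and should be taken from there verbatim. The snippet below only illustrates the shape of that configuration: the numbers are placeholders, not values from an actual trained model, and the node and component names are assumptions.

```json
{
  "config": {
    "voice_detection": {
      "VoiceCommandFeatureExtraction": {
        "sample_rate": 16000,
        "fft_length": 512,
        "num_mels": 64,
        "hop_size": 160,
        "window_length": 98
      }
    }
  }
}
```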
Messages
| Message | Proto Type | Name |
|---|---|---|
| Input | AudioDataProto | audio_packets |
| Output | TensorListProto | feature_tensors |
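In an application graph, the input channel is typically wired up with an edge. The sketch below assumes an audio capture node named audio_capture publishing on a channel named audio_capture and a feature extraction node named voice_detection; all of these names are assumptions and should be adjusted to your own graph.

```json
{
  "graph": {
    "edges": [
      {
        "source": "audio_capture/AudioCapture/audio_capture",
        "target": "voice_detection/VoiceCommandFeatureExtraction/audio_packets"
      }
    ]
  }
}
```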
The Voice Command Construction codelet calls into the Command Constructor algorithm. At each tick, the algorithm takes a list of keyword probabilities (ideally from the inference output) and identifies a command over a period of time, as shown in the following diagram:

Configuration Parameters
| Parameter | Description | Type |
|---|---|---|
| command_list | The list of commands to be detected. Each command is a string of keywords separated by spaces. All commands must start with the same keyword, and only trained keywords may be used in the commands. | List of Strings |
| command_ids | The IDs associated with the commands listed above, as a 1:1 mapping with command_list; this parameter must contain the same number of IDs as there are commands in command_list. The ID assigned to a command is present in the output message of the Voice Command Construction codelet when that command is detected, and can be used to trigger an action in the module receiving the message. The IDs need not be unique: for example, the commands "carter bring popcorn" and "carter get popcorn" could represent the same action and share a command ID. | List of Ints |
| max_frames_allowed_after_keyword_detected | Maximum number of audio windows to wait for a defined command after the trigger keyword is detected. | Int |
| probability_mean_window | Window size over which the keyword probability predictions are averaged. | Int |
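A possible configuration of these parameters, assuming a node named voice_command and two example commands (the frame and window values below are placeholders, not recommendations):

```json
{
  "config": {
    "voice_command": {
      "VoiceCommandConstruction": {
        "command_list": ["carter go", "carter stop"],
        "command_ids": [0, 1],
        "max_frames_allowed_after_keyword_detected": 5,
        "probability_mean_window": 2
      }
    }
  }
}
```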
The parameters below are generated as a metadata file along with the model after training. These should not vary from the trained configuration.
| Parameter | Description | Type |
|---|---|---|
| num_classes | The number of keywords. | Int |
| classes | The list of classes/keywords, in the same order as in the model inference output. | List of Strings |
| thresholds | The probability threshold for each class/keyword. | List of Floats |
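As with the feature extraction metadata, these values come from the trained model's metadata file. The sketch below uses the Carter sample keywords purely for illustration; the thresholds shown are placeholders rather than values from a real training run, and the node name is an assumption.

```json
{
  "config": {
    "voice_command": {
      "VoiceCommandConstruction": {
        "num_classes": 8,
        "classes": ["carter", "go", "stop", "ready", "yes", "no", "help", "tesla"],
        "thresholds": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      }
    }
  }
}
```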
Messages
| Message | Proto Type | Name |
|---|---|---|
| Input | TensorListProto | feature_tensors |
| Output | VoiceCommandDetectionProto | detected_command |
The keyword probabilities received as input are available in Sight as original_probabilities.<keyword>. The values obtained by normalizing these probabilities over a window are available as mean_probabilities.<keyword>. Thresholds are applied to the normalized probabilities; the results are available in Sight as thresholded_probabilities.<keyword>. For example, the probabilities for the keyword Carter are available as original_probabilities.carter, mean_probabilities.carter, and thresholded_probabilities.carter.

The accumulated probabilities of each command are available in Sight as accumulated_probabilities.<command>. For example, the probability for the command "carter stop" is available as accumulated_probabilities.carter stop.

The detected command ID is available in Sight as voice_command_id.
The voice command detection sample application demonstrates the voice command detection feature with sample pre-trained models. The application is packaged with two models: one for Carter and one for Kaya. These models are trained on a custom recorded dataset of US-accented speech. The Carter pre-trained model supports the following keywords:
- carter
- go
- stop
- ready
- yes
- no
- help
- tesla
The Kaya pre-trained model supports all of the above keywords except carter, which is replaced by kaya.
These keywords can be used in any combination to form a list of commands. This application is configured to detect the following commands:
- Carter
- Carter go
- Carter Stop
- Carter Yes
- Carter No
- Carter Ready
- Carter Help
- Carter Tesla
The application has Audio Capture, Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction components connected in that order. The audio capture component is configured for a 6-channel microphone array with audio captured at 16 kHz. The voice command feature extraction component is configured to use the first channel (index 0) of the audio packets for command detection. The pre-trained model has been trained on 16 kHz audio, so the sample_rate of the audio capture component must match it for accurate detection.
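The relevant configuration of the sample application therefore looks roughly like the snippet below. The node and component names are illustrative and may differ from the actual application files; the parameter values follow the description above (6-channel capture at 16 kHz, channel 0 used for detection).

```json
{
  "config": {
    "audio_capture": {
      "AudioCapture": {
        "sample_rate": 16000,
        "num_channels": 6
      }
    },
    "feature_extraction": {
      "VoiceCommandFeatureExtraction": {
        "audio_channel_index": 0
      }
    }
  }
}
```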
To use the application, connect a microphone to the host/device and set it as the default audio capture device in the system settings. Set the capture volume of the microphone to 100%. Configure the audio capture component (num_channels) and the voice command feature extraction component (audio_channel_index) to match the specifications of the connected microphone. Run the application and wait until the initialization of all the components is complete. The log message "Listening for command" is printed on the console once the application is ready to detect commands.
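As an example of the microphone-matching configuration mentioned above, a single-channel USB microphone might use entries along these lines (node and component names are again illustrative):

```json
{
  "config": {
    "audio_capture": {
      "AudioCapture": {
        "num_channels": 1
      }
    },
    "feature_extraction": {
      "VoiceCommandFeatureExtraction": {
        "audio_channel_index": 0
      }
    }
  }
}
```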
For reliable detection, it is important to speak slowly and clearly, with short pauses between keywords. Both the Carter and Kaya models are trained with office noise and work more reliably in environments with an SNR of 20 dB or higher.
The application plots the detected command ID in Sight. This plot is accessible in the Sight UI at http://localhost:3000 on the desktop or http://ROBOTIP:3000 on Jetson. The keyword probability plots can also be enabled from the Sight UI.
To use the Kaya pre-trained model, replace data = ["@voice_command_detection_model_carter"] with data = ["@voice_command_detection_model_kaya"] in the BUILD file of the application, and replace the occurrences of the word carter with kaya in the application's JSON file.
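After that replacement, the command-related entries in the JSON file refer to kaya instead of carter. For instance, the command list might end up looking like the sketch below; the node name is an assumption and the exact contents depend on the application's JSON file.

```json
{
  "config": {
    "command_construction": {
      "VoiceCommandConstruction": {
        "command_list": ["kaya", "kaya go", "kaya stop", "kaya yes", "kaya no", "kaya ready", "kaya help", "kaya tesla"]
      }
    }
  }
}
```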
Platforms: Desktop, Jetson TX/2, Jetson Xavier, Jetson Nano
Hardware: Any microphone
To use the voice command detection feature in your own application, train a model with the required keywords using the Voice Command Training application. The model and metadata file generated by the Voice Command Training application should be linked into your application configuration as outlined below:
1. Create three nodes for the three components: Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction.
2. Connect these three components in order. The input to this feature is connected to the voice command feature extraction component, and the output is obtained from the voice command construction component.
3. Configure the audio_channel_index and minimum_time_between_inferences parameters in the voice command feature extraction component.
4. Configure the model_file_path and config_file_path parameters in the TensorFlow inference component.
5. Configure the command_list, command_ids, and max_frames_allowed_after_keyword_detected parameters in the voice command construction component.
6. The metadata file provides placeholders for the node names of each of the three codelets (Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction). Update these placeholders with the corresponding node names. See the Sample Application for information on using a single node instead of three different nodes.
7. Add the metadata file as a secondary configuration file to your application, either by using the config_files parameter in the application's JSON file (see the Sample Application and the sketch after this list) or by passing it on the command line as explained in Running an Application.
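The TensorFlow inference paths and the secondary configuration file might be declared along these lines in the application JSON; the node name (voice_inference) and all file paths below are placeholders to be replaced with your own values.

```json
{
  "config_files": ["apps/my_app/voice_command_model_metadata.json"],
  "config": {
    "voice_inference": {
      "TensorflowInference": {
        "model_file_path": "apps/my_app/voice_command_model.pb",
        "config_file_path": "apps/my_app/voice_command_model_config.pb"
      }
    }
  }
}
```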
Note that the audio sample rate used in the Voice Command Training should match the sample rate of the incoming audio packets to this feature.