Voice Command Detection¶
The Voice Command Detection feature identifies short spoken commands in a user's speech. A command is a sequence of keywords and can be used to trigger a corresponding action. For example, the robot can detect the voice command “Carter, get popcorn” and trigger the action of fetching popcorn. This is a lightweight system that runs natively on Jetson platforms, recognizing short commands constructed from a limited set of keywords. It differs from a typical automatic speech recognition (ASR) system, which recognizes a large vocabulary and generally requires significant system resources.
This feature detects keywords using an NVIDIA deep-learning architecture. The model for the required keywords must be trained using the Voice Command Training application.
Voice Command Feature Extraction¶
The Voice Command Feature Extraction codelet receives audio packets as input (for example, from the Audio Capture codelet). It extracts spectral features from the audio packets using DSP algorithms.
The extracted features are:
- Mel-frequency Cepstral Coefficients (MFCC)
- First order Delta of the MFCC
- Second order Delta of the MFCC
The computed features are normalized and stacked to form the output of this codelet.
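The delta and normalization steps above can be sketched in Python. This is an illustrative toy example only: the actual codelet runs a native DSP pipeline, and its delta computation may use a regression window rather than the simple two-point difference shown here. The `mean` and `sigma` values below are made-up placeholders for the metadata generated at training time.

```python
def deltas(frames):
    """First-order difference between consecutive feature frames.

    A simple two-point difference; the SDK's implementation may differ.
    """
    return [
        [b - a for a, b in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]

def normalize(frames, mean, sigma):
    """Normalize each coefficient with the training-set mean and sigma."""
    return [
        [(x - m) / s for x, m, s in zip(frame, mean, sigma)]
        for frame in frames
    ]

# Toy example: 4 frames of 3 MFCCs each.
mfcc = [[1.0, 2.0, 3.0],
        [2.0, 2.0, 4.0],
        [4.0, 2.0, 5.0],
        [7.0, 2.0, 6.0]]

d1 = deltas(mfcc)    # first-order delta of the MFCCs
d2 = deltas(d1)      # second-order delta of the MFCCs
norm = normalize(mfcc, mean=[0.0, 2.0, 0.0], sigma=[1.0, 1.0, 1.0])
# The codelet would stack `norm`, `d1`, and `d2` to form its output.
```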
| Parameter | Description | Default |
|---|---|---|
| `audio_channel_index` | Input audio packets can be multi-channel. This parameter specifies the channel index to use for voice command detection. | 0 |
| `minimum_time_between_inferences` | Minimum duration, in seconds, between two consecutive keyword-detection inferences. This value defines how frequently keywords are detected. Range: [0.1, 1.0]. | 0.1 |
The parameters below are generated as a metadata file along with the model after training. They should not be changed from the trained configuration.
| Parameter | Description | Type |
|---|---|---|
| `sample_rate` | Supported sample rate of the audio packets. | Int |
| `fft_length` | Length of the Fourier transform window, in samples. | Int |
| `num_mels` | Number of mel bins to extract. | Int |
| `num_mfcc` | Number of Mel-frequency cepstral coefficients to compute. | Int |
| `start_coefficient` | Index of the first cepstral coefficient to compute. | Int |
| `hop_size` | Stride between consecutive Fourier transform windows, in samples. | Int |
| `window_length` | Length of the window of audio packets used for keyword detection, expressed as the number of time frames after computing the STFT with the above parameters. | Int |
| `mean` | Mean feature map computed from the training dataset. | List of Floats |
| `sigma` | Standard deviation of the feature map. | List of Floats |
Voice Command Construction¶
The Voice Command Construction codelet calls into the Command Constructor algorithm, which takes a list of keyword probabilities (typically the inference output) at each tick and identifies a command over a period of time, as shown in the following diagram:
| Parameter | Description | Type |
|---|---|---|
| `command_list` | The list of commands to be detected. Each command is a string of keywords separated by spaces. All commands must start with the same keyword, and only trained keywords should be used in the commands. | List of Strings |
| | The IDs associated with the commands listed above. This is a 1:1 mapping with `command_list`, so this parameter must contain the same number of IDs as there are commands in `command_list`. | List of Ints |
| `max_frames_allowed_after_keyword_detected` | Maximum number of audio windows to wait for a defined command after the trigger keyword is detected. | Int |

Each command must be assigned an ID, which is present in the output message of the Voice Command Construction codelet when that command is detected. The module receiving this message can use the ID to trigger an action. IDs need not be unique: for example, the two commands ‘carter bring popcorn’ and ‘carter get popcorn’ could represent the same action and share the same command ID.
The parameters below are generated as a metadata file along with the model after training. They should not be changed from the trained configuration.
| Parameter | Description | Type |
|---|---|---|
| `num_classes` | The number of keywords. | Int |
| `classes` | The list of classes/keywords, in the same order as in the model inference output. | List of Strings |
| `thresholds` | The probability threshold for each class/keyword. | List of Floats |
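A minimal sketch of how a command constructor could combine these parameters is shown below. The class names, thresholds, commands, and IDs are toy assumptions, and the SDK's native algorithm may differ in detail: here, at each tick the highest above-threshold keyword is appended to the running sequence once the trigger keyword has been seen, and the sequence is matched against `command_list` within the allowed window.

```python
# Hypothetical trained configuration (for illustration only).
classes = ["silence", "marvin", "stop", "zero"]
thresholds = [0.9, 0.5, 0.5, 0.5]
command_list = ["marvin stop zero"]
command_ids = [7]
max_frames_allowed_after_keyword_detected = 10

def detect(prob_frames):
    """Return the command ID if a command is completed, else None."""
    keywords = []
    frames_since_trigger = 0
    for probs in prob_frames:
        # Keep only classes whose probability clears the trained threshold.
        candidates = [(p, c) for p, c, t in zip(probs, classes, thresholds)
                      if p >= t]
        word = max(candidates)[0:2][1] if candidates else None
        if word and keywords and word == keywords[-1]:
            word = None  # ignore repeats of the same keyword
        if keywords:  # trigger keyword already seen: count down the window
            frames_since_trigger += 1
            if frames_since_trigger > max_frames_allowed_after_keyword_detected:
                return None  # command window expired
        # Start accumulating only at the trigger keyword.
        if word and (keywords or word == command_list[0].split()[0]):
            keywords.append(word)
            phrase = " ".join(keywords)
            if phrase in command_list:
                return command_ids[command_list.index(phrase)]
    return None

# One inference frame per tick: "marvin", then "stop", then "zero".
frames = [
    [0.1, 0.8, 0.1, 0.0],
    [0.2, 0.1, 0.7, 0.0],
    [0.1, 0.1, 0.1, 0.9],
]
```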
The keyword probabilities received as input are normalized after the thresholds are applied. The normalized probabilities are available in Sight as p_<keyword>; for example, the probability of the keyword Carter is available as p_Carter. The detected command ID is also available as a plot in Sight.
The voice command detection sample application demonstrates the feature with a sample pre-trained model. This model is trained on a subset of the Google Speech Commands dataset and supports a limited set of keywords (those used in the commands below). These keywords can be combined to form a list of commands; note that all commands must start with the same keyword. The application is configured to detect the following commands:
- Marvin stop sheila
- Marvin stop on zero
- Marvin right sheila
- Marvin up four
- Marvin down nine
- Marvin left down
- Marvin off zero
The application connects the Audio Capture, Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction components in that order. The audio capture component is configured for a 6-channel microphone array capturing audio at 16 kHz. The voice command feature extraction component is configured to use the first channel (index 0) of the audio packets for command detection. The pre-trained model has been trained on 16 kHz audio, so the sample_rate of the audio capture component must match it for accurate detection.
To use the application, connect a microphone to the host/device and set it as the default audio capture device in the system settings. Set the capture volume of the microphone to 100%. Configure the audio capture component (num_channels) and the voice command feature extraction component (audio_channel_index) to match the specifications of the connected microphone. Run the application and wait until all components have finished initializing. The log message “Listening for command” is printed on the console once the application is ready to detect commands. Whenever a command is detected, the command ID and the command itself are printed on the console. For reliable detection, speak slowly and clearly, with short pauses between keywords.
The application also plots the detected command ID. This plot is accessible in the Sight UI at http://localhost:3000 on the desktop or http://ROBOTIP:3000 on Jetson.
Platforms: Desktop, Jetson TX/2, Jetson Xavier, Jetson Nano
Hardware: Any microphone
Creating Your Own Application¶
To use the voice command detection feature in your own application, train a model with the required keywords using the Voice Command Training application. The model and metadata file generated by the training application should be linked into your application configuration as outlined below:
Connect the Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction components in that order. The input to this feature is connected to voice command feature extraction, and the output is obtained from voice command construction.
- Set the minimum_time_between_inferences parameter in the voice command feature extraction component.
- Set the config_file_path parameter in the TensorFlow inference component.
- Set the max_frames_allowed_after_keyword_detected parameter in the voice command construction component.
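A hypothetical configuration fragment for these components might look like the following. The node name (`voice_detection`), component names, parameter values, file path, and the name of the command-IDs parameter are all illustrative assumptions; use the names generated by your own application and metadata file.

```json
{
  "config": {
    "voice_detection": {
      "VoiceCommandFeatureExtraction": {
        "audio_channel_index": 0,
        "minimum_time_between_inferences": 0.25
      },
      "TensorflowInference": {
        "config_file_path": "apps/my_app/model_metadata.json"
      },
      "VoiceCommandConstruction": {
        "command_list": ["carter get popcorn", "carter bring popcorn"],
        "command_ids": [0, 0],
        "max_frames_allowed_after_keyword_detected": 5
      }
    }
  }
}
```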
The metadata file provides placeholders for the node names of each of the three codelets: Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction. Update these placeholders with the corresponding node names.
See Sample Application for information on using a single node instead of 3 different nodes.
Add the metadata file as a secondary configuration file to your application by using the config_files parameter in the application’s JSON file (see Sample Application) or by passing it on the command line as explained in Running an Application.
Note that the audio sample rate used in the Voice Command Training should match the sample rate of the incoming audio packets to this feature.