Voice Command Detection

The Voice Command Detection feature identifies short voice commands in the user's speech. Each command is a sequence of keywords and can be used to trigger a corresponding action. For example, the robot can detect the voice command “Carter, get popcorn” from the user and trigger the action of getting popcorn. This is a lightweight system that runs natively on Jetson platforms and recognizes short commands constructed from a limited set of keywords. It differs from a typical automatic speech recognition (ASR) system, which can recognize a large vocabulary but generally requires significant system resources.

This feature consists of a node with 3 codelets: Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction.

This feature uses an NVIDIA deep-learning architecture to detect keywords. The model for the required keywords must be trained using the Voice Command Training application.

The Voice Command Feature Extraction codelet receives audio packets as input (for example, from the Audio Capture codelet). It extracts spectral features from the audio packets using DSP algorithms.

The extracted features are:

  • Mel spectrogram

  • First-order delta of the Mel spectrogram

  • Second-order delta of the Mel spectrogram

The computed features are normalized and stacked to form the output of this codelet.

Configuration Parameters

audio_channel_index (default: 0)
    Input audio packets can be multi-channeled. This parameter specifies the channel index to be used for voice command detection.

minimum_time_between_inferences (default: 0.1)
    Minimum duration (in seconds) between two consecutive keyword-detection inferences. This value defines how frequently keywords are detected. Range: [0.1, 1.0].
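For example, assuming the feature-extraction codelet runs in a node named voice_detection under a component named VoiceCommandFeatureExtraction (both names are placeholders for this sketch), these parameters could be set in the config section of the application's JSON file roughly as follows:

    "config": {
      "voice_detection": {
        "VoiceCommandFeatureExtraction": {
          "audio_channel_index": 0,
          "minimum_time_between_inferences": 0.2
        }
      }
    }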

The following parameters are generated in a metadata file along with the model after training. They should not differ from the trained configuration.

sample_rate (Int)
    Supported sample rate of the audio packets.

fft_length (Int)
    Length of the Fourier-transform window, in samples.

num_mels (Int)
    Number of mel bins to be extracted.

hop_size (Int)
    Stride between consecutive Fourier-transform windows, in samples.

window_length (Int)
    Length of the window of audio packets used for keyword detection, expressed as the number of time frames after computing the STFT with the above parameters.
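The generated metadata file is itself a configuration file. A hypothetical excerpt for this codelet might look like the following; the node name is a placeholder and the values are purely illustrative, since the actual values are written by the training application and should not be edited:

    "voice_detection": {
      "VoiceCommandFeatureExtraction": {
        "sample_rate": 16000,
        "fft_length": 512,
        "num_mels": 40,
        "hop_size": 160,
        "window_length": 98
      }
    }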

Messages

Input: audio_packets (AudioDataProto)
Output: feature_tensors (TensorListProto)
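As a sketch of how these channels are wired, assume an audio source node named audio_capture with an output channel audio_capture, and a TensorFlow inference node named inference with an input channel input_tensors (all of these surrounding names are assumptions for illustration). The edges in the application's JSON file would then look roughly like:

    "edges": [
      {
        "source": "audio_capture/AudioCapture/audio_capture",
        "target": "voice_detection/VoiceCommandFeatureExtraction/audio_packets"
      },
      {
        "source": "voice_detection/VoiceCommandFeatureExtraction/feature_tensors",
        "target": "inference/TensorflowInference/input_tensors"
      }
    ]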

The Voice Command Construction codelet calls into the Command Constructor algorithm. At each tick, the algorithm takes a list of keyword probabilities (typically the inference output) and identifies a command over a period of time, as shown in the following diagram:

[Image: command_construction.png]

Configuration Parameters

command_list (List of Strings)
    The list of commands to be detected. Each command is a string of keywords separated by spaces. All commands must start with the same keyword, and only trained keywords should be used in the commands.

command_ids (List of Ints)
    The IDs associated with the commands above, in a 1:1 mapping with command_list, so this parameter must contain as many IDs as there are commands. The ID assigned to a command is present in the output message of the Voice Command Construction codelet when that command is detected and can be used by the receiving module to trigger an action. The IDs need not be unique; for example, the commands ‘carter bring popcorn’ and ‘carter get popcorn’ could represent the same action and share the same command ID.

max_frames_allowed_after_keyword_detected (Int)
    Maximum number of audio windows to wait for a defined command after the trigger keyword is detected.

probability_mean_window (Int)
    Window size over which the keyword probability predictions are averaged.
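As an illustration, the two example commands mentioned above could be configured as follows; the node name and the values of the last two parameters are placeholders:

    "voice_detection": {
      "VoiceCommandConstruction": {
        "command_list": ["carter bring popcorn", "carter get popcorn"],
        "command_ids": [0, 0],
        "max_frames_allowed_after_keyword_detected": 50,
        "probability_mean_window": 5
      }
    }

Both commands share command ID 0 because they represent the same action.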

The following parameters are generated in a metadata file along with the model after training. They should not differ from the trained configuration.

num_classes (Int)
    The number of keywords.

classes (List of Strings)
    The list of classes/keywords, in the same order as in the model inference output.

thresholds (List of Floats)
    The probability thresholds per class/keyword.
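A hypothetical excerpt of the corresponding entries in the generated metadata file might look like the following; the node name, keywords, and values are illustrative only, as the actual entries are produced by the training application:

    "voice_detection": {
      "VoiceCommandConstruction": {
        "num_classes": 2,
        "classes": ["carter", "stop"],
        "thresholds": [0.4, 0.4]
      }
    }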

Messages

Input: feature_tensors (TensorListProto)
Output: detected_command (VoiceCommandDetectionProto)

The keyword probabilities received as input are available in Sight as original_probabilities.<keyword>. The values obtained by averaging these probabilities over a window are available as mean_probabilities.<keyword>. Thresholds are then applied to the averaged probabilities, and the results are available in Sight as thresholded_probabilities.<keyword>. For example, the probabilities of the keyword carter are available as original_probabilities.carter, mean_probabilities.carter, and thresholded_probabilities.carter.

The accumulated probabilities of each command are available in Sight as accumulated_probabilities.<command>. For example, the probability for the command “carter stop” is available as accumulated_probabilities.carter stop.

The detected command ID is available in Sight as voice_command_id.

The voice command detection sample application demonstrates the voice command detection feature with sample pre-trained models. The application is packaged with two models: one each for Carter and Kaya. These models are trained on a custom recorded dataset of US-accented speech. The Carter pre-trained model supports the following keywords:

carter, go, stop, ready, yes, no, help, tesla

The Kaya pre-trained model supports all of the above keywords except for carter, which is replaced by kaya.

These keywords can be used in any combination to form a list of commands. This application is configured to detect the following commands (a sketch of the corresponding configuration follows the list):

  • Carter

  • Carter go

  • Carter stop

  • Carter yes

  • Carter no

  • Carter ready

  • Carter help

  • Carter tesla
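The corresponding command configuration of the sample application could be sketched as follows; the node and component names as well as the ID values are illustrative:

    "voice_detection": {
      "VoiceCommandConstruction": {
        "command_list": ["carter", "carter go", "carter stop", "carter yes",
                         "carter no", "carter ready", "carter help", "carter tesla"],
        "command_ids": [1, 2, 3, 4, 5, 6, 7, 8]
      }
    }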

The application has Audio Capture, Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction components connected in that order. The audio capture component is configured for a 6-channel microphone array capturing audio at 16 kHz. The voice command feature extraction component is configured to use the first channel (index 0) of the audio packets for command detection. The pre-trained model has been trained on 16 kHz audio, so the sample_rate of the audio capture component should match it for accurate detection.
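A simplified sketch of how these four components might be declared in a single node of the application graph is shown below. The component type names are assumptions for illustration, and required bookkeeping components (such as a message ledger) are omitted:

    "graph": {
      "nodes": [
        {
          "name": "voice_detection",
          "components": [
            { "name": "AudioCapture", "type": "isaac::audio::AudioCapture" },
            { "name": "VoiceCommandFeatureExtraction",
              "type": "isaac::audio::VoiceCommandFeatureExtraction" },
            { "name": "TensorflowInference", "type": "isaac::ml::TensorflowInference" },
            { "name": "VoiceCommandConstruction",
              "type": "isaac::audio::VoiceCommandConstruction" }
          ]
        }
      ]
    }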

To use the application, connect a microphone to the host/device and set it as the default audio capture device in the system settings. Set the capture volume of the microphone to 100%. Configure the audio capture component (num_channels) and the voice command feature extraction component (audio_channel_index) to match the specifications of the connected microphone, as sketched below. Run the application and wait until the initialization of all the components is complete. The log message “Listening for command” is printed on the console once the application is ready to detect commands.
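For instance, a hypothetical stereo microphone whose second channel should be used could be configured as follows (node and component names as in the sketches above):

    "voice_detection": {
      "AudioCapture": {
        "num_channels": 2,
        "sample_rate": 16000
      },
      "VoiceCommandFeatureExtraction": {
        "audio_channel_index": 1
      }
    }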

For reliable detection, speak slowly and clearly, with brief pauses between keywords. Both the Carter and Kaya models are trained with office noise and work more reliably in environments with an SNR of 20 dB or more.

The application plots the detected command ID in Sight. This plot is accessible in the Sight UI at http://localhost:3000 for the desktop or http://ROBOTIP:3000 for Jetson.

The keyword probability plots can also be enabled from Sight UI.

To use the Kaya pre-trained model, replace data = ["@voice_command_detection_model_carter"] with data = ["@voice_command_detection_model_kaya"] in the BUILD file of the application, and replace the occurrences of the word carter with kaya in the application’s JSON file.

Platforms: Desktop, Jetson TX2, Jetson Xavier, Jetson Nano

Hardware: Any microphone

To use the voice command detection feature in your own application, train a model with the required keywords using the Voice Command Training application. The model and metadata file generated by the Voice Command Training application should be linked into your application as outlined below:

  1. Create 3 nodes for the 3 components: Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction.

  2. Connect these 3 components in order. The input to this feature is connected to the voice command feature extraction component, and the output is obtained from the voice command construction component.

  3. Configure the audio_channel_index and minimum_time_between_inferences parameters in the voice command feature extraction component.

  4. Configure the model_file_path and config_file_path parameters in the TensorFlow inference component (see the configuration sketch after these steps).

  5. Configure the command_list, command_ids and max_frames_allowed_after_keyword_detected parameters in the voice command construction component.

  6. The metadata file provides placeholders for the node names of each of the 3 codelets: Voice Command Feature Extraction, TensorFlow Inference, and Voice Command Construction. Update these placeholders with the corresponding node names.

    See Sample Application for information on using a single node instead of 3 different nodes.

  7. Add the metadata file as a secondary configuration file to your application by using the config_files parameter in the application’s JSON file (see Sample Application) or by passing it on the command line as explained in Running an Application.

  8. Note that the audio sample rate used in the Voice Command Training should match the sample rate of the incoming audio packets to this feature.
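Putting steps 4 and 7 together, the relevant portion of your application's JSON file might look roughly like the sketch below. The file paths and the node name are placeholders; the metadata file is the one produced by the Voice Command Training application:

    {
      "config_files": ["apps/my_app/voice_command_metadata.config.json"],
      "config": {
        "inference": {
          "TensorflowInference": {
            "model_file_path": "apps/my_app/voice_command_model.pb",
            "config_file_path": "apps/my_app/voice_command_model_config.json"
          }
        }
      }
    }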

© Copyright 2018-2020, NVIDIA Corporation. Last updated on Oct 31, 2023.