Natural Language Processing¶
Significant advances in the NLP field have been made over the past few years with most of the advances sharing one common thread: dramatically larger models trained on more data. BERT-large, for example, has 340 million parameters and GPT-2 has 1.5 billion parameters. Models of this size make inference tasks on a CPU impractical today, necessitating a scalable inference framework for NLP tasks on a GPU.
Riva Natural Language Processing (NLP) is a flexible sequence classification and sequence labeling toolkit. It takes text as input and runs a number of analysis algorithms on it, such as named entity recognition, intent classification, punctuation, and translation. Riva NLP is built on common text-processing models that can be adapted for multiple common NLP tasks.
Riva NLP enables the fast deployment of new task-specific NLP models without requiring additional development time for deployment.
Model Architectures¶
Bidirectional Encoder Representations from Transformers (BERT)¶
By pretraining a model like BERT in an unsupervised fashion, NLP practitioners are able to create application-specific models by simply adding a different “head” (or output layer) to the model and fine-tuning the augmented model with in-domain data for the desired task. Riva NLP enables deployment of models trained in this manner.
Riva NLP supports models that are BERT-based. Google’s BERT (Bidirectional Encoder Representations from Transformers) is, as the name implies, a transformer-based language model. After pre-training, adding a single output layer as needed for the downstream task allows the model to be fine-tuned and achieve state-of-the-art results (at the time) across a wide variety of disparate NLP tasks. While newer models have built on BERT’s success, its relative simplicity, modest parameter count, and strong task-specific performance make it a compelling choice for a latency-sensitive NLP deployment. Most fine-tuning tasks can run in a few hours on a single GPU. For more information about BERT, refer to the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper.
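The task-specific “head” described above is often just a single linear layer on top of BERT’s pooled [CLS] embedding. The following toy NumPy sketch shows the shape of that computation only; the dimensions, random weights, and variable names are illustrative, not taken from any real Riva or BERT checkpoint:

```python
import numpy as np

# Toy stand-in for BERT's pooled [CLS] embedding (hidden size 768, as in BERT-base).
rng = np.random.default_rng(0)
hidden_size, num_classes = 768, 3
cls_embedding = rng.normal(size=hidden_size)

# The task-specific "head": one linear layer mapping the embedding to class logits.
W = rng.normal(size=(hidden_size, num_classes))
b = np.zeros(num_classes)
logits = cls_embedding @ W + b

# Softmax turns logits into class probabilities; argmax is the prediction.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_class = int(np.argmax(probs))
```

During fine-tuning, only this small head starts from scratch; the pretrained encoder weights are merely adjusted, which is why fine-tuning converges in hours rather than the weeks needed for pretraining.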
Megatron¶
Megatron is a transformer model architecture inspired by BERT that is designed to scale up to billions of parameters. When training NLP models for deployment with Riva, you can select between standard BERT and Megatron. For more details on Megatron, refer to the Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper.
Services¶
The Riva Natural Language API exposes task-specific functions for popular NLP tasks, including question answering, entity recognition, intent recognition, and slot filling, as well as more generic NLP pipelines for custom use cases. The intent of this service is to allow you to design models for arbitrary use cases that simply conform to the input and output types specified in the service. For example, the ClassifyText function could be used for sentiment classification, domain recognition, or language identification.
Pipeline Configuration¶
Riva NLP supports five NLP tasks:
Text Classification
Token Classification (Named Entity Recognition)
Joint Intent and Slots
Question Answering (Extractive)
Punctuation and Capitalization
Riva NLP enables you to deploy your own custom-trained NLP models from TAO Toolkit via riva-build and riva-deploy. The model exported from TAO Toolkit is a .riva file, which is essentially a tar archive consisting of the checkpoints, in either ONNX or .pt (PyTorch) format, along with their necessary model configuration. This .riva file is passed as input to riva-build, along with the pipeline task name, to produce the deployment-specific RMIR (Riva Model Intermediate Representation).
The three NLP classification tasks (token_classification, intent_slot, and text_classification) support an optional parameter called --domain_name that enables you to name your custom models. This is useful if you plan to deploy multiple models of the same task.
For the intent_slot task, Riva also supports a parameter called --contextual that enables you to specify whether the model you are using is contextual. If --contextual is set to true, the Riva server prepends the previous intent to the input query, or intent_none if there is no previous intent. If set to false, the Riva server does not prepend anything to the input query. By default, Riva sets this field to true.
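The contextual behavior can be illustrated with a small sketch. Note that the exact separator and wire format the server uses internally are not documented here; this only shows the prepending logic:

```python
def prepend_context(query, previous_intent=None):
    """Mimic the contextual intent_slot behavior: prepend the previous
    intent to the query, or intent_none when there was no previous turn."""
    return f"{previous_intent or 'intent_none'} {query}"

# First turn of a dialog: no prior intent exists yet.
first_turn = prepend_context("what is the weather in santa clara")
# Follow-up turn: the previously detected intent provides context.
follow_up = prepend_context("what about tomorrow", previous_intent="weather")
```

Prepending the prior intent lets a single-turn classifier resolve elliptical follow-ups such as “what about tomorrow” without a full dialog-state tracker.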
Each task supports a set of command-line arguments that enables you to set your configuration via the CLI. Use the format riva-build <task name> -h to see the list of available CLI inputs for each task. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and the riva filename. Otherwise, this is unnecessary.
Text Classification¶
Text Classification is the task of classifying any given input text (or sequence of tokens) into a predefined set of classes. You can use this generic text classification task for various domain-specific use cases such as sentiment classification, topic classification, intent classification, and domain classification. To build the intermediate representation (RMIR) for text classification, run:
riva-build text_classification \
    --domain_name="<your custom domain name>" \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key>
For more information, refer to the Text Classification notebook, which is an end-to-end workflow for text classification. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.
Token Classification (Named Entity Recognition)¶
Given a sequence of tokens, or a sentence, Token Classification is the task of classifying every input token into a corresponding output class. As an example, a token can be classified as a person, location, product, etc. To build the intermediate representation (RMIR) for token classification, run:
riva-build token_classification \
    --domain_name="<your custom domain name>" \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key>
For more information, refer to the Token Classification notebook, an end-to-end workflow for NER. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.
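Token classification models emit one label per token; named entities are then recovered by merging contiguous labeled tokens into spans. A minimal sketch using IOB-style tags (the tag set and example labels here are illustrative, not Riva’s actual output format):

```python
def merge_entities(tokens, labels):
    """Merge per-token IOB labels (B-X begins an entity, I-X continues it,
    O is outside any entity) into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # close any entity still open
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)  # continue the open entity
        else:  # "O" (or a stray I- tag) closes any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Jensen", "Huang", "founded", "NVIDIA", "in", "Santa", "Clara"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
spans = merge_entities(tokens, labels)
```

The B-/I- distinction is what allows two adjacent entities of the same type to remain separate spans rather than being fused into one.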
Joint Intent and Slots¶
Joint Intent and Slot Classification is the task of classifying an intent and detecting all relevant slots (entities) for that intent in a query. For example, in the query “What is the weather in Santa Clara tomorrow morning?”, we would like to classify the query as a “weather” intent, and detect “Santa Clara” as a location slot and “tomorrow morning” as a date_time slot. Intents and slot names are usually task-specific and defined as labels in the training data. This is a common task when building domain-specific, task-oriented chatbots. To build the intermediate representation (RMIR) for intent slot classification, run:
riva-build intent_slot \
    --domain_name="<your custom domain name>" \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key>
For more information, refer to the Intent Slot Classification notebook, an end-to-end workflow for intent slot classification. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.
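For the weather query above, the desired joint output can be pictured as one intent label plus a list of slot spans. The dictionary below is a hypothetical result shape for illustration only, not the actual Riva response message:

```python
# Hypothetical result for "What is the weather in Santa Clara tomorrow morning?"
result = {
    "intent": "weather",
    "slots": [
        {"text": "Santa Clara", "label": "location"},
        {"text": "tomorrow morning", "label": "date_time"},
    ],
}

# Downstream dialog logic typically dispatches on the intent,
# then reads the slots it needs as key-value pairs.
slots_by_label = {s["label"]: s["text"] for s in result["slots"]}
```

Predicting intent and slots jointly, rather than with two separate models, lets the two decisions share one encoder pass and keeps them mutually consistent.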
Question Answering (Extractive)¶
Extractive Question Answering, also known as Reading Comprehension, is a task where the inputs are a passage and a question, and the model needs to find the best answer span in the given passage, if it exists. The most common dataset for Extractive Question Answering is the SQuAD dataset; however, you can fine-tune a model on your own question answering dataset and deploy it on Riva. To build the intermediate representation (RMIR) for Question Answering, run:
riva-build qa \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key>
For more information, refer to the Question Answering notebook, an end-to-end workflow for question answering. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.
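Under the hood, a BERT-style extractive QA head scores every passage token as a potential answer start and as a potential answer end; the answer is the highest-scoring valid span (end not before start). A toy NumPy sketch of that span selection; the passage and logits are hand-picked for illustration, not model output:

```python
import numpy as np

passage = ["Riva", "runs", "inference", "on", "NVIDIA", "GPUs"]
# Hand-picked start/end logits, one per passage token.
start_logits = np.array([0.1, 0.0, 0.2, 0.0, 3.0, 1.0])
end_logits   = np.array([0.0, 0.1, 0.0, 0.2, 0.5, 3.0])

# Exhaustively score all valid spans (end index >= start index).
best_score, best_span = -np.inf, None
for i in range(len(passage)):            # candidate start token
    for j in range(i, len(passage)):     # candidate end token
        score = start_logits[i] + end_logits[j]
        if score > best_score:
            best_score, best_span = score, (i, j)

answer = " ".join(passage[best_span[0]:best_span[1] + 1])
```

Production implementations additionally cap the span length (compare the `--token_classifier.max_answer_length` option in the help output below) and keep an n-best list rather than a single argmax.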
Punctuation and Capitalization¶
Given text devoid of any punctuation, the task of Punctuation and Capitalization fills in the necessary punctuation at the appropriate places and capitalizes words as necessary once the punctuation is filled in. For example, the text “what are the best restaurants in san jose” is converted to “What are the best restaurants in San Jose?”. This task is also used at the end of a speech recognition pipeline to augment its output with the necessary punctuation. To build the intermediate representation (RMIR) for punctuation and capitalization, run:
riva-build punctuation \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key>
For more information, refer to the Punctuation Capitalization notebook, an end-to-end workflow for punctuation and capitalization. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.
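The task’s input/output contract can be pictured as two per-token predictions: a punctuation mark (possibly none) to append after each token, and a capitalization flag. The sketch below applies such predictions to the example query; the predictions are hard-coded for illustration, not produced by a model:

```python
def apply_predictions(tokens, punct_after, capitalize):
    """Rebuild punctuated, capitalized text from per-token predictions:
    append each token's predicted punctuation mark and capitalize
    tokens the model flagged."""
    out = []
    for token, mark, cap in zip(tokens, punct_after, capitalize):
        word = token.capitalize() if cap else token
        out.append(word + mark)
    return " ".join(out)

tokens      = ["what", "are", "the", "best", "restaurants", "in", "san", "jose"]
punct_after = ["", "", "", "", "", "", "", "?"]   # mark appended after each token
caps        = [True, False, False, False, False, False, True, True]
restored = apply_predictions(tokens, punct_after, caps)
```

Framing the problem this way turns punctuation restoration into exactly the kind of token classification task the models above already handle, which is why it fits the same BERT-based pipeline.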
Riva-build Optional Parameters¶
For details about the parameters passed to riva-build to customize the NLP pipelines, issue:
riva-build <nlp_pipeline> -h
where <nlp_pipeline> is one of text_classification, token_classification, intent_slot, qa, or punctuation. The following list includes descriptions of all optional parameters currently recognized by riva-build:
usage: riva-build text_classification [-h] [-f]
[--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--model_api MODEL_API]
[--domain_name DOMAIN_NAME]
[--start_token START_TOKEN]
[--end_token END_TOKEN] [--tokenizer]
[--vocab VOCAB]
[--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE]
[--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE]
[--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE]
[--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE]
[--tokenizer.batching_type TOKENIZER.BATCHING_TYPE]
[--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING]
[--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT]
[--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
[--tokenizer.unk_token TOKENIZER.UNK_TOKEN]
[--tokenizer.to_lower TOKENIZER.TO_LOWER]
[--tokenizer.bos_token TOKENIZER.BOS_TOKEN]
[--tokenizer.eos_token TOKENIZER.EOS_TOKEN]
[--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE]
[--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE]
[--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE]
[--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE]
[--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE]
[--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING]
[--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT]
[--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS]
[--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE]
output_path source_path
[source_path ...]
Generate a Riva Model from a text_classification model trained with NVIDIA
NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--model_api MODEL_API
--domain_name DOMAIN_NAME
Add the domain name for your custom Text Classifcation
model. This will be used to identify the model during
inference
--start_token START_TOKEN
--end_token END_TOKEN
--tokenizer Tokenizer type
--vocab VOCAB Path for the Language Model vocab file. If not
specified, default will be selected which is BERT base
vocab
tokenizer:
--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE
--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE
--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--tokenizer.batching_type TOKENIZER.BATCHING_TYPE
--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING
Preserve ordering
--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--tokenizer.unk_token TOKENIZER.UNK_TOKEN
--tokenizer.to_lower TOKENIZER.TO_LOWER
--tokenizer.bos_token TOKENIZER.BOS_TOKEN
--tokenizer.eos_token TOKENIZER.EOS_TOKEN
language_model:
--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE
--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE
--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE
--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING
Preserve ordering
--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT
How many instances in a group
--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
usage: riva-build token_classification [-h] [-f]
[--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--model_api MODEL_API]
[--domain_name DOMAIN_NAME]
[--max_seq_length MAX_SEQ_LENGTH]
[--IOB IOB]
[--ignore_class IGNORE_CLASS]
[--start_token START_TOKEN]
[--end_token END_TOKEN] [--tokenizer]
[--vocab VOCAB] [--to_lower TO_LOWER]
[--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE]
[--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE]
[--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE]
[--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE]
[--tokenizer.batching_type TOKENIZER.BATCHING_TYPE]
[--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING]
[--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT]
[--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
[--tokenizer.unk_token TOKENIZER.UNK_TOKEN]
[--tokenizer.to_lower TOKENIZER.TO_LOWER]
[--tokenizer.bos_token TOKENIZER.BOS_TOKEN]
[--tokenizer.eos_token TOKENIZER.EOS_TOKEN]
[--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE]
[--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE]
[--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE]
[--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE]
[--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE]
[--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING]
[--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT]
[--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS]
[--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE]
[--label_tokens.max_sequence_idle_microseconds LABEL_TOKENS.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--label_tokens.max_batch_size LABEL_TOKENS.MAX_BATCH_SIZE]
[--label_tokens.min_batch_size LABEL_TOKENS.MIN_BATCH_SIZE]
[--label_tokens.opt_batch_size LABEL_TOKENS.OPT_BATCH_SIZE]
[--label_tokens.preferred_batch_size LABEL_TOKENS.PREFERRED_BATCH_SIZE]
[--label_tokens.batching_type LABEL_TOKENS.BATCHING_TYPE]
[--label_tokens.preserve_ordering LABEL_TOKENS.PRESERVE_ORDERING]
[--label_tokens.instance_group_count LABEL_TOKENS.INSTANCE_GROUP_COUNT]
[--label_tokens.max_queue_delay_microseconds LABEL_TOKENS.MAX_QUEUE_DELAY_MICROSECONDS]
[--label_tokens.classes LABEL_TOKENS.CLASSES]
[--detokenizer.max_sequence_idle_microseconds DETOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--detokenizer.max_batch_size DETOKENIZER.MAX_BATCH_SIZE]
[--detokenizer.min_batch_size DETOKENIZER.MIN_BATCH_SIZE]
[--detokenizer.opt_batch_size DETOKENIZER.OPT_BATCH_SIZE]
[--detokenizer.preferred_batch_size DETOKENIZER.PREFERRED_BATCH_SIZE]
[--detokenizer.batching_type DETOKENIZER.BATCHING_TYPE]
[--detokenizer.preserve_ordering DETOKENIZER.PRESERVE_ORDERING]
[--detokenizer.instance_group_count DETOKENIZER.INSTANCE_GROUP_COUNT]
[--detokenizer.max_queue_delay_microseconds DETOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
output_path source_path
[source_path ...]
Generate a Riva Model from a token_classification model trained with NVIDIA
NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--model_api MODEL_API
--domain_name DOMAIN_NAME
Add the domain name for your custom Entity
classification model. This will be used to identify
the model during inference
--max_seq_length MAX_SEQ_LENGTH
Max no. of tokens for an input query
--IOB IOB
--ignore_class IGNORE_CLASS
--start_token START_TOKEN
--end_token END_TOKEN
--tokenizer Tokenizer type
--vocab VOCAB Path for the Language Model vocab file. If not
specified, default will be selected which is BERT base
vocab
--to_lower TO_LOWER
tokenizer:
--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE
--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE
--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--tokenizer.batching_type TOKENIZER.BATCHING_TYPE
--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING
Preserve ordering
--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--tokenizer.unk_token TOKENIZER.UNK_TOKEN
--tokenizer.to_lower TOKENIZER.TO_LOWER
--tokenizer.bos_token TOKENIZER.BOS_TOKEN
--tokenizer.eos_token TOKENIZER.EOS_TOKEN
language_model:
--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE
--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE
--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE
--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING
Preserve ordering
--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT
How many instances in a group
--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
label_tokens:
--label_tokens.max_sequence_idle_microseconds LABEL_TOKENS.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--label_tokens.max_batch_size LABEL_TOKENS.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--label_tokens.min_batch_size LABEL_TOKENS.MIN_BATCH_SIZE
--label_tokens.opt_batch_size LABEL_TOKENS.OPT_BATCH_SIZE
--label_tokens.preferred_batch_size LABEL_TOKENS.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--label_tokens.batching_type LABEL_TOKENS.BATCHING_TYPE
--label_tokens.preserve_ordering LABEL_TOKENS.PRESERVE_ORDERING
Preserve ordering
--label_tokens.instance_group_count LABEL_TOKENS.INSTANCE_GROUP_COUNT
How many instances in a group
--label_tokens.max_queue_delay_microseconds LABEL_TOKENS.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--label_tokens.classes LABEL_TOKENS.CLASSES
detokenizer:
--detokenizer.max_sequence_idle_microseconds DETOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--detokenizer.max_batch_size DETOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--detokenizer.min_batch_size DETOKENIZER.MIN_BATCH_SIZE
--detokenizer.opt_batch_size DETOKENIZER.OPT_BATCH_SIZE
--detokenizer.preferred_batch_size DETOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--detokenizer.batching_type DETOKENIZER.BATCHING_TYPE
--detokenizer.preserve_ordering DETOKENIZER.PRESERVE_ORDERING
Preserve ordering
--detokenizer.instance_group_count DETOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--detokenizer.max_queue_delay_microseconds DETOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
usage: riva-build intent_slot [-h] [-f] [--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--domain_name DOMAIN_NAME]
[--model_api MODEL_API] [--IOB IOB]
[--ignore_class IGNORE_CLASS]
[--start_token START_TOKEN]
[--end_token END_TOKEN]
[--contextual CONTEXTUAL] [--tokenizer]
[--vocab VOCAB]
[--max_seq_length MAX_SEQ_LENGTH]
[--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE]
[--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE]
[--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE]
[--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE]
[--tokenizer.batching_type TOKENIZER.BATCHING_TYPE]
[--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING]
[--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT]
[--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
[--tokenizer.unk_token TOKENIZER.UNK_TOKEN]
[--tokenizer.to_lower TOKENIZER.TO_LOWER]
[--tokenizer.bos_token TOKENIZER.BOS_TOKEN]
[--tokenizer.eos_token TOKENIZER.EOS_TOKEN]
[--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE]
[--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE]
[--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE]
[--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE]
[--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE]
[--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING]
[--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT]
[--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS]
[--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE]
[--label_tokens.max_sequence_idle_microseconds LABEL_TOKENS.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--label_tokens.max_batch_size LABEL_TOKENS.MAX_BATCH_SIZE]
[--label_tokens.min_batch_size LABEL_TOKENS.MIN_BATCH_SIZE]
[--label_tokens.opt_batch_size LABEL_TOKENS.OPT_BATCH_SIZE]
[--label_tokens.preferred_batch_size LABEL_TOKENS.PREFERRED_BATCH_SIZE]
[--label_tokens.batching_type LABEL_TOKENS.BATCHING_TYPE]
[--label_tokens.preserve_ordering LABEL_TOKENS.PRESERVE_ORDERING]
[--label_tokens.instance_group_count LABEL_TOKENS.INSTANCE_GROUP_COUNT]
[--label_tokens.max_queue_delay_microseconds LABEL_TOKENS.MAX_QUEUE_DELAY_MICROSECONDS]
[--label_tokens.classes LABEL_TOKENS.CLASSES]
[--detokenizer.max_sequence_idle_microseconds DETOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--detokenizer.max_batch_size DETOKENIZER.MAX_BATCH_SIZE]
[--detokenizer.min_batch_size DETOKENIZER.MIN_BATCH_SIZE]
[--detokenizer.opt_batch_size DETOKENIZER.OPT_BATCH_SIZE]
[--detokenizer.preferred_batch_size DETOKENIZER.PREFERRED_BATCH_SIZE]
[--detokenizer.batching_type DETOKENIZER.BATCHING_TYPE]
[--detokenizer.preserve_ordering DETOKENIZER.PRESERVE_ORDERING]
[--detokenizer.instance_group_count DETOKENIZER.INSTANCE_GROUP_COUNT]
[--detokenizer.max_queue_delay_microseconds DETOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
output_path source_path [source_path ...]
Generate a Riva Model from a intent_slot model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--domain_name DOMAIN_NAME
Add the domain name for your custom Intent Slot model.
This will be used to identify the model during
inference
--model_api MODEL_API
--IOB IOB
--ignore_class IGNORE_CLASS
--start_token START_TOKEN
--end_token END_TOKEN
--contextual CONTEXTUAL
--tokenizer Tokenizer type
--vocab VOCAB Path for the Language Model vocab file. If not
specified, default will be selected which is BERT base
vocab
--max_seq_length MAX_SEQ_LENGTH
Default maximum sequence length used during finetuning
BERT model
tokenizer:
--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE
--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE
--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--tokenizer.batching_type TOKENIZER.BATCHING_TYPE
--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING
Preserve ordering
--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--tokenizer.unk_token TOKENIZER.UNK_TOKEN
--tokenizer.to_lower TOKENIZER.TO_LOWER
--tokenizer.bos_token TOKENIZER.BOS_TOKEN
--tokenizer.eos_token TOKENIZER.EOS_TOKEN
language_model:
--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE
--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE
--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE
--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING
Preserve ordering
--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT
How many instances in a group
--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in Mb) to use for model export
to TensorRT
label_tokens:
--label_tokens.max_sequence_idle_microseconds LABEL_TOKENS.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--label_tokens.max_batch_size LABEL_TOKENS.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--label_tokens.min_batch_size LABEL_TOKENS.MIN_BATCH_SIZE
--label_tokens.opt_batch_size LABEL_TOKENS.OPT_BATCH_SIZE
--label_tokens.preferred_batch_size LABEL_TOKENS.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--label_tokens.batching_type LABEL_TOKENS.BATCHING_TYPE
--label_tokens.preserve_ordering LABEL_TOKENS.PRESERVE_ORDERING
Preserve ordering
--label_tokens.instance_group_count LABEL_TOKENS.INSTANCE_GROUP_COUNT
How many instances in a group
--label_tokens.max_queue_delay_microseconds LABEL_TOKENS.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--label_tokens.classes LABEL_TOKENS.CLASSES
detokenizer:
--detokenizer.max_sequence_idle_microseconds DETOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in ms
--detokenizer.max_batch_size DETOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--detokenizer.min_batch_size DETOKENIZER.MIN_BATCH_SIZE
--detokenizer.opt_batch_size DETOKENIZER.OPT_BATCH_SIZE
--detokenizer.preferred_batch_size DETOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--detokenizer.batching_type DETOKENIZER.BATCHING_TYPE
--detokenizer.preserve_ordering DETOKENIZER.PRESERVE_ORDERING
Preserve ordering
--detokenizer.instance_group_count DETOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--detokenizer.max_queue_delay_microseconds DETOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
usage: riva-build qa [-h] [-f] [--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE] [--model_api MODEL_API]
[--task TASK] [--IOB IOB] [--ignore_class IGNORE_CLASS]
[--start_token START_TOKEN] [--end_token END_TOKEN]
[--max_seq_length MAX_SEQ_LENGTH] [--vocab VOCAB]
[--tokenizer]
[--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE]
[--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE]
[--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE]
[--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE]
[--tokenizer.batching_type TOKENIZER.BATCHING_TYPE]
[--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING]
[--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT]
[--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
[--tokenizer.unk_token TOKENIZER.UNK_TOKEN]
[--tokenizer.to_lower TOKENIZER.TO_LOWER]
[--tokenizer.bos_token TOKENIZER.BOS_TOKEN]
[--tokenizer.eos_token TOKENIZER.EOS_TOKEN]
[--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE]
[--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE]
[--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE]
[--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE]
[--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE]
[--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING]
[--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT]
[--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS]
[--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE]
[--token_classifier.max_sequence_idle_microseconds TOKEN_CLASSIFIER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--token_classifier.max_batch_size TOKEN_CLASSIFIER.MAX_BATCH_SIZE]
[--token_classifier.min_batch_size TOKEN_CLASSIFIER.MIN_BATCH_SIZE]
[--token_classifier.opt_batch_size TOKEN_CLASSIFIER.OPT_BATCH_SIZE]
[--token_classifier.preferred_batch_size TOKEN_CLASSIFIER.PREFERRED_BATCH_SIZE]
[--token_classifier.batching_type TOKEN_CLASSIFIER.BATCHING_TYPE]
[--token_classifier.preserve_ordering TOKEN_CLASSIFIER.PRESERVE_ORDERING]
[--token_classifier.instance_group_count TOKEN_CLASSIFIER.INSTANCE_GROUP_COUNT]
[--token_classifier.max_queue_delay_microseconds TOKEN_CLASSIFIER.MAX_QUEUE_DELAY_MICROSECONDS]
[--token_classifier.version_2_with_negative TOKEN_CLASSIFIER.VERSION_2_WITH_NEGATIVE]
[--token_classifier.n_best_size TOKEN_CLASSIFIER.N_BEST_SIZE]
[--token_classifier.max_answer_length TOKEN_CLASSIFIER.MAX_ANSWER_LENGTH]
output_path source_path [source_path ...]
Generate a Riva Model from a qa model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--model_api MODEL_API
--task TASK Single-input or multiple-input (qa) task
--IOB IOB
--ignore_class IGNORE_CLASS
--start_token START_TOKEN
--end_token END_TOKEN
--max_seq_length MAX_SEQ_LENGTH
Max no. of tokens for an input query
--vocab VOCAB Path for the Language Model vocab file. If not
specified, default will be selected which is BERT base
vocab
--tokenizer Tokenizer type
tokenizer:
--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in microseconds
--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE
--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE
--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--tokenizer.batching_type TOKENIZER.BATCHING_TYPE
--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING
Preserve ordering
--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--tokenizer.unk_token TOKENIZER.UNK_TOKEN
--tokenizer.to_lower TOKENIZER.TO_LOWER
--tokenizer.bos_token TOKENIZER.BOS_TOKEN
--tokenizer.eos_token TOKENIZER.EOS_TOKEN
language_model:
--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in microseconds
--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE
--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE
--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE
--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING
Preserve ordering
--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT
How many instances in a group
--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in MB) to use for model export
to TensorRT
token_classifier:
--token_classifier.max_sequence_idle_microseconds TOKEN_CLASSIFIER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in microseconds
--token_classifier.max_batch_size TOKEN_CLASSIFIER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--token_classifier.min_batch_size TOKEN_CLASSIFIER.MIN_BATCH_SIZE
--token_classifier.opt_batch_size TOKEN_CLASSIFIER.OPT_BATCH_SIZE
--token_classifier.preferred_batch_size TOKEN_CLASSIFIER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--token_classifier.batching_type TOKEN_CLASSIFIER.BATCHING_TYPE
--token_classifier.preserve_ordering TOKEN_CLASSIFIER.PRESERVE_ORDERING
Preserve ordering
--token_classifier.instance_group_count TOKEN_CLASSIFIER.INSTANCE_GROUP_COUNT
How many instances in a group
--token_classifier.max_queue_delay_microseconds TOKEN_CLASSIFIER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--token_classifier.version_2_with_negative TOKEN_CLASSIFIER.VERSION_2_WITH_NEGATIVE
--token_classifier.n_best_size TOKEN_CLASSIFIER.N_BEST_SIZE
--token_classifier.max_answer_length TOKEN_CLASSIFIER.MAX_ANSWER_LENGTH
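A typical invocation builds a QA pipeline from a NeMo-exported `.riva` archive into a Riva model intermediate representation (`.rmir`). The file paths and parameter values below are illustrative assumptions, not values prescribed by the tool:

```shell
# Illustrative example -- file names and parameter values are assumptions.
# Builds a Riva QA pipeline (.rmir) from a NeMo-exported BERT QA model (.riva).
riva-build qa \
    --force \
    --language_code=en-US \
    --max_seq_length=384 \
    --token_classifier.max_answer_length=30 \
    /servicemaker-dev/qa-pipeline.rmir \
    /servicemaker-dev/bert-qa.riva
```

The positional arguments come last: the output `.rmir` location first, then one or more source model files, matching the `output_path source_path [source_path ...]` signature above.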
usage: riva-build punctuation [-h] [-f] [--language_code LANGUAGE_CODE]
[--max_batch_size MAX_BATCH_SIZE]
[--model_api MODEL_API] [--tokenizer]
[--vocab VOCAB] [--to_lower TO_LOWER]
[--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE]
[--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE]
[--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE]
[--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE]
[--tokenizer.batching_type TOKENIZER.BATCHING_TYPE]
[--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING]
[--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT]
[--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS]
[--tokenizer.unk_token TOKENIZER.UNK_TOKEN]
[--tokenizer.to_lower TOKENIZER.TO_LOWER]
[--tokenizer.bos_token TOKENIZER.BOS_TOKEN]
[--tokenizer.eos_token TOKENIZER.EOS_TOKEN]
[--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE]
[--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE]
[--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE]
[--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE]
[--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE]
[--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING]
[--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT]
[--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS]
[--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE]
[--postprocessor.max_sequence_idle_microseconds POSTPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS]
[--postprocessor.max_batch_size POSTPROCESSOR.MAX_BATCH_SIZE]
[--postprocessor.min_batch_size POSTPROCESSOR.MIN_BATCH_SIZE]
[--postprocessor.opt_batch_size POSTPROCESSOR.OPT_BATCH_SIZE]
[--postprocessor.preferred_batch_size POSTPROCESSOR.PREFERRED_BATCH_SIZE]
[--postprocessor.batching_type POSTPROCESSOR.BATCHING_TYPE]
[--postprocessor.preserve_ordering POSTPROCESSOR.PRESERVE_ORDERING]
[--postprocessor.instance_group_count POSTPROCESSOR.INSTANCE_GROUP_COUNT]
[--postprocessor.max_queue_delay_microseconds POSTPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS]
[--postprocessor.punctuation_mapping_path POSTPROCESSOR.PUNCTUATION_MAPPING_PATH]
[--postprocessor.capitalization_mapping_path POSTPROCESSOR.CAPITALIZATION_MAPPING_PATH]
output_path source_path [source_path ...]
Generate a Riva Model from a punctuation model trained with NVIDIA NeMo.
positional arguments:
output_path Location to write compiled Riva pipeline
source_path Source file(s)
optional arguments:
-h, --help show this help message and exit
-f, --force Overwrite existing artifacts if they exist
--language_code LANGUAGE_CODE
Language of the model
--max_batch_size MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--model_api MODEL_API
--tokenizer Tokenizer type
--vocab VOCAB Path for the Language Model vocab file. If not
specified, default will be selected which is BERT base
vocab
--to_lower TO_LOWER
tokenizer:
--tokenizer.max_sequence_idle_microseconds TOKENIZER.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in microseconds
--tokenizer.max_batch_size TOKENIZER.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--tokenizer.min_batch_size TOKENIZER.MIN_BATCH_SIZE
--tokenizer.opt_batch_size TOKENIZER.OPT_BATCH_SIZE
--tokenizer.preferred_batch_size TOKENIZER.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--tokenizer.batching_type TOKENIZER.BATCHING_TYPE
--tokenizer.preserve_ordering TOKENIZER.PRESERVE_ORDERING
Preserve ordering
--tokenizer.instance_group_count TOKENIZER.INSTANCE_GROUP_COUNT
How many instances in a group
--tokenizer.max_queue_delay_microseconds TOKENIZER.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--tokenizer.unk_token TOKENIZER.UNK_TOKEN
--tokenizer.to_lower TOKENIZER.TO_LOWER
--tokenizer.bos_token TOKENIZER.BOS_TOKEN
--tokenizer.eos_token TOKENIZER.EOS_TOKEN
language_model:
--language_model.max_sequence_idle_microseconds LANGUAGE_MODEL.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in microseconds
--language_model.max_batch_size LANGUAGE_MODEL.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--language_model.min_batch_size LANGUAGE_MODEL.MIN_BATCH_SIZE
--language_model.opt_batch_size LANGUAGE_MODEL.OPT_BATCH_SIZE
--language_model.preferred_batch_size LANGUAGE_MODEL.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--language_model.batching_type LANGUAGE_MODEL.BATCHING_TYPE
--language_model.preserve_ordering LANGUAGE_MODEL.PRESERVE_ORDERING
Preserve ordering
--language_model.instance_group_count LANGUAGE_MODEL.INSTANCE_GROUP_COUNT
How many instances in a group
--language_model.max_queue_delay_microseconds LANGUAGE_MODEL.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--language_model.trt_max_workspace_size LANGUAGE_MODEL.TRT_MAX_WORKSPACE_SIZE
Maximum workspace size (in MB) to use for model export
to TensorRT
postprocessor:
--postprocessor.max_sequence_idle_microseconds POSTPROCESSOR.MAX_SEQUENCE_IDLE_MICROSECONDS
Global timeout, in microseconds
--postprocessor.max_batch_size POSTPROCESSOR.MAX_BATCH_SIZE
Default maximum parallel requests in a single forward
pass
--postprocessor.min_batch_size POSTPROCESSOR.MIN_BATCH_SIZE
--postprocessor.opt_batch_size POSTPROCESSOR.OPT_BATCH_SIZE
--postprocessor.preferred_batch_size POSTPROCESSOR.PREFERRED_BATCH_SIZE
Preferred batch size, must be smaller than Max batch
size
--postprocessor.batching_type POSTPROCESSOR.BATCHING_TYPE
--postprocessor.preserve_ordering POSTPROCESSOR.PRESERVE_ORDERING
Preserve ordering
--postprocessor.instance_group_count POSTPROCESSOR.INSTANCE_GROUP_COUNT
How many instances in a group
--postprocessor.max_queue_delay_microseconds POSTPROCESSOR.MAX_QUEUE_DELAY_MICROSECONDS
Maximum amount of time to allow requests to queue to
form a batch in microseconds
--postprocessor.punctuation_mapping_path POSTPROCESSOR.PUNCTUATION_MAPPING_PATH
--postprocessor.capitalization_mapping_path POSTPROCESSOR.CAPITALIZATION_MAPPING_PATH
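As with the QA pipeline, a punctuation build takes an output `.rmir` path followed by the source model file. The paths and flag values in this sketch are illustrative assumptions:

```shell
# Illustrative example -- file names and flag values are assumptions.
# Builds a Riva punctuation/capitalization pipeline from a NeMo-exported model.
riva-build punctuation \
    --force \
    --language_code=en-US \
    --tokenizer.to_lower=true \
    /servicemaker-dev/punctuation.rmir \
    /servicemaker-dev/punctuation.riva
```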
Pretrained Models¶
| Task | Architecture | Language | Dataset | Domain | Accuracy | Compatibility with TAO Toolkit 3.0-21.08 | Compatibility with NeMo 1.0.0b4 | Link |
|---|---|---|---|---|---|---|---|---|
| QA | BERT | English | SQuAD 2.0 | | EM: 71.24, F1: 74.32 | Yes | Yes | |
| QA | Megatron | English | SQuAD 2.0 | | TBM | Yes | Yes | |
| Entity Recognition | BERT | English | GMB (Groningen Meaning Bank) | LOC, ORG, PER, GPE, TIME, MISC, O | | Yes | Yes | |
| Punctuation/Capitalization | BERT | English | Tatoeba sentences; books from Project Gutenberg used in the LibriSpeech corpus; transcripts from Fisher English Training Speech | | | Yes | Yes | |
| Intent Detection & Slot Tagging | BERT | English | Proprietary | Weather | | Yes | Yes | |
| Text Classification | BERT | English | Proprietary | | | Yes | Yes | |
The aforementioned NLP tasks leverage large-scale pretrained models such as BERT-base, BERT-large, and Megatron-LM BERT; the respective pretrained models are hosted on NGC.