Natural Language Processing


By pretraining a model like BERT in an unsupervised fashion, NLP practitioners are able to create application-specific models by simply adding a different “head” (or output layer) to the model and fine-tuning the augmented model with in-domain data for the desired task. Riva NLP enables deployment of models trained in this manner.

Riva NLP supports models that are BERT-based. Google’s BERT (Bidirectional Encoder Representations from Transformers) is, as the name implies, a transformer-based language model. Once pretrained, the model can be fine-tuned with a single additional task-specific layer, achieving state-of-the-art results (at the time of publication) across a wide variety of disparate NLP tasks. While newer models have built on BERT’s success, its relative simplicity, modest parameter count, and strong task-specific performance make it a compelling choice for a latency-sensitive NLP deployment. Most fine-tuning tasks can run in a few hours on a single GPU. For more information about BERT, refer to the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper.
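The head-swapping pattern described above can be sketched with a toy NumPy example. The mean-pooling “encoder”, hidden size, and head dimensions here are illustrative stand-ins, not BERT’s or Riva’s actual implementation; the point is only that different task heads share one pretrained representation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained encoder: maps a sequence of token
# embeddings to a single pooled feature vector. Hidden size 8 is a toy
# value (BERT-base uses 768).
HIDDEN = 8

def encode(token_embeddings: np.ndarray) -> np.ndarray:
    # Mean-pool token embeddings as a toy stand-in for the encoder output.
    return token_embeddings.mean(axis=0)

# Two task-specific "heads": plain linear layers with different output
# sizes (e.g. 3 intent classes vs. 2 sentiment classes). During
# fine-tuning, these are the weights that are newly trained.
intent_head = rng.normal(size=(HIDDEN, 3))
sentiment_head = rng.normal(size=(HIDDEN, 2))

tokens = rng.normal(size=(5, HIDDEN))   # 5 tokens of a toy input sentence
features = encode(tokens)               # shared pretrained representation

# Swapping the head repurposes the same encoder for a different task.
intent_logits = features @ intent_head        # shape (3,)
sentiment_logits = features @ sentiment_head  # shape (2,)
```

In a real fine-tuning run the encoder weights are usually updated as well, but the architectural change is exactly this small: one new output layer per downstream task.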


DistilBERT, a distilled version of BERT, is a transformer model architecture that is smaller, faster, cheaper, and lighter than BERT. It has 40% fewer parameters, is 60% faster, and retains 97% of BERT’s language understanding capabilities. For more details on DistilBERT, refer to the DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter paper.
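To put the 40% figure in concrete terms, a quick back-of-the-envelope calculation (assuming BERT-base’s roughly 110M parameters, a figure from the BERT paper rather than this document):

```python
# Rough parameter-count arithmetic from the published figures.
bert_base_params = 110_000_000                       # ~110M, BERT-base (assumed)
distilbert_params = int(bert_base_params * (1 - 0.40))  # 40% fewer parameters
print(f"{distilbert_params / 1_000_000:.0f}M")          # prints "66M"
```

This lines up with DistilBERT’s reported size of about 66M parameters, which is what makes it attractive for memory-constrained targets.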

Due to its lower resource usage, DistilBERT is the preferred model for deployment on embedded platforms.


Megatron is a transformer model architecture inspired by BERT that is designed to scale up to billions of parameters. When training NLP models for deployment with Riva, you can select between standard BERT and Megatron. For more details on Megatron, refer to the Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper.