Overview

This solution guide outlines how to build an AI pipeline on NVIDIA AI Enterprise, using a Natural Language Processing (NLP) use case as the example.

A library of pre-trained models is available through NVIDIA’s NGC catalog for use with the NVIDIA AI Enterprise software suite; these models can be fine-tuned on your own datasets using the NVIDIA AI Enterprise TensorFlow containers. This guide shows how AI practitioners can use virtual machines (VMs) running on mainstream NVIDIA-Certified Systems to run training that starts from pre-trained models. These VMs are based on templates that contain the BERT TensorFlow container from the NVIDIA NGC catalog. Using a sample Jupyter notebook, the model is trained, saved, and then converted to TensorRT for best performance. The model is then deployed to production for inference using the Triton Inference Server.

Deep learning typically requires a massive amount of data to train a model with millions of parameters so that its performance and accuracy are suitable for real-world use cases. Not every customer use case has access to that much data, however. Pre-trained models that have been trained on a large amount of generalized data can be fine-tuned on much smaller datasets to reach the accuracy needed for the customer-specific use case.

The neural network model is optimized for deployment using TensorRT and then deployed on VMs using Triton Inference Server, where it serves end users whose customer-facing applications rely on the server.

Triton Inference Server

Triton Inference Server is a deployment solution for inference on GPUs or CPUs that simplifies inference deployment without compromising performance. It can deploy models from TensorFlow, PyTorch, ONNX, and TensorRT. Converting models to TensorRT format is recommended for the best performance.

What is TensorRT?

The core of NVIDIA TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. It is designed to complement training frameworks such as TensorFlow and PyTorch, and it focuses specifically on running an already-trained network quickly and efficiently on an NVIDIA GPU. TensorRT optimizes the network by combining layers and optimizing kernel selection for improved latency, throughput, power efficiency, and memory consumption. If the application requests it, TensorRT can also optimize the network to run in lower precision, further increasing performance and reducing memory requirements.
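To make the memory side of lower precision concrete, here is a small sketch that uses only the Python standard library. It does not call TensorRT; it simply stores the same hypothetical weight values as 32-bit and 16-bit floats and compares the storage size and the rounding error that the narrower format introduces.

```python
import struct

# Hypothetical weight values, chosen only for illustration.
weights = [0.1, -1.5, 3.14159, 1000.123]
count = len(weights)

# "f" packs IEEE 754 single precision (4 bytes per value),
# "e" packs IEEE 754 half precision (2 bytes per value).
fp32_bytes = struct.pack(f"<{count}f", *weights)
fp16_bytes = struct.pack(f"<{count}e", *weights)

# Read the half-precision values back to see how much they were rounded.
fp16_values = struct.unpack(f"<{count}e", fp16_bytes)
max_abs_error = max(abs(a - b) for a, b in zip(weights, fp16_values))

print(len(fp32_bytes), len(fp16_bytes))  # 16 8
print(max_abs_error)
```

Half precision halves the storage for the same number of values, at the cost of some rounding error. Whether that trade-off is acceptable depends on the model and should be validated against its accuracy requirements.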

Within this guide we will walk through converting a model to TensorRT. Refer to the NVIDIA Developer page for more information regarding how to get started with TensorRT.

Triton Inference Server Architecture

The Triton Inference Server serves inference requests using models stored in a locally available model repository. Once the models are available in Triton, a client application sends inference requests to the server. Python and C++ client libraries provide APIs to simplify this communication. Clients send inference requests directly to Triton using the HTTP/REST or gRPC protocols.
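As a rough sketch of what a model repository can look like on disk, the following script creates a minimal layout in a temporary directory: one directory per model, a `config.pbtxt` describing the model, and a numbered version subdirectory holding the model file. The model name `bert`, the tensor names, the dimensions, and the batch size are all assumptions chosen for illustration, and the `model.plan` file is an empty placeholder rather than a real TensorRT engine.

```python
import tempfile
from pathlib import Path

MODEL_NAME = "bert"   # hypothetical model name
MODEL_VERSION = "1"   # version directories are numeric

# Hypothetical model configuration; tensor names and dims are placeholders.
CONFIG_PBTXT = """\
name: "bert"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ 384 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 384, 2 ]
  }
]
"""


def create_model_repository(root):
    """Create <root>/<model>/config.pbtxt and <root>/<model>/<version>/model.plan."""
    model_dir = Path(root) / MODEL_NAME
    version_dir = model_dir / MODEL_VERSION
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(CONFIG_PBTXT)
    # Empty placeholder; a real repository would hold the converted engine here.
    (version_dir / "model.plan").write_bytes(b"")
    return model_dir


repo_root = tempfile.mkdtemp()
create_model_repository(repo_root)
layout = sorted(p.relative_to(repo_root).as_posix() for p in Path(repo_root).rglob("*"))
print(layout)
# ['bert', 'bert/1', 'bert/1/model.plan', 'bert/config.pbtxt']
```

The exact configuration fields a given model needs should be taken from the Triton model configuration documentation rather than from this sketch.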

The Triton Client SDK provides APIs to construct a client, which serializes each request and sends it to the server over the network. The server deserializes the request and places it in the request queue. Queued requests are grouped together for optimal performance and then computed as a batch (the queue size, batch size, and number of concurrent requests are configurable depending on the use case and model). The result is re-serialized and sent back to the client, which deserializes and processes it.
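The request flow described above can be illustrated with a toy simulation in plain Python. This is not the Triton Client SDK or the server's actual batching logic: JSON stands in for the wire format (the payload shape is loosely modeled on the JSON inference requests Triton's HTTP endpoint accepts, and its field names should be checked against the Triton documentation), `toy_model` is a stand-in for the real network, and the tensor names and `MAX_BATCH_SIZE` are hypothetical.

```python
import json
from collections import deque

MAX_BATCH_SIZE = 4  # hypothetical; configurable per model in a real deployment


def client_serialize(request_id, token_ids):
    """Client side: turn one request into bytes for the network."""
    payload = {
        "id": str(request_id),
        "inputs": [
            {
                "name": "input_ids",  # hypothetical tensor name
                "shape": [1, len(token_ids)],
                "datatype": "INT32",
                "data": token_ids,
            }
        ],
    }
    return json.dumps(payload).encode("utf-8")


def toy_model(batch):
    """Stand-in for the real network: returns the sum of each input."""
    return [sum(token_ids) for token_ids in batch]


def serve(wire_requests, max_batch_size=MAX_BATCH_SIZE):
    """Server side: deserialize, queue, compute in batches, re-serialize."""
    queue = deque(json.loads(raw.decode("utf-8")) for raw in wire_requests)
    responses, batch_sizes = [], []
    while queue:
        # Take up to max_batch_size queued requests and compute them together.
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        outputs = toy_model([req["inputs"][0]["data"] for req in batch])
        batch_sizes.append(len(batch))
        for req, out in zip(batch, outputs):
            reply = {"id": req["id"], "outputs": [{"name": "score", "data": [out]}]}
            responses.append(json.dumps(reply).encode("utf-8"))
    return responses, batch_sizes


wire = [client_serialize(i, [i, i + 1, i + 2]) for i in range(6)]
raw_responses, batch_sizes = serve(wire)
# Client side: deserialize the responses before processing them.
results = [json.loads(raw.decode("utf-8")) for raw in raw_responses]
print(batch_sizes)                       # [4, 2]
print(results[0]["outputs"][0]["data"])  # [3]
```

Six requests with a batch size of four are computed as one batch of four and one batch of two. A real server batches requests as they arrive over time rather than draining a pre-filled queue, so this only shows the grouping idea.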

This guide uses a Natural Language Processing task as its example for inference with Triton Inference Server, giving the reader a complete picture of the enterprise workflow.