Triton Inference Server Overview
Now that you have successfully trained and saved the model, the next step is to deploy it to Triton Inference Server. In this section of the lab, you will become familiar with the key elements of successfully deploying trained models to Triton Inference Server for fraud detection. We use the same VM to train the model and to run Triton Inference Server in this lab.
Triton Inference Server
Triton Inference Server simplifies the deployment of AI models by serving inference requests at scale in production. It lets teams deploy trained AI models from any framework (TensorFlow, NVIDIA® TensorRT, PyTorch, ONNX Runtime, or custom) and run those models on any GPU- or CPU-based infrastructure (cloud, data center, or edge).
Triton Inference FIL Backend
An XGBoost model can be deployed on NVIDIA’s Triton Inference Server using the server’s Forest Inference Library (FIL) backend, one of the many frameworks it supports. By using Triton Inference Server with NVIDIA GPUs, users can achieve significantly improved throughput and latency on their requests. This matters for workflows such as fraud detection, where individual predictions must be made quickly (low latency) and many predictions must be served in a fixed amount of time to meet demand (high throughput).
XGBoost Model Storage
Before starting the server, you will need to set up a “model repository” directory containing the model you wish to serve as well as a configuration file. The FIL backend currently supports forest models serialized in XGBoost’s binary format, XGBoost’s JSON format, LightGBM’s text format, and Treelite’s binary checkpoint format. In this lab we used XGBoost’s binary format to serialize the model.
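For reference, an XGBoost binary file can be produced by calling save_model on a trained booster. The snippet below is a minimal sketch using synthetic data; the parameters, feature count, and file name are placeholders rather than the lab's exact training code, and note that recent XGBoost releases deprecate the legacy binary format in favor of JSON, so the extension-based behavior can vary by version.

import numpy as np
import xgboost as xgb

# Synthetic stand-in for the lab's fraud-detection features and labels.
X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# Illustrative parameters; not the lab's exact training configuration.
params = {"objective": "binary:logistic", "max_depth": 8}
bst = xgb.train(params, dtrain, num_boost_round=100)

# A file name without a .json suffix is written in XGBoost's binary format
# (in versions that still support it); the resulting file is what gets
# placed in the Triton model repository described below.
bst.save_model("xgboost.model")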
Once you have a serialized model, you will need to prepare a directory structure similar to the following example, which uses an XGBoost binary file:
model_repository/
`-- fil
    |-- 1
    |   `-- xgboost.model
    `-- config.pbtxt
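For reference, a config.pbtxt for an XGBoost classifier served through the FIL backend generally looks like the sketch below, adapted from the FIL backend documentation. The tensor names input__0 and output__0 are the names the FIL backend expects, but the feature count, batch size, and parameter values here are placeholders rather than the exact configuration prepared for this lab.

name: "fil"
backend: "fil"
max_batch_size: 32768
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 20 ]   # placeholder: set to the number of input features
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  }
]
dynamic_batching {}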
The model files, configuration file, and directory structure have already been prepared for you in this lab.
You can read more about storing and configuring models for inference on Triton in the Triton FIL backend GitHub repository. Feel free to modify the default configuration and deployment if you wish.
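As a rough sketch of how the server is typically started against such a repository (not necessarily the exact command used in this lab), Triton can be launched from the NGC container, with <xx.yy> standing in for a release tag of your choice:

docker run --gpus=all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models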