FIL Backend Model Configuration#
As with all Triton backends, models deployed via the FIL backend make use of a specially laid-out “model repository” directory containing at least one serialized model and a config.pbtxt configuration file:
model_repository/
├─ example/
│ ├─ 1/
│ │ ├─ model.json
│ ├─ 2/
│ │ ├─ model.json
│ ├─ config.pbtxt
Documentation for general Triton configuration options is available in the main Triton docs, but here, we will review options specific to the FIL backend. For a more succinct overview, refer to the FAQ notebook, which includes guides for configuration under specific deployment scenarios.
Structure of configuration file#
A typical config.pbtxt file might look something like this:
backend: "fil"
max_batch_size: 32768
input [
{
name: "input__0"
data_type: TYPE_FP32
dims: [ 32 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ 1 ]
}
]
instance_group [{ kind: KIND_AUTO }]
parameters [
{
key: "model_type"
value: { string_value: "xgboost_ubj" }
},
{
key: "is_classifier"
value: { string_value: "true" }
}
]
dynamic_batching {}
Note that (as suggested by the file extension) this is a Protobuf text file and should be formatted accordingly.
Specifying the backend#
If you wish to use the FIL backend, you must indicate this in the configuration file with the top-level backend: "fil" option, as shown below. For information on models supported by the FIL backend, see Model Support or the FAQ notebook.
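In a config.pbtxt file, this option appears as a single top-level line:
backend: "fil"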
Maximum Batch Size#
Because of the relatively quick execution speed of most tree models and the inherent parallelism of FIL, the maximum batch size is typically limited only by the size of Triton’s CUDA memory pool, which is set at server launch. Nevertheless, you must specify some maximum batch size here, so setting it to whatever (large) value is consistent with your memory usage needs alongside other models should work fine:
max_batch_size: 1048576
Inputs and Outputs#
Input and output tensors for a model are described using three entries:
name, data_type, and dims. The name for the input tensor should
always be "input__0", and the name for the primary output tensor should
always be "output__0".
The FIL backend currently uses 32-bit precision exclusively. 64-bit model parameters are rounded to 32-bit values, although optional support for 64-bit execution should be added in the near future. At present, however, both inputs and outputs should always use TYPE_FP32 for their data_type.
The dimensions of the I/O tensors do not include the batch dimension and are model-dependent. The input tensor’s single dimension should just be the number of features (columns) in a single input sample (row).
The output tensor’s dimensions depend on whether the predict_proba option (described below) is used. If it is not used, the return value for each sample is simply the index of the output class, and the single output dimension is 1. Otherwise, a probability will be returned for every class in the model, and the single output dimension should be the number of classes. For data transfer efficiency, binary models are considered to have a single class.
Below, we see an example I/O specification for a model with 32 input
features and 3 output classes with the predict_proba flag enabled:
input [
{
name: "input__0"
data_type: TYPE_FP32
dims: [ 32 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ 3 ]
}
]
Specifying CPU/GPU execution#
Triton loads multiple copies or “instances” of its models to help take
maximum advantage of available hardware. You may control details of this via
the top-level instance_group option.
The simplest use of this option is simply:
instance_group [{ kind: KIND_AUTO }]
This will load one instance on each available NVIDIA GPU. If no compatible
NVIDIA GPU is found, a single instance will instead be loaded on the CPU.
Use KIND_GPU or KIND_CPU in place of KIND_AUTO if you wish to explicitly specify GPU or CPU execution.
You may also specify count in addition to kind if you wish to load additional model instances, as in the example below. See the main Triton docs for more information.
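For instance, the following configuration (with an illustrative count value) would load two instances of the model on each available GPU:
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]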
Dynamic Batching#
One of the most useful features of the Triton server is its ability to batch multiple requests and evaluate them together. To enable this feature, include the following top-level option:
dynamic_batching {}
Except for cases where latency is critical down to the level of microseconds, we strongly recommend enabling dynamic batching for all deployments.
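Triton’s general dynamic batching options may also be supplied within this block. For example, the (illustrative) setting below allows the scheduler to wait up to 100 microseconds to gather additional requests into a larger batch:
dynamic_batching {
  max_queue_delay_microseconds: 100
}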
FIL-Specific Options#
All other options covered here are specific to FIL and will go in the
parameters section of the configuration. Triton’s backend-specific parameters
are represented as string values and converted when read.
Model Type#
The FIL backend accepts models in a number of serialization formats, including XGBoost JSON and binary formats, LightGBM’s text format, and Treelite’s checkpoint format. For more information, see Model Support.
The model_type option is used to indicate which of these serialization
formats your model uses: xgboost_ubj for XGBoost UBJSON, xgboost_json for
XGBoost JSON, xgboost for XGBoost binary (legacy), lightgbm for LightGBM,
or treelite_checkpoint for Treelite:
parameters [
{
key: "model_type"
value: { string_value: "xgboost_ubj" }
}
]
Model Filenames#
For each model type, Triton expects a particular default filename:
- xgboost.ubj for XGBoost UBJSON
- xgboost.json for XGBoost JSON
- xgboost.model for XGBoost Binary (Legacy)
- model.txt for LightGBM
- checkpoint.tl for Treelite
It is recommended that you use these filenames, but custom filenames can be specified using Triton’s usual configuration options, as in the example below.
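For example, Triton’s top-level default_model_filename option (part of Triton’s general model configuration, not specific to FIL) can be used to point at a nonstandard filename; the name below is purely illustrative:
default_model_filename: "my_xgboost_model.json"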
Classification vs. Regression (is_classifier)#
Set is_classifier to true if your model is a classification model or
false if your model is a regressor:
parameters [
{
key: "is_classifier"
value: { string_value: "true" }
}
]
Classification confidence scores (predict_proba)#
For classifiers, if you wish to return a confidence score for each class rather than simply a class ID, set predict_proba to true:
parameters [
{
key: "predict_proba"
value: { string_value: "true" }
}
]
Allow unknown fields in XGBoost JSON model (xgboost_allow_unknown_field)#
For XGBoost JSON models, setting this option to true causes unknown fields to be ignored instead of triggering a validation error. This flag is ignored for other kinds of models:
parameters [
{
key: "xgboost_allow_unknown_field"
value: { string_value: "true" }
}
]
Decision Threshold#
For binary classifiers, it is sometimes helpful to set a specific
confidence threshold for positive decisions. This can be set via the
threshold parameter. If unset, an implicit threshold of 0.5 is used for
binary classifiers.
parameters [
{
key: "threshold"
value: { string_value: "0.3" }
}
]
Performance parameters#
The FIL backend includes several parameters that can be tuned to optimize latency and throughput for your model. These parameters will not affect model output, but experimenting with them can significantly improve model performance for your specific use case.
chunk_size (GPU only)#
This parameter applies only to GPU deployments and determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size, and it is difficult to predict in advance.
In general, servers under higher load or those receiving larger batches from
clients will benefit from a higher value. chunk_size can be any power of 2
between 1 and 32.
parameters [
{
key: "chunk_size"
value: { string_value: "4" }
}
]
layout (GPU only)#
This parameter determines how nodes within a tree are organized in memory as well as how they are accessed during inference.
The depth_first and breadth_first options lay out the nodes in the
depth-first and breadth-first orders, respectively.
The layered option is unique in that it interleaves nodes from multiple trees, unlike depth_first or breadth_first. When the layered option is chosen, FIL will traverse the forest by proceeding through the root nodes of each tree first, followed by the children of those root nodes for each tree, and so forth. This traversal order ensures that all nodes of a particular tree at a particular depth are traversed together.
In general, depth_first tends to work best for deeper trees (depth 4 or greater), whereas breadth_first tends to work better for shallow trees. Your mileage may vary, so be sure to measure performance when choosing a configuration.
parameters [
{
key: "layout"
value: { string_value: "depth_first" }
}
]
transfer_threshold (GPU only)#
For extremely lightweight models deployed with very light traffic and small batch sizes, the overhead of moving data to the GPU can outweigh the benefit of faster inference. As traffic increases, however, you will want to take advantage of available NVIDIA GPUs to provide optimal throughput and latency. To facilitate this, transfer_threshold can be set to an integer value indicating the number of rows beyond which data will be transferred to the GPU. When this setting is beneficial at all, it typically takes a small value (~1-5 for typical hardware configurations). Most models are unlikely to benefit from setting it to any value other than 0, however.
parameters [
{
key: "transfer_threshold"
value: { string_value: "2" }
}
]