Running the Server

Example Model Repository

Before running the TensorRT Inference Server, you must first set up a model repository containing the models that TRTIS will make available for inferencing.

An example model repository containing a Caffe2 ResNet50 model, a TensorFlow Inception model, and a simple TensorFlow GraphDef model (used by the simple_client example) is provided in the docs/examples/model_repository directory. Before using the example model repository you must fetch any missing model definition files from their public model zoos:

$ cd docs/examples
$ ./fetch_models.sh
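
After the script completes, each model in the example repository has its own directory containing a config.pbtxt model configuration file and at least one numbered version subdirectory holding the model definition. The layout below is only an illustrative sketch; the resnet50_netdef name matches the status example later in this section, while the other directory and file names are assumptions based on typical framework defaults and may differ in your checkout:

docs/examples/model_repository/
  resnet50_netdef/          # Caffe2 ResNet50 (platform: caffe2_netdef)
    config.pbtxt
    1/
      model.netdef
      init_model.netdef
  inception_graphdef/       # TensorFlow Inception (assumed name)
    config.pbtxt
    1/
      model.graphdef
  simple/                   # model used by simple_client (assumed name)
    config.pbtxt
    1/
      model.graphdef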

Running the Inference Server

Before running TRTIS, you must first set up a model repository containing the models that TRTIS will make available for inferencing. Section Model Repository describes how to create your own model repository. You can also use Example Model Repository to set up an example model repository.

Assuming the sample model repository is available in /path/to/model/repository, the following command runs the container you pulled from NGC or built locally:

$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repository:/models <tensorrtserver image name> trtserver --model-store=/models

Where <tensorrtserver image name> will be something like nvcr.io/nvidia/tensorrtserver:19.01-py3 if you pulled the container from the NGC registry, or tensorrtserver if you built it from source.

The nvidia-docker -v option maps /path/to/model/repository on the host into the container at /models, and the --model-store option tells TRTIS to use /models as its model repository.

The -p flags expose the container ports where TRTIS listens for HTTP requests (port 8000), listens for GRPC requests (port 8001), and reports Prometheus metrics (port 8002).
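
If any of those default host ports are already in use, standard Docker port mapping can be used to remap them. The following sketch (same image and model repository as above) exposes the HTTP endpoint on host port 9000 instead of 8000:

$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p9000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repository:/models <tensorrtserver image name> trtserver --model-store=/models

HTTP requests are then sent to localhost:9000 on the host while TRTIS continues to listen on port 8000 inside the container.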

The --shm-size and --ulimit flags are recommended to improve TRTIS performance. For --shm-size the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of the models being served.

For more information on the Prometheus metrics provided by the inference server see Metrics.
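
As a quick check that the metrics port is reachable, you can query it directly with curl. Assuming the default port mapping shown above, the metrics are exposed on port 8002 at the /metrics path:

$ curl localhost:8002/metrics

The response is plain text in the Prometheus exposition format and can be scraped directly by a Prometheus server.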

Checking Inference Server Status

The simplest way to verify that TRTIS is running correctly is to use the Status API to query the server’s status. From the host system, use curl to request the server status from the HTTP endpoint. The response is protobuf text showing the status of the server and of each model being served, for example:

$ curl localhost:8000/api/status
id: "inference:0"
version: "0.6.0"
uptime_ns: 23322988571
model_status {
  key: "resnet50_netdef"
  value {
    config {
      name: "resnet50_netdef"
      platform: "caffe2_netdef"
    }
    ...
    version_status {
      key: 1
      value {
        ready_state: MODEL_READY
      }
    }
  }
}
ready_state: SERVER_READY

This status shows configuration information and indicates that version 1 of the resnet50_netdef model is MODEL_READY. This means that TRTIS is ready to accept inferencing requests for version 1 of that model. A model version's ready_state will show as MODEL_UNAVAILABLE if that version failed to load for some reason.
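
The Status API can also be scoped to a single model, which is convenient when the server hosts many models. The following sketch queries just the resnet50_netdef model by appending the model name to the status endpoint:

$ curl localhost:8000/api/status/resnet50_netdef

The response uses the same protobuf text format as above but contains only the model_status entry for the requested model.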