Running the Server

For best performance the TensorRT Inference Server should be run on a system that contains Docker, nvidia-docker, CUDA and one or more supported GPUs, as explained in Running The Inference Server. The inference server can also be run on non-CUDA, non-GPU systems as described in Running The Inference Server On A System Without A GPU.

Example Model Repository

Before running the TensorRT Inference Server, you must first set up a model repository containing the models that the server will make available for inferencing.

An example model repository containing a Caffe2 ResNet50 model, a TensorFlow Inception model, and a simple TensorFlow GraphDef model (used by the simple_client example) is provided in the docs/examples/model_repository directory. Before using the example model repository you must fetch any missing model definition files from their public model zoos:

$ cd docs/examples
$ ./fetch_models.sh
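
After fetching, the example repository contains one directory per model, each holding a config.pbtxt configuration file and one or more numbered version subdirectories. The layout below is an illustrative sketch; the model directory and file names are assumptions based on the models listed above and may differ slightly:

docs/examples/model_repository/
  inception_graphdef/
    config.pbtxt
    1/model.graphdef
  resnet50_netdef/
    config.pbtxt
    1/model.netdef
  simple/
    config.pbtxt
    1/model.graphdef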

Running The Inference Server

Before running the inference server, you must first set up a model repository containing the models that the server will make available for inferencing. Section Model Repository describes how to create your own model repository. You can also use Example Model Repository to set up an example model repository.

Assuming the sample model repository is available in /path/to/model/repository, the following command runs the container you pulled from NGC or built locally:

$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repository:/models <tensorrtserver image name> trtserver --model-store=/models

Where <tensorrtserver image name> will be something like nvcr.io/nvidia/tensorrtserver:19.04-py3 if you pulled the container from the NGC registry, or tensorrtserver if you built it from source.
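For example, to pull the container from the NGC registry before running it (the tag shown is only an example; use the release you intend to run):

$ docker pull nvcr.io/nvidia/tensorrtserver:19.04-py3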

The nvidia-docker -v option maps /path/to/model/repository on the host into the container at /models, and the --model-store option tells the server to use /models as the model repository.

The -p flags expose the container ports where the inference server listens for HTTP requests (port 8000), listens for GRPC requests (port 8001), and reports Prometheus metrics (port 8002).
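
If a default host port is already in use, the -p flags can map a container port to a different host port while leaving the container-side ports unchanged, for example:

-p9000:8000 -p8001:8001 -p8002:8002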

The --shm-size and --ulimit flags are recommended to improve the server’s performance. For --shm-size the minimum recommended size is 1g but larger sizes may be necessary depending on the number and size of models being served.
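
For example, when serving many large models you might raise the shared memory size (2g here is only an illustrative value):

--shm-size=2g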

For more information on the Prometheus metrics provided by the inference server see Metrics.
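
A quick way to confirm that metrics are being reported is to query the metrics endpoint on port 8002 from the host; the response is plain-text Prometheus metrics:

$ curl localhost:8002/metrics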

Running The Inference Server On A System Without A GPU

On a system without GPUs, run the inference server using docker instead of nvidia-docker. The command is otherwise identical to the one described in Running The Inference Server:

$ docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repository:/models <tensorrtserver image name> trtserver --model-store=/models

Because a GPU is not available, the inference server will be unable to load any model configuration that requires a GPU or that specifies a GPU instance in an instance-group configuration.
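
For a model to load on a CPU-only system its configuration must allow CPU execution. As a sketch (assuming the instance-group syntax described in Model Configuration), a model's config.pbtxt can request CPU instances like this:

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]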

Checking Inference Server Status

The simplest way to verify that the inference server is running correctly is to use the Status API to query the server's status. From the host system, use curl to access the HTTP endpoint and request the server status. The response is protobuf text showing the status of the server and of each model being served, for example:

$ curl localhost:8000/api/status
id: "inference:0"
version: "0.6.0"
uptime_ns: 23322988571
model_status {
  key: "resnet50_netdef"
  value {
    config {
      name: "resnet50_netdef"
      platform: "caffe2_netdef"
    }
    ...
    version_status {
      key: 1
      value {
        ready_state: MODEL_READY
      }
    }
  }
}
ready_state: SERVER_READY

This status shows configuration information and indicates that version 1 of the resnet50_netdef model is MODEL_READY, meaning the server is ready to accept inference requests for version 1 of that model. A model version's ready_state will show MODEL_UNAVAILABLE if the model failed to load for some reason.
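
You can also request the status of a single model by appending the model name to the status endpoint, or use the health endpoints for a lightweight liveness/readiness check (the endpoints shown assume the server's HTTP API; substitute a model name from your repository):

$ curl localhost:8000/api/status/resnet50_netdef
$ curl localhost:8000/api/health/ready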