Running the TensorRT Inference Server

Before running the inference server, you must first set up a model store containing the models that the server will make available for inferencing. The TensorRT Inference Server documentation describes how to create a model store. For this example, assume the model store has been created in the host system directory /path/to/model/store.
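
A model store is a directory containing one subdirectory per model, where each model subdirectory holds a configuration file and one or more numbered version subdirectories. The layout below is only an illustrative sketch: the model names, version numbers, and framework-specific file names are placeholders, and the TensorRT Inference Server documentation describes the exact files required for each supported framework.

/path/to/model/store/
  resnet50_plan/           # hypothetical TensorRT model
    config.pbtxt           # model configuration
    1/                     # version 1 of the model
      model.plan           # TensorRT engine file
  simple_graphdef/         # hypothetical TensorFlow GraphDef model
    config.pbtxt
    1/
      model.graphdef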

The following command will launch the inference server using that model store.
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -v/path/to/model/store:/tmp/models inferenceserver:19.xx-py3 /opt/tensorrtserver/bin/trtserver --model-store=/tmp/models

In this command, inferenceserver:19.xx-py3 is the container image that was pulled from the NGC container registry, as described in Installing PreBuilt Containers.

The nvidia-docker -v option maps /path/to/model/store on the host into the container at /tmp/models, and the --model-store option points the inference server at /tmp/models as its model store.

The inference server listens for HTTP requests on port 8000, and the above command uses the -p flag to map container port 8000 to host port 8000. The command also maps container port 8001 to host port 8001, which the server uses for gRPC requests. A different host port can be used by modifying the -p flag; for example, -p9000:8000 makes the inference server available on host port 9000.

The --shm-size and --ulimit flags are recommended to improve inference server performance. For --shm-size, the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of the models being served.

After starting, the inference server logs initialization information to the console. Initialization is complete and the server is ready to accept requests once the console shows the following:
Starting server listening on :8000
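
At this point you can verify from the host that the server is reachable. The commands below are a sketch that assumes the default -p8000:8000 mapping and uses the HTTP health and status endpoints described in the TensorRT Inference Server documentation.

# Returns HTTP 200 when the server is ready to accept inference requests
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/api/health/ready

# Prints the server status, including the models loaded from the model store
$ curl localhost:8000/api/status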

Additionally, C++ and Python client libraries and examples are available on GitHub: TensorRT Inference Server. These libraries and examples demonstrate how to communicate with the inference server container from a C++ or Python application.
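
For example, the client libraries and examples can be obtained by cloning the repository and following its client documentation; the URL below assumes the NVIDIA/tensorrt-inference-server GitHub project.

# Clone the repository containing the client libraries and examples
$ git clone https://github.com/NVIDIA/tensorrt-inference-server.git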