Running The TensorRT Inference Server

Before running the TensorRT inference server, you must first set up a model store containing the models that the server will make available for inferencing. The TensorRT Inference Server User Guide - Model Store section describes how to create a model store. For this example, assume the model store is created in the host system directory /path/to/model/store.
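
As a minimal sketch of the layout described there (the model name resnet50_plan, the version directory 1, and the file names are illustrative, not taken from this guide), each model has its own subdirectory in the model store containing a config.pbtxt configuration file and one or more numbered version subdirectories that hold the model file:
$ mkdir -p /path/to/model/store/resnet50_plan/1
$ cp resnet50.plan /path/to/model/store/resnet50_plan/1/model.plan
$ cp resnet50_config.pbtxt /path/to/model/store/resnet50_plan/config.pbtxt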

For 18.09 and later releases, the following command will launch the TensorRT inference server using that model store:
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -v/path/to/model/store:/tmp/models inferenceserver:18.xx-py3 /opt/tensorrtserver/bin/trtserver --model-store=/tmp/models 
For 18.08 and earlier releases, use the following command:
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -v/path/to/model/store:/tmp/models inferenceserver:18.xx-py3 /opt/inference_server/bin/inference_server --model-store=/tmp/models 

Where inferenceserver:18.xx-py3 is the container that was pulled from the NGC container registry as described in https://docs.nvidia.com/deeplearning/dgx/inference-user-guide/index.html#pullcontainer.

The nvidia-docker -v option maps /path/to/model/store on the host into the container at /tmp/models, and the --model-store option tells the inference server to use /tmp/models as the model store.

The TensorRT inference server listens on port 8000, and the above command uses the -p flag to map container port 8000 to host port 8000. A different host port can be used by modifying the -p flag; for example, -p9000:8000 makes the TensorRT inference server available on host port 9000, as in the command below.
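
With host port 9000, the 18.09 and later command above becomes the following (all other options unchanged):
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p9000:8000 -p8001:8001 -v/path/to/model/store:/tmp/models inferenceserver:18.xx-py3 /opt/tensorrtserver/bin/trtserver --model-store=/tmp/models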

The --shm-size and --ulimit flags are recommended to improve TensorRT inference server performance. For --shm-size, the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of models being served.

After starting, the TensorRT inference server will log initialization information to the console. Initialization is complete and the server is ready to accept requests after the console shows the following:
Starting server listening on :8000
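
To verify readiness from the host, you can send an HTTP request to the server. The sketch below assumes the readiness endpoint /api/health/ready and the default -p8000:8000 port mapping shown above; a 200 OK response indicates the server is ready to accept inference requests.
$ curl -i localhost:8000/api/health/ready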

Additionally, C++ and Python client libraries and examples are available at GitHub: TensorRT inference server. These libraries and examples demonstrate how to communicate with the TensorRT inference server container from a C++ or Python application.
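
Before writing a client application, a quick way to confirm that the server has loaded your models is to query its HTTP status endpoint from the host. The endpoint path /api/status and the port used below are assumptions based on the default configuration above, not something stated in this section; the response should list each model in the model store along with its availability.
$ curl localhost:8000/api/status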