Running the TensorRT Inference Server

Before running the inference server, you must first set up a model store containing the models that the server will make available for inferencing. The TensorRT Inference Server documentation describes how to create a model store. For this example, assume the model store is created on the host system directory /path/to/model/store.
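Before launching, it can help to confirm that the model store directory exists and is not empty. The following is a minimal sketch, not part of the inference server itself: check_model_store is a hypothetical helper name, and the check assumes that each model occupies its own subdirectory of the model store, which should be confirmed against the TensorRT Inference Server documentation.

```shell
# Sketch: verify that a model store directory looks usable before launching
# the server. check_model_store is a hypothetical helper name.
# Assumption (not stated in this guide): each model lives in its own
# subdirectory of the model store.
check_model_store() (
    store_dir="$1"

    if [ ! -d "$store_dir" ]; then
        echo "model store not found: $store_dir" >&2
        exit 1
    fi

    # Succeed as soon as one subdirectory is found.
    for entry in "$store_dir"/*/; do
        if [ -d "$entry" ]; then
            exit 0
        fi
    done

    echo "model store contains no model directories: $store_dir" >&2
    exit 1
)
```

For example, running check_model_store /path/to/model/store before the docker command returns a non-zero status, with a message on standard error, if the directory is missing or empty.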

The following command launches the inference server using that model store:
docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -v/path/to/model/store:/tmp/models tensorrtserver:19.xx-py3 /opt/tensorrtserver/bin/trtserver --model-store=/tmp/models

Here, tensorrtserver:19.xx-py3 is the container image that was pulled from the NGC container registry as described in Installing PreBuilt Containers.

The docker -v option maps /path/to/model/store on the host into the container at /tmp/models, and the --model-store option tells the inference server to use /tmp/models as the model store.

The inference server listens for HTTP requests on port 8000 and for GRPC requests on port 8001, and the above command uses the -p flag to map each of these container ports to the same port number on the host. A different host port can be used by modifying the -p flag; for example, -p9000:8000 makes the HTTP endpoint of the inference server available on host port 9000.

The --shm-size and --ulimit flags are recommended to improve inference server performance. For --shm-size, the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of the models being served.
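The host port and shared memory size are the values most likely to change between deployments. The sketch below is one way to keep them as parameters: it prints the launch command so that it can be reviewed before it is run. The helper name trtserver_cmd and its argument order are illustrative assumptions, and the 19.xx in the image tag is the same placeholder used above and must be replaced with the release that was actually pulled.

```shell
# Sketch: print the launch command with a configurable model store path,
# host port, and shared memory size. trtserver_cmd is a hypothetical helper.
# The image tag 19.xx is a placeholder and must be replaced before use.
trtserver_cmd() (
    model_store="$1"          # host directory holding the model store
    host_port="${2:-8000}"    # host port mapped to container port 8000
    shm_size="${3:-1g}"       # 1g is the minimum recommended size

    echo "docker run --gpus all --rm" \
        "--shm-size=${shm_size} --ulimit memlock=-1 --ulimit stack=67108864" \
        "-p${host_port}:8000 -p8001:8001" \
        "-v${model_store}:/tmp/models" \
        "tensorrtserver:19.xx-py3" \
        "/opt/tensorrtserver/bin/trtserver --model-store=/tmp/models"
)

# Print the command for host port 9000 and a 2g shared memory size.
trtserver_cmd /path/to/model/store 9000 2g
```

Printing the command rather than executing it keeps the sketch safe to run on a system without Docker or a GPU. Paths that contain spaces would need additional quoting before the printed command is pasted into a shell.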

Additionally, C++ and Python client libraries and examples are available at GitHub: TensorRT Inference Server. These libraries and examples demonstrate how to communicate with the inference server container from a C++ or Python application.
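Before pointing a client application at the container, it can be useful to wait until the server accepts requests on the mapped host port. The following sketch polls the server with curl. It assumes that the server exposes an HTTP readiness endpoint at /api/health/ready that returns status 200 once the models are loaded; that path is an assumption to confirm against the API documentation for the release in use. The function names server_ready and wait_for_server are hypothetical.

```shell
# Sketch: poll the inference server over HTTP until it reports ready.
# Assumption: a readiness endpoint exists at /api/health/ready and returns
# HTTP 200 when the server is ready. Confirm the path for your release.
server_ready() (
    host="$1"
    port="$2"
    # -w '%{http_code}' prints only the HTTP status code of the response.
    code="$(curl -s -o /dev/null -w '%{http_code}' \
        "http://${host}:${port}/api/health/ready")" || code="000"
    [ "$code" = "200" ]
)

# Retry once per second, up to the given number of attempts.
wait_for_server() (
    host="${1:-localhost}"
    port="${2:-8000}"
    attempts="${3:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if server_ready "$host" "$port"; then
            exit 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "server at ${host}:${port} not ready after ${attempts} attempts" >&2
    exit 1
)
```

For example, wait_for_server localhost 8000 30 returns zero as soon as the server answers with status 200, and non-zero after 30 failed attempts. If the server was started with a different host port, such as 9000, pass that port instead.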