Running The Inference Server
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 --mount type=bind,source=/path/to/model/store,target=/tmp/models <container> /opt/inference_server/bin/inference_server --model-store=/tmp/models

Where <container> is the name of the container that was pulled from the NVIDIA DGX or NGC container registry as described in Pulling A Container.
The nvidia-docker --mount option maps /path/to/model/store on the host into the container at /tmp/models, and the --model-store option to the inference server is used to point to /tmp/models as the model store.
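The model store is simply a directory tree on the host that gets mounted into the container. As an illustrative sketch only (the model names, version directories, and file names below are assumptions, not the layout mandated by the server), a store typically holds one subdirectory per model:

```shell
#!/bin/sh
# Sketch of a possible model-store layout. The model names ("resnet50",
# "my_classifier"), numeric version directories, and "model.plan" file
# names are illustrative assumptions.
STORE=$(mktemp -d)   # stand-in for /path/to/model/store on the host

# One subdirectory per model; each model holds versioned subdirectories.
mkdir -p "$STORE/resnet50/1"
mkdir -p "$STORE/resnet50/2"
mkdir -p "$STORE/my_classifier/1"

# Placeholder model files (real files come from your training/export step).
touch "$STORE/resnet50/1/model.plan"
touch "$STORE/resnet50/2/model.plan"
touch "$STORE/my_classifier/1/model.plan"

# List the layout that would be mounted into the container at /tmp/models.
find "$STORE" -mindepth 1 | sort
```

With this layout, passing --mount type=bind,source=$STORE,target=/tmp/models makes the whole tree visible to the server at /tmp/models.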
The Inference Server listens on port 8000, and the above command uses the -p flag to map container port 8000 to host port 8000. A different host port can be used by changing the -p flag; for example, -p9000:8000 makes the Inference Server available on host port 9000.
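The port mapping follows the standard docker -pHOST:CONTAINER form; the container-side port stays 8000 because that is where the server listens. A minimal sketch (the HOST_PORT variable name is illustrative) that builds the flag for an alternate host port:

```shell
#!/bin/sh
# Build the docker -p flag for an alternate host port. Only the host side
# changes; the container side remains 8000, where the Inference Server
# listens.
HOST_PORT=9000
CONTAINER_PORT=8000
PORT_FLAG="-p${HOST_PORT}:${CONTAINER_PORT}"
echo "$PORT_FLAG"
# The flag would then be passed to nvidia-docker run, e.g.:
#   nvidia-docker run --rm "$PORT_FLAG" ... <container> ...
```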
The --shm-size and --ulimit flags are recommended to improve Inference Server performance. For --shm-size the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of the models being served.
When the Inference Server starts successfully, it logs a message similar to the following:

Starting server listening on :8000
Additionally, C++ and Python client libraries and examples are available at GitHub: Inference Server. These libraries and examples demonstrate how to communicate with the Inference Server container from a C++ or Python application.