Running The Inference Server
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -v/path/to/model/store:/tmp/models inferenceserver:18.07-py<x> /opt/inference_server/bin/inference_server --model-store=/tmp/models

Where inferenceserver:18.07-py<x> is the container that was pulled from the NVIDIA DGX or NGC container registry as described in https://docs.nvidia.com/deeplearning/dgx/inference-user-guide/index.html#pullcontainer.
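For reference, pulling the container from the NGC registry typically looks like the following sketch. The nvcr.io/nvidia registry path and the py3 suffix are assumptions here (py<x> may be py2 or py3); confirm the exact image name against the pull instructions linked above.

$ docker pull nvcr.io/nvidia/inferenceserver:18.07-py3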
The nvidia-docker -v option maps /path/to/model/store on the host into the container at /tmp/models, and the --model-store option tells the inference server to use /tmp/models as its model store.
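As an illustrative sketch only, the model store is a directory with one subdirectory per model; in releases of this era each model directory typically contains a configuration file and numbered version subdirectories. The model name, file names, and layout below are assumptions to verify against the model store documentation in the user guide:

/path/to/model/store/
    mymodel/              # hypothetical model name
        config.pbtxt      # model configuration file (assumed layout)
        1/                # numeric version subdirectory
            model.plan    # serialized model; file name/format depends on framework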
The Inference Server listens on port 8000, and the above command uses the -p flag to map container port 8000 to host port 8000. A different host port can be used by changing the -p flag; for example, -p9000:8000 makes the Inference Server available on host port 9000.
The --shm-size and --ulimit flags are recommended to improve Inference Server performance. For --shm-size the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of the models being served.
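Putting these options together, an invocation that publishes the server on host port 9000 and uses a larger shared-memory size (both values illustrative) would be:

$ nvidia-docker run --rm --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p9000:8000 -v/path/to/model/store:/tmp/models inferenceserver:18.07-py<x> /opt/inference_server/bin/inference_server --model-store=/tmp/models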
After the server starts, it prints a message indicating that it is ready to accept requests:

Starting server listening on :8000
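Once this message appears, you can verify from the host that the server is reachable on the mapped port. The HTTP endpoints are version specific; the status path below is an assumption to check against the API documentation for your release:

$ curl localhost:8000/api/status    # hypothetical status endpoint; verify for your release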
Additionally, C++ and Python client libraries and examples are available at GitHub: Inference Server. These libraries and examples demonstrate how to communicate with the Inference Server container from a C++ or Python application.
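As a sketch, assuming the clients are hosted in the dl-inference-server repository used by releases of this era (verify against the GitHub link above), fetching the libraries and examples might look like:

$ git clone https://github.com/NVIDIA/dl-inference-server    # assumed repository name
$ cd dl-inference-server    # contains the C++ and Python client libraries and examples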