Running The Inference Server
$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 --mount type=bind,source=/path/to/model/store,target=/tmp/models <container> /opt/inference_server/bin/inference_server --model-store=/tmp/models

Where <container> is the name of the container that was pulled from the NVIDIA DGX or NGC container registry as described in Pulling A Container.
The nvidia-docker --mount option maps /path/to/model/store on the host into the container at /tmp/models, and the --model-store option to the inference server is used to point to /tmp/models as the model store.
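The model store is simply a directory tree on the host that gets mounted into the container. As an illustrative sketch only (the model names, version directories, and file names below are assumptions, not the layout mandated by the server), a store typically holds one subdirectory per model:

```shell
#!/bin/sh
# Sketch of a possible model-store layout. The model names ("resnet50",
# "my_classifier"), numeric version directories, and "model.plan" file
# names are illustrative assumptions.
STORE=$(mktemp -d)   # stand-in for /path/to/model/store on the host

# One subdirectory per model; each model holds versioned subdirectories.
mkdir -p "$STORE/resnet50/1"
mkdir -p "$STORE/resnet50/2"
mkdir -p "$STORE/my_classifier/1"

# Placeholder model files (real files come from your training/export step).
touch "$STORE/resnet50/1/model.plan"
touch "$STORE/resnet50/2/model.plan"
touch "$STORE/my_classifier/1/model.plan"

# List the layout that would be mounted into the container at /tmp/models.
find "$STORE" -mindepth 1 | sort
```

With this layout, passing --mount type=bind,source=$STORE,target=/tmp/models makes the whole tree visible to the server at /tmp/models.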
The Inference Server listens on port 8000, and the above command uses the -p flag to map container port 8000 to host port 8000. A different host port can be used by changing the -p flag; for example, -p9000:8000 makes the Inference Server available on host port 9000.
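The port mapping follows the standard docker -pHOST:CONTAINER form; the container-side port stays 8000 because that is where the server listens. A minimal sketch (the HOST_PORT variable name is illustrative) that builds the flag for an alternate host port:

```shell
#!/bin/sh
# Build the docker -p flag for an alternate host port. Only the host side
# changes; the container side remains 8000, where the Inference Server
# listens.
HOST_PORT=9000
CONTAINER_PORT=8000
PORT_FLAG="-p${HOST_PORT}:${CONTAINER_PORT}"
echo "$PORT_FLAG"
# The flag would then be passed to nvidia-docker run, e.g.:
#   nvidia-docker run --rm "$PORT_FLAG" ... <container> ...
```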
The --shm-size and --ulimit flags are recommended to improve Inference Server performance. For --shm-size the minimum recommended size is 1g, but larger sizes may be necessary depending on the number and size of the models being served.
When the Inference Server starts successfully, it logs a message similar to the following:

Starting server listening on :8000
Additionally, C++ and Python client libraries and examples are available at GitHub: Inference Server. These libraries and examples demonstrate how to communicate with the Inference Server container from a C++ or Python application.