Clara Deploy SDK Model Repository
Clara Deploy SDK Platform Model Repository is a service available starting with Clara Deploy SDK version 0.7.2. The Model Repository provides:
- Centralized repository of all inference-models.
- Common mechanism for pipeline-operators to reference and make use of inference-models.
- Easy way to ensure that inference-models are accessible to the pipeline-operators which request them.
- Fully initialized Triton Inference Server with all requested models preloaded.
- Model instance reuse to avoid the overhead of unnecessary model loading.
Starting with Clara Deploy SDK version 0.7.2, Clara Platform Server is deployed with Model Repository support. Models can be uploaded using either the GRPC API or the Clara CLI. The API and CLI also support several other Model Repository functions, such as listing all models known to the repository, removing a model from the repository, and downloading models from the repository.
Using the Model Repository as part of your pipelines is as easy as adding a model: list to a pipeline-operator definition. When an operator definition contains a list of models, Platform Server will ensure that a Triton Inference Server, preloaded with all requested models, is available to the requesting operator before starting the job.
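As an illustrative sketch, an operator definition requesting a model might look like the following. The pipeline name, container image, and model name are placeholders, and the field layout follows common Clara pipeline-definition conventions rather than being a verbatim example from the SDK:

```yaml
api-version: 0.4.0
name: liver-segmentation-pipeline
operators:
- name: segmentation-inference
  description: Runs inference against the Triton server provisioned for this job.
  container:
    image: example/ai-liver-seg   # placeholder image
    tag: latest
  models:
  - name: segmentation_liver_v1   # placeholder model name; must exist in the Model Repository
```

Because the operator lists a model, Platform Server will not start the job until a Triton Inference Server with that model preloaded is available to the operator.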
When a pipeline definition contains an operator definition which requests one or more inference-models, Platform Server ensures that the models are available from its Model Repository, and that they are preloaded on a Triton Inference Server before starting any jobs associated with the pipeline.
Once the pipeline-job has started, the HTTP and GRPC URIs [uniform resource identifiers] will be made available to the operator as environment variables. Reading the NVIDIA_TRITON_GRPCURI environment variable will provide your operator with a connection string which can be used with the Triton Inference GRPC Client to connect to the Triton server and perform inference operations. Reading the NVIDIA_TRITON_HTTPURI environment variable provides the equivalent connection string for the Triton Inference HTTP Client.
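A minimal sketch of reading these connection strings from an operator's environment. The URI values shown are stand-ins, and the commented client call assumes the `tritonclient` Python package; it is not part of the Clara SDK itself:

```python
import os

def triton_uris(environ=None):
    """Return the (http, grpc) connection strings that Clara Platform
    Server publishes to an operator's environment at job start."""
    env = os.environ if environ is None else environ
    return env["NVIDIA_TRITON_HTTPURI"], env["NVIDIA_TRITON_GRPCURI"]

# Example with a stand-in environment (values are illustrative only):
http_uri, grpc_uri = triton_uris({
    "NVIDIA_TRITON_HTTPURI": "10.0.0.5:8000",
    "NVIDIA_TRITON_GRPCURI": "10.0.0.5:8001",
})

# Inside a real operator, the GRPC URI would then be handed to the
# Triton GRPC client (assumed package), e.g.:
#   import tritonclient.grpc as grpcclient
#   client = grpcclient.InferenceServerClient(url=grpc_uri)
```

In a running operator, `triton_uris()` would be called with no arguments so the values come from the job's actual environment.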
When pipeline definitions include model requests, Platform Server will ensure:
- All models requested by an operator are present and available from the same Triton Inference Server.
- When multiple operators in the same pipeline definition make model requests, each operator will be provided with a Triton Inference Server preloaded with the models it requested.
- Where possible, Triton Inference Servers will be shared across the operators of the same pipeline-job. This minimizes the number of resources required to execute the pipeline-job and improves job concurrency.
- A Triton instance will not be concurrently shared between pipeline-jobs, even when multiple jobs using the same pipeline definition are executing concurrently.
- To minimize unnecessary model-loading waits, operator model requests will be matched with Triton servers which already have most or all of the requested models loaded and are not currently assigned to an executing pipeline-job.
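The matching behavior described above can be sketched as a simple selection policy. This is an illustrative model of the policy, not Platform Server's actual implementation; server records and field names here are invented for the example:

```python
def pick_triton_server(requested, servers):
    """Pick a Triton server for an operator's model request.

    `requested` is a set of model names; `servers` is a list of dicts
    with 'loaded' (set of model names already on the server) and
    'assigned' (True if the server is serving an executing job).

    Policy sketch: never reuse a server assigned to another job, and
    among the free servers prefer the one with the largest overlap
    with the requested models, so the fewest new loads are needed.
    """
    candidates = [s for s in servers if not s["assigned"]]
    if not candidates:
        return None  # a new Triton instance would have to be provisioned
    return max(candidates, key=lambda s: len(requested & s["loaded"]))

# Example: server "c" has everything loaded but is busy, so the best
# free match is "b", which needs no additional model loads.
servers = [
    {"name": "a", "loaded": {"liver_seg"}, "assigned": False},
    {"name": "b", "loaded": {"liver_seg", "lung_seg"}, "assigned": False},
    {"name": "c", "loaded": {"liver_seg", "lung_seg"}, "assigned": True},
]
best = pick_triton_server({"liver_seg", "lung_seg"}, servers)
```

The key property the sketch captures is that job isolation (never sharing an assigned instance) takes priority over load-avoidance (maximizing model overlap).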