This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
The KServe v2 API is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is a widely adopted inference solution that complies with the KServe v2 API. To let Triton users quickly explore Dynamo's benefits, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the KServe v2 API. It focuses on the Dynamo components that work together to support the KServe API and on how users can migrate an existing KServe deployment to Dynamo.
Supported endpoints:

- ModelInfer endpoint: KServe standard endpoint, as described in the KServe v2 specification.
- ModelStreamInfer endpoint: Triton extension endpoint that provides a bi-directional streaming version of the inference RPC, allowing a sequence of inference requests/responses to be sent over a gRPC stream, as described in the Triton extension documentation.
- ModelMetadata endpoint: KServe standard endpoint, as described in the KServe v2 specification.
- ModelConfig endpoint: Triton extension endpoint, as described in the Triton extension documentation.

To start the KServe frontend, run the command below:
The gRPC server supports optional HTTP/2 flow control tuning via environment variables. These can be set before starting the server to optimize for high-throughput streaming workloads.
If these variables are not set, the server uses tonic’s default values.
Tune these values based on your workload. The connection window should accommodate concurrent_requests × request_size. The memory overhead equals the connection window size, which is shared across all streams. See gRPC performance best practices and gRPC channel arguments for more details.
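The sizing rule above can be worked through with a small calculation. This is a minimal sketch, not part of Dynamo: the function name and the workload numbers are hypothetical. The only external fact it relies on is that HTTP/2 caps a flow-control window at 2^31 - 1 bytes.

```python
# HTTP/2 limits any flow-control window to 2^31 - 1 bytes.
HTTP2_MAX_WINDOW = 2**31 - 1


def connection_window_bytes(concurrent_requests: int, request_size_bytes: int) -> int:
    """Estimate a connection window as concurrent_requests x request_size.

    Hypothetical helper for illustration; the result is clamped to the
    largest window HTTP/2 allows.
    """
    if concurrent_requests <= 0 or request_size_bytes <= 0:
        raise ValueError("both arguments must be positive")
    return min(concurrent_requests * request_size_bytes, HTTP2_MAX_WINDOW)


# Example workload (assumed numbers): 64 in-flight requests of 256 KiB each.
window = connection_window_bytes(64, 256 * 1024)
print(window)  # 16777216 bytes, i.e. 16 MiB
```

Because the connection window is shared across all streams, the computed value is also the approximate per-connection memory overhead to budget for.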
As with the HTTP frontend, registered backends are auto-discovered and added to the frontend's list of served models. To register a backend, use the same register_model() API. The frontend currently supports serving the following model type and model input combinations:

- ModelType::Completions and ModelInput::Text: combination for an LLM backend that uses a custom preprocessor.
- ModelType::Completions and ModelInput::Token: combination for an LLM backend that uses the Dynamo preprocessor (i.e. the Dynamo SGLang / TRTLLM / vLLM backends).
- ModelType::TensorBased and ModelInput::Tensor: combination for a backend that is used for generic tensor-based inference.

The first two combinations are backed by the OpenAI Completions API; see the OpenAI Completions section for more detail. The last combination is the most closely aligned with the KServe API, and users can replace an existing deployment with Dynamo once their backends implement an adaptor for NvCreateTensorRequest/NvCreateTensorResponse; see the Tensor section for more detail.
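The supported combinations can be summarized as a small lookup table. This is an illustrative sketch only: it uses plain strings rather than the real ModelType / ModelInput enums, and the helper name is hypothetical.

```python
# Plain-string stand-ins for the ModelType / ModelInput values named above.
SUPPORTED_COMBINATIONS = {
    ("Completions", "Text"): "LLM backend with a custom preprocessor",
    ("Completions", "Token"): "LLM backend using the Dynamo preprocessor",
    ("TensorBased", "Tensor"): "generic tensor-based inference backend",
}


def describe_combination(model_type: str, model_input: str) -> str:
    """Return the use case for a combination, or raise if it is unsupported."""
    try:
        return SUPPORTED_COMBINATIONS[(model_type, model_input)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {model_type} + {model_input}"
        ) from None


print(describe_combination("TensorBased", "Tensor"))
```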
Most Dynamo features are tailored for LLM inference. The combinations backed by the OpenAI API enable those features and are best suited for exploring them. However, this requires a specific conversion between generic tensor-based messages and OpenAI messages, and it imposes a specific structure on the KServe request message.
The metadata and config endpoints report the following for the registered backend. Note that this is not the exact response.
On receiving an inference request, the following conversion is performed:

- text_input: the element is expected to contain the user prompt string and is converted to the prompt field in the OpenAI Completion request.
- streaming: the element is converted to the stream field in the OpenAI Completion request.

On receiving a model response, the following conversion is performed:

- text_output: each element corresponds to one choice in the OpenAI Completion response, and its content is set to the text of the choice.
- finish_reason: each element corresponds to one choice in the OpenAI Completion response, and its content is set to the finish_reason of the choice.

The ModelType::TensorBased and ModelInput::Tensor combination is used when the user is migrating an existing KServe-based backend into the Dynamo ecosystem.
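As a rough illustration of the text_input / streaming and text_output / finish_reason conversions for the OpenAI Completions combinations, the mapping can be sketched with plain dictionaries. This is not the frontend's implementation: the function names and the dictionary layout of the KServe tensors are assumptions made for the example. Only the tensor names and the OpenAI field names come from the text.

```python
def kserve_to_completion_request(model_name, inputs):
    """Map KServe input tensors to an OpenAI-Completions-style request.

    `inputs` is a hypothetical layout: tensor name -> list of elements.
    """
    request = {
        "model": model_name,
        # text_input carries the user prompt string.
        "prompt": inputs["text_input"][0],
        # streaming is optional here and defaults to False (an assumption).
        "stream": bool(inputs.get("streaming", [False])[0]),
    }
    return request


def completion_response_to_kserve(response):
    """Map an OpenAI-Completions-style response to KServe output tensors."""
    choices = response["choices"]
    return {
        # One element per choice in each output tensor.
        "text_output": [choice["text"] for choice in choices],
        "finish_reason": [choice.get("finish_reason") or "" for choice in choices],
    }


request = kserve_to_completion_request(
    "my-model", {"text_input": ["Hello"], "streaming": [True]}
)
outputs = completion_response_to_kserve(
    {"choices": [{"text": " world", "finish_reason": "stop"}]}
)
print(request)
print(outputs)
```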
When registering the backend, the backend must provide the model's metadata, because a tensor-based deployment is generic and the frontend cannot make the assumptions it makes for an OpenAI Completions model. There are two methods to provide model metadata:

- version: 1, platform: "dynamo", backend: "dynamo". Note that for the model config endpoint, the rest of the fields will be set to their default values.
- TensorModelConfig::triton_model_config, which supersedes the other fields in TensorModelConfig and is used for endpoint responses. triton_model_config is expected to be the serialized string of the ModelConfig protobuf message; see echo_tensor_worker.py for an example.

When receiving an inference request, the backend receives an NvCreateTensorRequest and is expected to return an NvCreateTensorResponse. These are Dynamo's mappings of the ModelInferRequest / ModelInferResponse protobuf messages.
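The request/response contract for a tensor-based backend can be pictured with a minimal echo handler. This is a sketch under stated assumptions, not Dynamo's API: plain dictionaries stand in for NvCreateTensorRequest and NvCreateTensorResponse, and the field names ("model", "tensors", and the per-tensor keys) are hypothetical. See echo_tensor_worker.py for the actual structures.

```python
import asyncio


async def echo_generate(request):
    """Echo the request tensors back as the response.

    `request` is a dict standing in for NvCreateTensorRequest; the yielded
    dict stands in for NvCreateTensorResponse (hypothetical shapes).
    """
    yield {"model": request["model"], "tensors": request["tensors"]}


async def _demo():
    request = {
        "model": "echo",
        "tensors": [
            {"name": "INPUT0", "datatype": "INT32", "shape": [2], "data": [1, 2]}
        ],
    }
    # Collect every response the handler yields for this request.
    return [response async for response in echo_generate(request)]


responses = asyncio.run(_demo())
print(responses[0]["tensors"][0]["data"])
```

An async generator is used here so that the same handler shape can yield one response for ModelInfer or several for a streaming call. Whether the real backend interface takes additional arguments is not covered by this sketch.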
The frontend can also be started through the Python bindings. This is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See server.py for an example.
The frontend includes an integrated router for request distribution. Configure the routing mode:
See Router Documentation for routing configuration details.
Backends auto-register with the frontend when they call register_model(). Supported backends: