Sequence Extension

This document describes Triton’s sequence extension. The sequence extension allows Triton to support stateful models that expect a sequence of related inference requests.

An inference request can specify that it is part of a sequence using the “sequence_id” parameter in the request and by using the “sequence_start” and “sequence_end” parameters to indicate the start and end of sequences.

Because this extension is supported, Triton reports “sequence” in the extensions field of its Server Metadata. Triton may additionally report “sequence(string_id)” in the extensions field of the Server Metadata if the “sequence_id” parameter supports string types.
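
A client can discover these capabilities by reading the server metadata. The following sketch, which assumes Triton's HTTP endpoint is listening on localhost:8000 and uses Python purely for illustration, queries the metadata and inspects the extensions field:

import requests

# GET /v2 returns the server metadata, including the "extensions" field.
meta = requests.get("http://localhost:8000/v2").json()
extensions = meta.get("extensions", [])

supports_sequences = "sequence" in extensions
supports_string_ids = "sequence(string_id)" in extensions

The extension defines the following request parameters: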

  • “sequence_id” : a string or uint64 value that identifies the sequence to which a request belongs. All inference requests that belong to the same sequence must use the same sequence ID. A sequence ID of 0 or “” indicates the inference request is not part of a sequence.

  • “sequence_start” : boolean value that, if set to true in a request, indicates that the request is the first in a sequence. If not set, or set to false, the request is not the first in a sequence. If set to true, the “sequence_id” parameter must be set to a non-zero or non-empty string value.

  • “sequence_end” : boolean value that, if set to true in a request, indicates that the request is the last in a sequence. If not set, or set to false, the request is not the last in a sequence. If set to true, the “sequence_id” parameter must be set to a non-zero or non-empty string value. A sketch showing how these parameters combine over the lifetime of a sequence follows this list.
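
As an illustration, the “parameters” object of each request in a three-request sequence with ID 42 would carry the following values (a minimal sketch, written as Python dictionaries that mirror the JSON “parameters” objects; the rest of each request is unchanged):

# "parameters" object for each phase of a sequence with ID 42
start_params  = {"sequence_id": 42, "sequence_start": True}   # first request
middle_params = {"sequence_id": 42}                           # neither first nor last
end_params    = {"sequence_id": 42, "sequence_end": True}     # last request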

HTTP/REST

The following example shows how a request is marked as part of a sequence. In this case the sequence_start and sequence_end parameters are not used, which means that this request is neither the start nor the end of the sequence.

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
  "parameters" : { "sequence_id" : 42 }
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    }
  ],
  "outputs" : [
    {
      "name" : "output0",
    }
  ]
}

The example below uses a v4 UUID string as the value for the “sequence_id” parameter.

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
  "parameters" : { "sequence_id" : "e333c95a-07fc-42d2-ab16-033b1a566ed5" }
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    }
  ],
  "outputs" : [
    {
      "name" : "output0",
    }
  ]
}
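
The same requests can be generated from any HTTP client. The following sketch drives a complete sequence, start to end, using the Python requests library; it assumes the server address (localhost:8000) and model name (mymodel) from the examples above, uses arbitrary illustrative input data, and omits the optional "outputs" field so that all outputs are returned:

import requests

URL = "http://localhost:8000/v2/models/mymodel/infer"

def infer(data, params):
    # Build the same JSON body shown in the examples above.
    body = {
        "parameters": params,
        "inputs": [{
            "name": "input0",
            "shape": [2, 2],
            "datatype": "UINT32",
            "data": data,
        }],
    }
    resp = requests.post(URL, json=body)
    resp.raise_for_status()
    return resp.json()

# First, middle, and last requests of sequence 42.
infer([1, 2, 3, 4], {"sequence_id": 42, "sequence_start": True})
infer([5, 6, 7, 8], {"sequence_id": 42})
infer([9, 10, 11, 12], {"sequence_id": 42, "sequence_end": True})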

GRPC

In addition to supporting the sequence parameters described above, the GRPC API adds a streaming version of the inference API that allows a sequence of inference requests to be sent over the same GRPC stream. Requests that specify a sequence_id are not required to use this streaming API, and requests that do not specify a sequence_id may still use it. The ModelInferRequest is the same as for the ModelInfer API. The ModelStreamInferResponse message is shown below.

service GRPCInferenceService
{
  …

  // Perform inference using a specific model with GRPC streaming.
  rpc ModelStreamInfer(stream ModelInferRequest) returns (stream ModelStreamInferResponse) {}
}

// Response message for ModelStreamInfer.
message ModelStreamInferResponse
{
  // The message describing the error. An empty message
  // indicates the inference was successful without errors.
  string error_message = 1;

  // Holds the results of the request.
  ModelInferResponse infer_response = 2;
}
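
As a sketch of how the streaming API might be driven from Python, the following assumes stubs generated from Triton's GRPC service definition (the module names grpc_service_pb2 and grpc_service_pb2_grpc are placeholders that depend on how the protos were compiled) and Triton's default GRPC port 8001. The sequence parameters are attached to each ModelInferRequest through its parameters map; here the integer sequence ID is carried in the int64_param field of InferParameter, while a string ID would use string_param instead:

import grpc

# Placeholder module names; generate these from Triton's GRPC service
# definition with the protobuf and grpc compilers.
import grpc_service_pb2 as pb
import grpc_service_pb2_grpc as pb_grpc

def sequence_requests():
    """Yield the first, middle, and last requests of sequence 42."""
    for start, end in [(True, False), (False, False), (False, True)]:
        req = pb.ModelInferRequest()
        req.model_name = "mymodel"
        req.parameters["sequence_id"].int64_param = 42
        if start:
            req.parameters["sequence_start"].bool_param = True
        if end:
            req.parameters["sequence_end"].bool_param = True
        tensor = req.inputs.add()
        tensor.name = "input0"
        tensor.datatype = "UINT32"
        tensor.shape.extend([2, 2])
        tensor.contents.uint_contents.extend([1, 2, 3, 4])
        yield req

channel = grpc.insecure_channel("localhost:8001")
stub = pb_grpc.GRPCInferenceServiceStub(channel)

# All requests of the sequence travel over the same stream; responses
# arrive as ModelStreamInferResponse messages.
for response in stub.ModelStreamInfer(sequence_requests()):
    if response.error_message:
        raise RuntimeError(response.error_message)
    print(response.infer_response)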