Class InferContext

Class Documentation

class InferContext

An InferContext object is used to run inference on an inference server for a specific model.

Once created, an InferContext object can be used repeatedly to perform inference using the model. Options that control how inference is performed can be changed between inference runs.

An InferContext object can use either the HTTP protocol or the gRPC protocol, depending on which Create function is used (InferHttpContext::Create or InferGrpcContext::Create). For example:

std::unique_ptr<InferContext> ctx;
InferHttpContext::Create(&ctx, "localhost:8000", "mnist");
...
std::unique_ptr<Options> options0;
Options::Create(&options0);
options0->SetBatchSize(b);
options0->AddClassResult(output, topk);
ctx->SetRunOptions(*options0);
...
ctx->Run(&results0);  // run using options0
ctx->Run(&results1);  // run using options0
...
std::unique_ptr<Options> options1;
Options::Create(&options1);
options1->AddRawResult(output);
ctx->SetRunOptions(*options1);
...
ctx->Run(&results2);  // run using options1
ctx->Run(&results3);  // run using options1
...

Note
InferContext::Create methods are thread-safe. All other InferContext methods, and all nested class methods, are not thread-safe.
The Run() calls are not thread-safe but a new Run() can be invoked as soon as the previous completes. The returned result objects are owned by the caller and may be retained and accessed even after the InferContext object is destroyed.
AsyncRun() and GetAsyncRunResults() calls are not thread-safe. Moreover, calling one of these methods while the other is running results in undefined behavior, because both modify shared internal state.
For more parallelism, multiple InferContext objects can access the same inference server with no serialization requirements across those objects.
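
For example, a minimal sketch of that pattern, assuming the HTTP Create function shown above (the thread bodies elide input and option setup):

// Each thread creates and uses its own InferContext, so no
// cross-thread synchronization is needed.
auto infer_on_own_context = [](const std::string& url, const std::string& model) {
  std::unique_ptr<InferContext> ctx;
  InferHttpContext::Create(&ctx, url, model);
  // ... set inputs and run options as in the example above ...
  std::vector<std::unique_ptr<InferContext::Result>> results;
  ctx->Run(&results);  // safe: this context is used by one thread only
};

std::thread t0(infer_on_own_context, "localhost:8000", "mnist");
std::thread t1(infer_on_own_context, "localhost:8000", "mnist");
t0.join();
t1.join();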

Subclassed by nvidia::inferenceserver::client::InferGrpcContext, nvidia::inferenceserver::client::InferHttpContext

Public Functions

virtual ~InferContext()

Destroy the inference context.

const std::string &ModelName() const

Return
The name of the model being used for this context.

int ModelVersion() const

Return
The version of the model being used for this context. -1 indicates that the latest (i.e. highest version number) version of that model is being used.

uint64_t MaxBatchSize() const

Return
The maximum batch size supported by the context. A maximum batch size of 0 indicates that the context does not support batching and so only a single inference at a time can be performed.

const std::vector<std::shared_ptr<Input>> &Inputs() const

Return
The inputs of the model.

const std::vector<std::shared_ptr<Output>> &Outputs() const

Return
The outputs of the model.

Error GetInput(const std::string &name, std::shared_ptr<Input> *input) const

Get a named input.

Return
Error object indicating success or failure.
Parameters
  • name: The name of the input.
  • input: Returns the Input object for ‘name’.

Error GetOutput(const std::string &name, std::shared_ptr<Output> *output) const

Get a named output.

Return
Error object indicating success or failure.
Parameters
  • name: The name of the output.
  • output: Returns the Output object for ‘name’.
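
For example, a minimal sketch looking up tensors by name (the names "data" and "prob" are illustrative, and Error is assumed to expose IsOk() and Message() accessors):

std::shared_ptr<InferContext::Input> input;
Error err = ctx->GetInput("data", &input);
if (!err.IsOk()) {
  std::cerr << "failed to get input: " << err.Message() << std::endl;
}

std::shared_ptr<InferContext::Output> output;
err = ctx->GetOutput("prob", &output);
if (!err.IsOk()) {
  std::cerr << "failed to get output: " << err.Message() << std::endl;
}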

Error SetRunOptions(const Options &options)

Set the options to use for all subsequent Run() invocations.

Return
Error object indicating success or failure.
Parameters
  • options: The options.

Error GetStat(Stat *stat)

Get the current statistics of the InferContext.

Return
Error object indicating success or failure.
Parameters
  • stat: Returns the Stat object holding the statistics.

virtual Error Run(std::vector<std::unique_ptr<Result>> *results) = 0

Send a synchronous request to the inference server to perform an inference to produce results for the outputs specified in the most recent call to SetRunOptions().

The Result objects holding the output values are returned in the same order as the outputs are specified in the options.

Return
Error object indicating success or failure.
Parameters
  • results: Returns Result objects holding inference results.
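
A minimal synchronous flow might look like the following sketch, where 'input' and 'output' were obtained from GetInput() and GetOutput() and error checks are elided:

// Describe the batch and the requested outputs.
std::unique_ptr<Options> options;
Options::Create(&options);
options->SetBatchSize(1);
options->AddRawResult(output);
ctx->SetRunOptions(*options);

// Provide one instance of the input tensor values.
std::vector<uint8_t> input_data(input->ByteSize());
// ... fill 'input_data' with the tensor values ...
input->Reset();
input->SetRaw(input_data);

// Run and collect one Result per requested output.
std::vector<std::unique_ptr<InferContext::Result>> results;
ctx->Run(&results);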

virtual Error AsyncRun(std::shared_ptr<Request> *async_request) = 0

Send an asynchronous request to the inference server to perform an inference to produce results for the outputs specified in the most recent call to SetRunOptions().

Return
Error object indicating success or failure.
Parameters
  • async_request: Returns a Request object that can be used to retrieve the inference results for the request.

virtual Error GetAsyncRunResults(std::vector<std::unique_ptr<Result>> *results, const std::shared_ptr<Request> &async_request, bool wait) = 0

Get the results of the asynchronous request referenced by ‘async_request’.

The Result objects holding the output values are returned in the same order as the outputs are specified in the options when AsyncRun() was called.

Return
Error object indicating success or failure. Success will be returned only if the request has been completed successfully. UNAVAILABLE will be returned if ‘wait’ is false and the request is not ready.
Parameters
  • results: Return Result objects holding inference results.
  • async_request: Request handle to retrieve results.
  • wait: If true, block until the request completes. Otherwise, return immediately.
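
A minimal asynchronous flow, as a sketch with error checks elided:

std::shared_ptr<InferContext::Request> request;
ctx->AsyncRun(&request);

// ... do other work while the request is in flight ...

// Block until this request completes, then read its results.
std::vector<std::unique_ptr<InferContext::Result>> results;
ctx->GetAsyncRunResults(&results, request, true /* wait */);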

Error GetReadyAsyncRequest(std::shared_ptr<Request> *async_request, bool wait)

Get any one completed asynchronous request.

Return
Error object indicating success or failure. Success will be returned only if a completed request was returned. UNAVAILABLE will be returned if ‘wait’ is false and no request is ready.
Parameters
  • async_request: Returns the Request object holding the completed request.
  • wait: If true, block until the request completes. Otherwise, return immediately.
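
This is useful for draining several in-flight requests in completion order. A sketch, with error checks elided:

// Issue several requests back to back.
std::vector<std::shared_ptr<InferContext::Request>> requests(4);
for (auto& request : requests) {
  ctx->AsyncRun(&request);
}

// Retrieve results in whatever order the requests complete.
for (size_t i = 0; i < requests.size(); ++i) {
  std::shared_ptr<InferContext::Request> ready;
  ctx->GetReadyAsyncRequest(&ready, true /* wait */);
  std::vector<std::unique_ptr<InferContext::Result>> results;
  ctx->GetAsyncRunResults(&results, ready, true /* wait */);
}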

Protected Types

using AsyncReqMap = std::map<uintptr_t, std::shared_ptr<Request>>

Protected Functions

InferContext(const std::string&, int, bool)
virtual void AsyncTransfer() = 0
virtual Error PreRunProcessing(std::shared_ptr<Request> &request) = 0
Error IsRequestReady(const std::shared_ptr<Request> &async_request, bool wait)
Error UpdateStat(const RequestTimers &timer)

Protected Attributes

AsyncReqMap ongoing_async_requests_
const std::string model_name_
const int model_version_
const bool verbose_
uint64_t max_batch_size_
uint64_t total_input_byte_size_
uint64_t batch_size_
uint64_t async_request_id_
std::vector<std::shared_ptr<Input>> inputs_
std::vector<std::shared_ptr<Output>> outputs_
InferRequestHeader infer_request_
std::vector<std::shared_ptr<Output>> requested_outputs_
std::shared_ptr<Request> sync_request_
Stat context_stat_
std::thread worker_
std::mutex mutex_
std::condition_variable cv_
bool exiting_
class Input

An input to the model.

Public Functions

virtual ~Input()

Destroy the input.

virtual const std::string &Name() const = 0

Return
The name of the input.

virtual size_t ByteSize() const = 0

Return
The size in bytes of this input. This is the size for one instance of the input, not the entire size of a batched input.

virtual DataType DType() const = 0

Return
The data-type of the input.

virtual ModelInput::Format Format() const = 0

Return
The format of the input.

virtual const DimsList &Dims() const = 0

Return
The dimensions/shape of the input.

virtual Error Reset() = 0

Prepare this input to receive new tensor values.

Forget any existing values that were set by previous calls to SetRaw().

Return
Error object indicating success or failure.

virtual Error SetRaw(const uint8_t *input, size_t input_byte_size) = 0

Set tensor values for this input from a byte array.

The array is not copied and so it must not be modified or destroyed until this input is no longer needed (that is until the Run() call(s) that use the input have completed). For batched inputs this function must be called batch-size times to provide all tensor values for a batch of this input.

Return
Error object indicating success or failure.
Parameters
  • input: The pointer to the array holding the tensor value.
  • input_byte_size: The size of the array in bytes, must match the size expected by the input.

virtual Error SetRaw(const std::vector<uint8_t> &input) = 0

Set tensor values for this input from a byte vector.

The vector is not copied and so it must not be modified or destroyed until this input is no longer needed (that is until the Run() call(s) that use the input have completed). For batched inputs this function must be called batch-size times to provide all tensor values for a batch of this input.

Return
Error object indicating success or failure.
Parameters
  • input: The vector holding tensor values.
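
For example, a sketch providing values for a batch of size 3 with the vector overload (error checks elided; note that every buffer must stay alive until the Run() that uses it completes):

const size_t batch_size = 3;
std::vector<std::vector<uint8_t>> buffers(
    batch_size, std::vector<uint8_t>(input->ByteSize()));

input->Reset();
for (size_t i = 0; i < batch_size; ++i) {
  // ... fill buffers[i] with the tensor values for batch entry i ...
  input->SetRaw(buffers[i]);  // one call per batch entry
}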

class Options

Run options to be applied to all subsequent Run() invocations.

Public Functions

virtual ~Options()
virtual uint64_t CorrelationId() const = 0

Return
The correlation ID to use for all subsequent inferences. A value of 0 indicates the subsequent inferences have no correlation ID.

virtual void SetCorrelationId(uint64_t correlation_id) = 0

Set the correlation ID to use for all subsequent inferences.

Set to 0 to indicate that subsequent inferences should have no correlation ID.

Parameters
  • correlation_id: The correlation ID.

virtual size_t BatchSize() const = 0

Return
The batch size to use for all subsequent inferences.

virtual void SetBatchSize(size_t batch_size) = 0

Set the batch size to use for all subsequent inferences.

Parameters
  • batch_size: The batch size.

virtual Error AddRawResult(const std::shared_ptr<InferContext::Output> &output) = 0

Add ‘output’ to the list of requested RAW results.

Run() will return the output’s full tensor as a result.

Return
Error object indicating success or failure.
Parameters
  • output: The output.

virtual Error AddClassResult(const std::shared_ptr<InferContext::Output> &output, uint64_t k) = 0

Add ‘output’ to the list of requested CLASS results.

Run() will return the ‘k’ highest values of ‘output’ as a result.

Return
Error object indicating success or failure.
Parameters
  • output: The output.
  • k: Set how many class results to return for the output.

Public Static Functions

static Error Create(std::unique_ptr<Options> *options)

Create a new Options object with default values.

Return
Error object indicating success or failure.
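
For example, a sketch composing options for a correlated, batched request (‘raw_output’ and ‘class_output’ stand for Output objects obtained from GetOutput(); the values are illustrative):

std::unique_ptr<Options> options;
Options::Create(&options);
options->SetCorrelationId(42);             // 0 would mean no correlation ID
options->SetBatchSize(8);
options->AddRawResult(raw_output);         // full tensor for this output
options->AddClassResult(class_output, 5);  // top-5 classes for this output
ctx->SetRunOptions(*options);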

class Output

An output from the model.

Public Functions

virtual ~Output()

Destroy the output.

virtual const std::string &Name() const = 0

Return
The name of the output.

virtual size_t ByteSize() const = 0

Return
The size in bytes of this output. This is the size for one instance of the output, not the entire size of a batched output.

virtual DataType DType() const = 0

Return
The data-type of the output.

virtual const DimsList &Dims() const = 0

Return
The dimensions/shape of the output.

class Request

Handle to an inference request.

The request handle is used to retrieve the results of a request sent by AsyncRun().

Public Functions

virtual ~Request()

Destroy the request handle.

virtual uint64_t Id() const = 0

Return
The unique identifier of the request.

class RequestTimers

Timer to record the timestamp for different stages of request handling.

Public Types

enum Kind

The kind of the timer.

Values:

REQUEST_START

The start of request handling.

REQUEST_END

The end of request handling.

SEND_START

The start of sending request bytes to the server (i.e. first byte).

SEND_END

The end of sending request bytes to the server (i.e. last byte).

RECEIVE_START

The start of receiving response bytes from the server (i.e. first byte).

RECEIVE_END

The end of receiving response bytes from the server (i.e. last byte).

Public Functions

RequestTimers()

Construct a timer with zeroed timestamps.

Error Reset()

Reset all timestamp values to zero.

Must be called before re-using the timer.

Return
Error object indicating success or failure.

Error Record(Kind kind)

Record the current timestamp for a request stage.

Return
Error object indicating success or failure.
Parameters
  • kind: The Kind of the timestamp.

class Result

An inference result corresponding to an output.

Public Types

enum ResultFormat

Format in which result is returned.

Values:

RAW = 0

RAW format is the entire result tensor of values.

CLASS = 1

CLASS format is the top-k highest probability values of the result and the associated class label (if provided by the model).

Public Functions

virtual ~Result()

Destroy the result.

virtual const std::string &ModelName() const = 0

Return
The name of the model that produced this result.

virtual uint32_t ModelVersion() const = 0

Return
The version of the model that produced this result.

virtual const std::shared_ptr<Output> GetOutput() const = 0

Return
The Output object corresponding to this result.

virtual Error GetRaw(size_t batch_idx, const std::vector<uint8_t> **buf) const = 0

Get a reference to the entire raw result data for a specific batch entry.

Returns an error if this result is not RAW format.

Return
Error object indicating success or failure.
Parameters
  • batch_idx: The index of the batch entry to return results for.
  • buf: Returns the vector of result bytes.

virtual Error GetRawAtCursor(size_t batch_idx, const uint8_t **buf, size_t adv_byte_size) = 0

Get a reference to raw result data for a specific batch entry at the current “cursor” and advance the cursor by the specified number of bytes.

More typically, use the GetRawAtCursor<T>() method to return the data as a specific type T. Use ResetCursor() to reset the cursor to the beginning of the result. Returns an error if this result is not RAW format.

Return
Error object indicating success or failure.
Parameters
  • batch_idx: The index of the batch entry to return results for.
  • buf: Returns pointer to ‘adv_byte_size’ bytes of data.
  • adv_byte_size: The number of bytes of data to get a reference to.

template <typename T>
Error GetRawAtCursor(size_t batch_idx, T *out)

Read a value for a specific batch entry at the current “cursor” from the result tensor as the specified type T and advance the cursor.

Use ResetCursor() to reset the cursor to the beginning of the result. Returns an error if this result is not RAW format.

Return
Error object indicating success or failure.
Parameters
  • batch_idx: The index of the batch entry to read from.
  • out: Returns the value at the cursor.
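
For example, a sketch reading a RAW result both ways, assuming the output holds float values (error checks elided):

// The entire tensor for batch entry 0, as raw bytes.
const std::vector<uint8_t>* raw;
result->GetRaw(0 /* batch_idx */, &raw);

// Or read it element by element as floats, advancing the cursor.
const size_t count = result->GetOutput()->ByteSize() / sizeof(float);
result->ResetCursor(0);
for (size_t i = 0; i < count; ++i) {
  float value;
  result->GetRawAtCursor(0 /* batch_idx */, &value);
  // ... use 'value' ...
}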

virtual Error GetClassCount(size_t batch_idx, size_t *cnt) const = 0

Get the number of class results for a batch.

Returns an error if this result is not CLASS format.

Return
Error object indicating success or failure.
Parameters
  • batch_idx: The index in the batch.
  • cnt: Returns the number of ClassResult entries for the batch entry.

virtual Error GetClassAtCursor(size_t batch_idx, ClassResult *result) = 0

Get the ClassResult result for a specific batch entry at the current cursor.

Use ResetCursor() to reset the cursor to the beginning of the result. Returns an error if this result is not CLASS format.

Return
Error object indicating success or failure.
Parameters
  • batch_idx: The index in the batch.
  • result: Returns the ClassResult value for the batch at the cursor.
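
For example, a sketch iterating the top-k classes for batch entry 0 (error checks elided):

size_t count;
result->GetClassCount(0 /* batch_idx */, &count);

result->ResetCursor(0);
for (size_t i = 0; i < count; ++i) {
  InferContext::Result::ClassResult cls;
  result->GetClassAtCursor(0 /* batch_idx */, &cls);
  std::cout << cls.idx << ": " << cls.value
            << " (" << cls.label << ")" << std::endl;
}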

virtual Error ResetCursors() = 0

Reset cursor to beginning of result for all batch entries.

Return
Error object indicating success or failure.

virtual Error ResetCursor(size_t batch_idx) = 0

Reset cursor to beginning of result for specified batch entry.

Return
Error object indicating success or failure.
Parameters
  • batch_idx: The index in the batch.

struct ClassResult

The result value for CLASS format results.

Public Members

size_t idx

The index of the class in the result vector.

float value

The value of the class.

std::string label

The label for the class, if provided by the model.

struct Stat

Cumulative statistics of the InferContext.

Note
For the gRPC protocol, ‘cumulative_send_time_ns’ represents the time to marshal the inference request and ‘cumulative_receive_time_ns’ represents the time to unmarshal the inference response.

Public Functions

Stat()

Create a new Stat object with zeroed statistics.

Public Members

size_t completed_request_count

Total number of requests completed.

uint64_t cumulative_total_request_time_ns

Time from the request start until the response is completely received.

uint64_t cumulative_send_time_ns

Time from the request start until the last byte is sent.

uint64_t cumulative_receive_time_ns

Time from receiving first byte of the response until the response is completely received.
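
For example, a sketch deriving the average per-request latency from these counters (error checks elided):

InferContext::Stat stat;
ctx->GetStat(&stat);

if (stat.completed_request_count > 0) {
  const double avg_request_ms =
      stat.cumulative_total_request_time_ns /
      static_cast<double>(stat.completed_request_count) / 1.0e6;
  std::cout << "avg request latency: " << avg_request_ms << " ms" << std::endl;
}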