nvidia.dali.plugin.numba.fn.experimental.numba_function

nvidia.dali.plugin.numba.fn.experimental.numba_function(__input_0, __input_1=None, __input_2=None, __input_3=None, __input_4=None, __input_5=None, /, *, batch_processing=False, blocks=None, bytes_per_sample_hint=[0], in_types, ins_ndim, out_types, outs_ndim, preserve=False, run_fn, setup_fn=None, threads_per_block=None, device=None, name=None)

Invokes an njit-compiled Numba function.

The run function should be a Python function that can be compiled in Numba nopython mode. A function taking a single input and producing a single output should have the following signature:

def run_fn(out0, in0)

where out0 and in0 are NumPy array views of the output and input tensors, respectively. If the operator is configured to run in batch mode, then the first dimension of the arrays is the sample index.

Note that the function can take at most 6 inputs and 6 outputs.

Additionally, an optional setup function can be provided. It calculates the shape of the output so that DALI can allocate memory for it, and has the following signature:

def setup_fn(outs, ins)

The setup function is invoked once for the whole batch. The first dimension of outs, ins is the number of outputs/inputs, respectively. The second dimension is the sample index. For example, the first sample on the second output can be accessed by outs[1][0].
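The indexing convention above can be sketched with plain nested lists standing in for the shape arrays that DALI passes to the setup function. The function name setup_two_outputs and the shapes are hypothetical, chosen only for illustration:

```python
def setup_two_outputs(outs, ins):
    # outs[k][s] is the shape of sample s on output k; ins uses the same layout.
    in0 = ins[0]
    for s in range(len(in0)):
        # Output 0 keeps the shape of the input sample.
        for d in range(len(in0[s])):
            outs[0][s][d] = in0[s][d]
        # Output 1 is 1-D, with one element per row of the input sample.
        outs[1][s][0] = in0[s][0]


# Nested lists stand in for the per-batch shape arrays (an assumption of
# this sketch): one input with two 2-D samples, and two outputs.
ins = [[[4, 6], [2, 3]]]
outs = [[[0, 0], [0, 0]], [[0], [0]]]
setup_two_outputs(outs, ins)

print(outs[1][0])  # shape of the first sample on the second output
```

Here outs[1][0] ends up holding the 1-D shape derived from the first input sample, while outs[0] mirrors the input shapes.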

If no setup function is provided, the output shape and data type will be the same as those of the input.

Note

This operator is experimental and its API might change without notice.

Warning

When the pipeline has conditional execution enabled, additional steps must be taken to prevent the run_fn and setup_fn functions from being rewritten by AutoGraph. There are two ways to achieve this:

Define the functions at global scope (i.e. outside of pipeline_def scope).

If the functions are returned by another “factory” function, then the factory function must be defined outside the pipeline definition function and decorated with @do_not_convert.

More details can be found in the @do_not_convert documentation.
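A minimal sketch of the factory approach follows. The factory make_scale_fn and its factor argument are hypothetical. The import path nvidia.dali.pipeline.do_not_convert is the one assumed here, and the fallback decorator exists only so the sketch runs where DALI is not installed:

```python
try:
    from nvidia.dali.pipeline import do_not_convert
except ImportError:
    # Fallback so the sketch runs without DALI installed; NOT part of DALI.
    def do_not_convert(func):
        return func


@do_not_convert
def make_scale_fn(factor):
    # Hypothetical factory: returns a per-sample run function that multiplies
    # each element of a 1-D input by `factor`.
    def run_fn(out0, in0):
        for i in range(len(in0)):
            out0[i] = in0[i] * factor

    return run_fn


# The factory is defined and called at global scope, outside any pipeline
# definition function.
scale_by_two = make_scale_fn(2)

# Quick check with plain lists standing in for the array views.
out = [0, 0, 0]
scale_by_two(out, [1, 2, 3])
```

The resulting scale_by_two would then be passed as run_fn to the operator.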

Example 1:

The following example shows a simple setup function which permutes the order of dimensions in the shape.

def setup_change_out_shape(outs, ins):
    out0 = outs[0]
    in0 = ins[0]
    perm = [1, 0, 2]
    for sample_idx in range(len(out0)):
        for d in range(len(perm)):
            out0[sample_idx][d] = in0[sample_idx][perm[d]]

Because the setup function runs once for the whole batch, we need to iterate over the samples and permute each sample’s shape individually. For input shapes = [(10, 20, 30), (20, 10, 30)] it will produce output shapes = [(20, 10, 30), (10, 20, 30)].

Let’s also provide a run function:
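The shape permutation can be checked outside DALI by calling the setup function directly. In this sketch, nested lists stand in for the shape arrays that DALI would pass, which is an assumption made so the example has no dependencies:

```python
def setup_change_out_shape(outs, ins):
    out0 = outs[0]
    in0 = ins[0]
    perm = [1, 0, 2]
    for sample_idx in range(len(out0)):
        for d in range(len(perm)):
            out0[sample_idx][d] = in0[sample_idx][perm[d]]


# One input and one output, two samples each.
ins = [[[10, 20, 30], [20, 10, 30]]]
outs = [[[0, 0, 0], [0, 0, 0]]]
setup_change_out_shape(outs, ins)

print(outs[0])  # [[20, 10, 30], [10, 20, 30]]
```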

def run_fn(out0, in0):
    for i in range(in0.shape[0]):
        for j in range(in0.shape[1]):
            out0[j, i] = in0[i, j]

The run function can work per-sample or per-batch, depending on the batch_processing argument.

A run function working per-batch may look like this:

def run_fn(out0_samples, in0_samples):
    for out0, in0 in zip(out0_samples, in0_samples):
        for i in range(in0.shape[0]):
            for j in range(in0.shape[1]):
                out0[j, i] = in0[i, j]

A run function working per-sample may look like this:

def run_fn(out0, in0):
    for i in range(in0.shape[0]):
        for j in range(in0.shape[1]):
            out0[j, i] = in0[i, j]
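The two variants can be compared in plain NumPy, without DALI or Numba. This is only a sketch: the names run_per_sample and run_per_batch are hypothetical, and a stacked array with the sample index as its first dimension stands in for the batch:

```python
import numpy as np


def run_per_sample(out0, in0):
    for i in range(in0.shape[0]):
        for j in range(in0.shape[1]):
            out0[j, i] = in0[i, j]


def run_per_batch(out0_samples, in0_samples):
    for out0, in0 in zip(out0_samples, in0_samples):
        for i in range(in0.shape[0]):
            for j in range(in0.shape[1]):
                out0[j, i] = in0[i, j]


# Two samples of shape (3, 4), stacked along the sample axis.
batch_in = np.arange(24).reshape(2, 3, 4)

# Per-sample: the caller loops over the samples.
out_a = np.zeros((2, 4, 3), dtype=batch_in.dtype)
for s in range(batch_in.shape[0]):
    run_per_sample(out_a[s], batch_in[s])

# Per-batch: the function loops over the samples itself.
out_b = np.zeros((2, 4, 3), dtype=batch_in.dtype)
run_per_batch(out_b, batch_in)

print(np.array_equal(out_a, out_b))
```

Both variants write the same transposed samples. The difference is only in who iterates over the batch.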

This operator allows sequence inputs and supports volumetric data.

This operator will not be optimized out of the graph.

Supported backends

‘cpu’
‘gpu’

Parameters:

__input_0 (TensorList) – Input to the operator.
__input_[1..5] (TensorList, optional) – This function accepts up to 5 optional positional inputs.

Keyword Arguments:

batch_processing (bool, optional, default = False) –
Determines whether the function is invoked once per batch or separately for each sample in the batch.

When batch_processing is set to True, the function processes the whole batch. This is necessary if the function has to perform cross-sample operations and may be beneficial if a significant part of the work can be reused. For other use cases, specifying False and using a per-sample processing function allows the operator to process samples in parallel.
blocks (int or list of int, optional) –
3-item list specifying the number of blocks per grid used to execute a CUDA kernel.
bytes_per_sample_hint (int or list of int, optional, default = [0]) –
Output size hint, in bytes per sample.

If specified, the operator’s outputs residing in GPU or page-locked host memory will be preallocated to accommodate a batch of samples of this size.
in_types (DALIDataType or list of DALIDataType) – Types of inputs.
ins_ndim (int or list of int) – Number of dimensions that the input shapes should have.
out_types (DALIDataType or list of DALIDataType) – Types of outputs.
outs_ndim (int or list of int) – Number of dimensions that the output shapes should have.
preserve (bool, optional, default = False) – Prevents the operator from being removed from the graph even if its outputs are not used.
run_fn (object) – Function to be invoked. This function must work in Numba nopython mode.
setup_fn (object, optional) – Setup function setting shapes for outputs. This function is invoked once per batch and must also work in Numba nopython mode.
threads_per_block (int or list of int, optional) –
3-item list specifying the number of threads per block used to execute a CUDA kernel.
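
In a CUDA launch, the grid size and block size together determine how many threads run, so blocks and threads_per_block are usually chosen as a pair. The following sketch shows that arithmetic, assuming one thread per element. The helper blocks_for and the sizes are hypothetical, and how the three axes map onto tensor dimensions depends on how the run function indexes its inputs:

```python
def blocks_for(extent, threads_per_block):
    # Ceiling division per axis, so that blocks * threads covers every element.
    return [(n + t - 1) // t for n, t in zip(extent, threads_per_block)]


# Hypothetical problem size along the three launch axes.
extent = [1080, 1920, 1]
threads_per_block = [32, 16, 1]  # 512 threads per block in total
blocks = blocks_for(extent, threads_per_block)

print(blocks)  # [34, 120, 1]
```

Because the division rounds up, some threads in the last block of an axis may fall outside the data, so a kernel launched this way would need a bounds check.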