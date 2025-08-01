In many cases, users will only need to use the receive_cuda_stream ( C++ / Python ) method provided by InputContext in their compute method. This is because the method automatically manages multiple aspects of stream handling:

1. It automatically synchronizes any streams found on the named input port to the operator’s internal CUDA stream.
   - The first time compute is called, an operator’s internal CUDA stream would be allocated from the assigned CudaStreamPool. The same stream is then reused on all subsequent compute calls.
   - There is a boolean flag which can also force synchronization to the default stream (false by default).

2. It returns the cudaStream_t corresponding to the operator’s internal stream.
   - The user should use this returned stream for any kernels or memory copy operations to be run on a non-default stream.

3. It sets the CUDA device corresponding to the stream returned in step 2 as the active CUDA device.

4. It automatically configures all output ports to emit the stream returned by step 2 as a component in each message sent.
   - This ID will allow downstream operators to know what stream was used for any data received in this message.

Attention: Please ensure that, for a given input port, receive is always called before receive_cuda_stream. This is necessary because the receive call is what actually receives the messages and allows the operator to know about any stream IDs found in messages on the input port. The receive method only records information internally about any streams that were found. The subsequent receive_cuda_stream call is needed to perform synchronization and return the cudaStream_t to which any input streams were synchronized.
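The bookkeeping described above can be pictured with a small pure-Python toy model. This is only a sketch: ToyInputContext, its attributes, and the stream names are hypothetical stand-ins invented for illustration, not part of the Holoscan API, and no real CUDA streams or devices are involved. It assumes a stream pool is available, and it assumes that a receive_cuda_stream call made before receive simply finds no recorded streams.

```python
from typing import Dict, List, Optional, Tuple


class ToyInputContext:
    """Hypothetical stand-in for InputContext; tracks stream names, not real CUDA streams."""

    def __init__(self, queued: Dict[str, List[str]]) -> None:
        self._queued = queued                       # port -> streams attached to waiting messages
        self._found: Dict[str, List[str]] = {}      # port -> streams recorded by receive()
        self.internal_stream: Optional[str] = None  # allocated on first use, then reused
        self.sync_log: List[Tuple[str, str]] = []   # (stream, stream it was synchronized to)
        self.output_stream: Optional[str] = None    # stream emitted on every output port

    def receive(self, port: str) -> str:
        # Receiving the message is what records the stream IDs it carried.
        self._found[port] = self._queued.pop(port, [])
        return f"message from {port}"

    def receive_cuda_stream(self, port: str, sync_to_default: bool = False) -> str:
        # Allocate the internal stream on the first call, reuse it afterwards.
        if self.internal_stream is None:
            self.internal_stream = "internal-0"
        # Step 1: synchronize every stream recorded for this port.
        for stream in self._found.get(port, []):
            self.sync_log.append((stream, self.internal_stream))
        if sync_to_default:
            self.sync_log.append((self.internal_stream, "default"))
        # Step 4: output ports emit the internal stream (step 3, device selection, is not modeled).
        self.output_stream = self.internal_stream
        # Step 2: hand the internal stream back to the caller.
        return self.internal_stream


# Correct order: receive first, then receive_cuda_stream.
correct = ToyInputContext({"receiver": ["upstream-A", "upstream-B"]})
correct.receive("receiver")
stream = correct.receive_cuda_stream("receiver")

# Wrong order: no streams have been recorded yet, so nothing is synchronized.
wrong_order = ToyInputContext({"receiver": ["upstream-A"]})
wrong_order.receive_cuda_stream("receiver")
wrong_order.receive("receiver")
```

In this model the correctly ordered context ends up with both upstream streams in its sync_log, while the wrongly ordered one has an empty log, which is the failure mode the ordering rule is meant to prevent.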

Here is an example of the typical usage of this method, taken from the built-in BayerDemosaicOp:

C++

    // The code below would appear within `Operator::compute`

    // Process input message
    auto maybe_message = op_input.receive<gxf::Entity>("receiver");
    if (!maybe_message || maybe_message.value().is_null()) {
      throw std::runtime_error("No message available");
    }
    auto in_message = maybe_message.value();

    // Get the CUDA stream from the input message if present, otherwise generate one.
    // This stream will also be transmitted on the "tensor" output port.
    cudaStream_t cuda_stream = op_input.receive_cuda_stream("receiver",  // input port name
                                                            true,        // allocate
                                                            false);      // sync_to_default

    // assign the CUDA stream to the NPP stream context
    npp_stream_ctx_.hStream = cuda_stream;

Python

Note that BayerDemosaicOp is implemented in C++ using code shown in the C++ tab, but this shows how the equivalent code would look in the Python API.

    # The code below would appear within `Operator.compute`

    # Process input message
    in_message = op_input.receive("receiver")
    if in_message is None:
        raise RuntimeError("No message available")

    # Get the CUDA stream from the input message if present, otherwise generate one.
    # This stream will also be transmitted on the "tensor" output port.
    cuda_stream_ptr = op_input.receive_cuda_stream("receiver", allocate=True, sync_to_default=False)

    # can then use cuda_stream_ptr to create a `cupy.cuda.ExternalStream` context, for example

It can be seen that the call to receive occurs prior to the call to receive_cuda_stream for the “receiver” input port, as required. Also note that, unlike for the legacy CudaStreamHandler utility class, it is not required to use gxf::Entity in the “receive” call. That type is used by some built-in operators like BayerDemosaicOp as a way to support both the nvidia::gxf::VideoBuffer type and the usual Tensor type as inputs. If only Tensor were supported, we could have used receive<std::shared_ptr<Tensor>> or receive<TensorMap> instead.

The second argument to receive_cuda_stream, the boolean allocate, defaults to true and indicates that the operator should allocate its own internal stream. Setting it to false prevents the operator from allocating an internal stream from the stream pool. See the note below for details of how receive_cuda_stream behaves in that case.

There is also an optional third argument to receive_cuda_stream which is a boolean specifying whether synchronization of the input streams (and internal stream) to CUDA’s default stream should also be performed. This option is false by default.

The above description of receive_cuda_stream is accurate when a CudaStreamPool has been passed to the operator in one of the ways described above. See the note below for additional detail on how this method operates if the operator is unable to allocate an internal stream because a CudaStreamPool was unavailable.

Python applications converting between Holoscan’s Tensor and 3rd party tensor objects often use the CUDA Array Interface. This interface by default performs its own explicit synchronization (described in the interface specification). This may be unnecessary when using receive_cuda_stream, which already synchronizes streams found on the input with the operator’s internal stream. The environment variable CUPY_CUDA_ARRAY_INTERFACE_SYNC can be set to 0 to disable the additional synchronization by CuPy when creating a CUDA array from a Holoscan Tensor via the array interface. Similarly, HOLOSCAN_CUDA_ARRAY_INTERFACE_SYNC can be set to 0 to disable synchronization by the array interface on the Holoscan side when creating a Holoscan tensor from a 3rd party tensor.
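As a rough sketch of how these two settings might be combined with the stream returned by receive_cuda_stream, the snippet below sets both environment variables and wraps the conversion in a cupy.cuda.ExternalStream context. The helper name to_cupy_on_stream is hypothetical. Setting the variables before CuPy or Holoscan is imported is an assumption made here to be safe; exactly when each library reads them is not covered in this section. Disabling the extra synchronization is only appropriate when all work on the data is ordered on the returned stream.

```python
import os

# Assumption: set these before importing cupy or holoscan so the libraries see them.
os.environ["CUPY_CUDA_ARRAY_INTERFACE_SYNC"] = "0"
os.environ["HOLOSCAN_CUDA_ARRAY_INTERFACE_SYNC"] = "0"


def to_cupy_on_stream(tensor, cuda_stream_ptr):
    """Hypothetical helper: view `tensor` as a CuPy array with the operator's stream current.

    `tensor` is any object exposing the CUDA Array Interface (for example a Holoscan
    Tensor) and `cuda_stream_ptr` is the value returned by receive_cuda_stream.
    """
    import cupy as cp  # imported lazily so the environment variables above are already set

    # Make the operator's internal stream the current CuPy stream for this block.
    with cp.cuda.ExternalStream(cuda_stream_ptr):
        return cp.asarray(tensor)
```

Any CuPy kernels that operate on the returned array should also be launched inside an ExternalStream context for the same pointer, so that they run on the stream that is emitted on the output ports.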

This section describes the behavior of receive_cuda_stream in the case where no streams are available in the operator’s CudaStreamPool (or the allocate argument of receive_cuda_stream is set to false). In this case, receive_cuda_stream cannot allocate a dedicated internal stream for the operator’s own use. Instead, the cudaStream_t corresponding to the first stream found on the named input port is returned, and any additional streams on that input port are synchronized to it. If a subsequent receive_cuda_stream call is made for another input port, any streams found on that second port are synchronized to the cudaStream_t that was returned by the first receive_cuda_stream call, and the stream returned is that same cudaStream_t. In other words, the first stream found on the initial call to receive_cuda_stream is repurposed as the operator’s internal stream, to which any other input streams are synchronized. This same stream is also the one automatically emitted on the output ports.
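The selection rule in this note can be sketched as a small pure-Python simulation. The function simulate_calls and the stream names are hypothetical and exist only to illustrate the rule; nothing here calls Holoscan or CUDA. The case where no pool stream is available and no input stream is found at all is not described above, so the sketch returns None for it as a placeholder.

```python
from typing import List, Optional, Tuple


def simulate_calls(
    calls: List[Tuple[str, List[str]]],
    pool_available: bool,
    allocate: bool = True,
) -> Tuple[List[Optional[str]], List[Tuple[str, str]]]:
    """Model successive receive_cuda_stream calls within one compute call.

    `calls` lists (port name, streams found on that port) in call order.
    Returns the stream handed back by each call and the (stream, target) sync pairs.
    """
    # With a pool stream available and allocate=True the operator gets a dedicated stream.
    internal: Optional[str] = "pool-stream" if (pool_available and allocate) else None
    returned: List[Optional[str]] = []
    syncs: List[Tuple[str, str]] = []
    for _port, streams in calls:
        remaining = list(streams)
        if internal is None and remaining:
            # Fallback: the first stream found is repurposed as the internal stream.
            internal = remaining.pop(0)
        for stream in remaining:
            syncs.append((stream, internal))
        returned.append(internal)  # None only in the undescribed "no streams at all" case
    return returned, syncs


calls = [("in1", ["a", "b"]), ("in2", ["c"])]
fallback_returned, fallback_syncs = simulate_calls(calls, pool_available=False)
pooled_returned, pooled_syncs = simulate_calls(calls, pool_available=True)
```

In the fallback run, stream "a" is returned by both calls and streams "b" and "c" are synchronized to it; in the pooled run, all three input streams are synchronized to the dedicated stream instead.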