The latency measurement tool operates by having a producer component generate a sequence of known video frames that are output and then transferred back to an input consumer component using a physical loopback cable. Timestamps are compared throughout the life of the frame to measure the overall latency that the frame sees during this process, and these results are summarized when all of the frames have been received and the measurement completes. See Producers, Consumers, and Example Configurations for more details.

Each frame that is generated by the tool goes through the following steps in order, each of which has its time measured and then reported when all frames complete.

Fig. 27 Latency Tool Frame Lifespan (RDMA Disabled)

1. CUDA Processing
   To simulate a real-world GPU workload, the tool first runs a CUDA kernel for a user-specified number of loops (defaults to zero). This step is described below in Simulating GPU Workload.

2. Render on GPU
   After optionally simulating a GPU workload, every producer then generates its frames using the GPU, either by a common CUDA kernel or by another method that is available to the producer’s API (such as the OpenGL producer). This step is expected to be very fast (<100us), but higher times may be seen if overall system load is high.

3. Copy To Host
   Once the frame has been generated on the GPU, it may be necessary to copy the frame to host memory in order for the frame to be output by the producer component (for example, an AJA producer with RDMA disabled). If a host copy is not required (i.e. RDMA is enabled for the producer), this time should be zero.

4. Write to HW
   Some producer components require frames to be copied to peripheral memory before they can be output (for example, an AJA producer requires frames to be copied to the external frame stores on the AJA device). This copy may originate from host memory if RDMA is disabled for the producer, or from GPU memory if RDMA is enabled. If this copy is not required, e.g. the producer outputs directly from the GPU, this time should be zero.

5. VSync Wait
   Once the frame is ready to be output, the producer hardware must wait for the next VSync interval before the frame can be output. The sum of this VSync wait and all of the preceding steps is expected to be near a multiple of the frame interval. For example, if the frame rate is 60Hz then the sum of the times for steps 1 through 5 should be near a multiple of 16666us.

6. Wire Time
   The wire time is the amount of time that it takes for the frame to transfer across the physical loopback cable. This should be near the time for a single frame interval.

7. Read From HW
   Once the frame has been transferred across the wire and is available to the consumer, some consumer components require frames to be copied from peripheral memory into host (RDMA disabled) or GPU (RDMA enabled) memory. For example, an AJA consumer requires frames to be copied from the external frame store of the AJA device. If this copy is not required, e.g. the consumer component writes received frames directly to host/GPU memory, this time should be zero.

8. Copy to GPU
   If the consumer received the frame into host memory, the final step required for processing the frame with the GPU is to copy the frame into GPU memory. If RDMA is enabled for the consumer and the frame was previously written directly to GPU memory, this time should be zero.
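The VSync relationship described in step 5 can be sketched as follows. This is an illustrative model only, not code taken from the tool: the function name `vsync_wait_us` and the sample 5000us pre-VSync time are hypothetical, the frame's clock is assumed to start on a VSync boundary, and the frame interval uses the 16666us value quoted above for 60Hz.

```python
import math

FRAME_INTERVAL_US = 16666  # 60Hz frame interval, as quoted in the text


def vsync_wait_us(pre_vsync_us: int, interval_us: int = FRAME_INTERVAL_US) -> int:
    """Return the time remaining until the next VSync boundary.

    pre_vsync_us is the sum of steps 1-4 (CUDA processing, render,
    copy to host, write to HW). Assumes timing starts on a VSync boundary.
    """
    next_boundary = math.ceil(pre_vsync_us / interval_us) * interval_us
    return next_boundary - pre_vsync_us


# Hypothetical example: steps 1-4 together take 5000us.
pre_vsync = 5000
wait = vsync_wait_us(pre_vsync)
# In this idealized model the sum of steps 1-5 is an exact multiple of the interval.
print(wait, pre_vsync + wait)
```

In real measurements the sum is only near a multiple of the frame interval, since the start of the frame is not perfectly aligned to VSync.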

Note that if RDMA is enabled on both the producer and consumer sides, then the GPU/host copy steps above (step 3, Copy To Host, and step 8, Copy to GPU) are effectively removed, since RDMA copies directly between the video HW and the GPU. The following shows the same diagram as above but with RDMA enabled for both the producer and consumer.

Fig. 28 Latency Tool Frame Lifespan (RDMA Enabled)

The following shows example output of the above measurements from the tool when testing a 4K stream at 60Hz from an AJA producer to an AJA consumer, both with RDMA disabled, and no GPU/CUDA workload simulation. Note that all time values are given in microseconds.

$ ./loopback-latency -p aja -p.rdma 0 -c aja -c.rdma 0 -f 4k





While this tool measures the producer times followed by the consumer times, the expectation for real-world video processing applications is that this order would be reversed. That is to say, the expectation for a real-world application is that it would capture, process, and output frames in the following order (with the component responsible for measuring that time within this tool given in parentheses):

1. Read from HW (consumer)
2. Copy to GPU (consumer)
3. Process Frame (producer)
4. Render Results to GPU (producer)
5. Copy to Host (producer)
6. Write to HW (producer)

Fig. 29 Real Application Frame Lifespan

To illustrate this, the tool sums and displays the total producer and consumer times, then provides the Estimated Application Times as the total sum of all of these steps (i.e. steps 1 through 6, above).

(continued from above)





Once a real-world application captures, processes, and outputs a frame, the final output must still wait for the next VSync interval before it is actually sent across the physical wire to the display hardware. Using this assumption, the tool then estimates one final value for the Final Estimated Latencies by doing the following:

1. Take the Estimated Application Time (from above)
2. Round it up to the next VSync interval
3. Add the physical wire time (i.e. a frame interval)

Fig. 30 Final Estimated Latency with VSync and Physical Wire Time

Continuing this example using a frame interval of 16666us (60Hz), this means that the average Final Estimated Latency is determined by:

1. Average application time = 26772
2. Round up to next VSync interval = 33332
3. Add physical wire time (+16666) = 49998
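The same arithmetic can be written as a short sketch. This is not the tool's implementation; the function name `final_estimated_latency_us` is hypothetical, and the inputs are the 26772us average application time and 16666us frame interval from the example above.

```python
import math


def final_estimated_latency_us(app_time_us: float, frame_interval_us: int) -> int:
    """Round the application time up to the next VSync, then add the wire time."""
    # Step 2: round up to the next VSync interval.
    vsync_aligned = math.ceil(app_time_us / frame_interval_us) * frame_interval_us
    # Step 3: add the physical wire time (one frame interval).
    return vsync_aligned + frame_interval_us


FRAME_INTERVAL_US = 16666  # 60Hz

latency_us = final_estimated_latency_us(26772, FRAME_INTERVAL_US)
latency_frames = latency_us / FRAME_INTERVAL_US
print(latency_us, latency_frames)  # 49998 3.0
```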

These times are also reported as a multiple of frame intervals.

(continued from above)





Using this example, we should then expect that the total end-to-end latency that is seen by running this pipeline using these components and configuration is 3 frame intervals (49998us).

The previous example uses an AJA producer and consumer for a 4K @ 60Hz stream, but with RDMA disabled for both components. Because of this, the additional copies between the GPU and host memory added more than 10000us of latency to the pipeline, causing the application to exceed one frame interval of processing time per frame and resulting in a total latency of 3 frames. If RDMA is enabled, these GPU and host copies can be avoided, so the processing latency is reduced by more than 10000us. More importantly, this also allows the total processing time to fit within a single frame interval, so that the total end-to-end latency can be reduced to just 2 frames.

Fig. 31 Reducing Latency With RDMA

The following shows the above example repeated with RDMA enabled.

$ ./loopback-latency -p aja -p.rdma 1 -c aja -c.rdma 1 -f 4k





By default the tool measures what is essentially a pass-through video pipeline; that is, no processing of the video frames is performed by the system. While this is useful for measuring the minimum latency that can be achieved by the video input and output components, it’s not very indicative of a real-world use case in which the GPU is used for compute-intensive processing operations on the video frames between the input and output — for example, an object detection algorithm that applies an overlay to the output frames.

While it may be relatively simple to measure the runtime latency of the processing algorithms that are to be applied to the video frames (by measuring the runtime of the algorithm on a single frame or a stream of frames), this may not be indicative of the effects that such processing might have on the overall system load, which may further increase the latency of the video input and output components.

In order to estimate the total latency when an additional GPU workload is added to the system, the latency tool has an -s {count} option that can be used to run an arbitrary CUDA loop the specified number of times before the producer actually generates a frame. The expected usage for this option is as follows:

1. The per-frame runtime of the actual GPU processing algorithm is measured outside of the latency measurement tool.

2. The latency tool is repeatedly run with just the -s {count} option, adjusting the {count} parameter until the time that it takes to run the simulated loop approximately matches the actual processing time that was measured in the previous step.

   $ ./loopback-latency -s 2000

3. The latency tool is run with the full producer (-p) and consumer (-c) options used for the video I/O, along with the -s {count} option using the loop count that was determined in the previous step.

   Note: The following example shows that approximately half of the frames received by the consumer were duplicate/repeated frames. This is due to the fact that the additional processing latency of the producer causes it to exceed a single frame interval, and so the producer is only able to output a new frame every second frame interval.

   $ ./loopback-latency -p aja -c aja -s 2000
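The duplicate-frame effect described in the note can be approximated with a simple model. This is a sketch under stated assumptions, not the tool's logic: it assumes the producer can only output a new frame on a VSync boundary and repeats the previous frame otherwise. The function name `duplicate_frame_fraction` and the sample producer times are hypothetical.

```python
import math


def duplicate_frame_fraction(producer_time_us: float, frame_interval_us: int) -> float:
    """Estimate the fraction of received frames that are repeats.

    Assumes a new frame is output once every N frame intervals, where N is
    the producer time rounded up to whole intervals; the other N - 1
    intervals repeat the previous frame.
    """
    intervals_per_new_frame = max(1, math.ceil(producer_time_us / frame_interval_us))
    return 1.0 - 1.0 / intervals_per_new_frame


FRAME_INTERVAL_US = 16666  # 60Hz

# Hypothetical producer time just over one frame interval: half the frames repeat.
print(duplicate_frame_fraction(20000, FRAME_INTERVAL_US))  # 0.5
# Hypothetical producer time within one frame interval: no repeats.
print(duplicate_frame_fraction(10000, FRAME_INTERVAL_US))  # 0.0
```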

