This section discusses key performance considerations when designing and optimizing Holoscan applications.
When designing Holoscan applications, it’s important to understand the relationship between operator granularity and scheduling overhead. The current GXF-based scheduler/executor incurs overhead for:
compute() method
For operators with trivial computations (e.g., basic arithmetic operations like addition, multiplication), this overhead can outweigh the actual computation time.
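To make the trade-off concrete, the sketch below models the share of each operator invocation that goes to scheduling rather than to useful work. The function name and the microsecond figures are hypothetical placeholders, not measured Holoscan values; substitute numbers from the benchmark on your own system.

```python
def scheduling_overhead_fraction(compute_us: float, overhead_us: float) -> float:
    """Fraction of one operator invocation spent on scheduler overhead."""
    if compute_us < 0 or overhead_us < 0:
        raise ValueError("durations must be non-negative")
    total_us = compute_us + overhead_us
    return overhead_us / total_us if total_us else 0.0


# Hypothetical per-invocation overhead, for illustration only.
ASSUMED_OVERHEAD_US = 20.0

for compute_us in (1.0, 20.0, 2000.0):
    fraction = scheduling_overhead_fraction(compute_us, ASSUMED_OVERHEAD_US)
    print(f"compute={compute_us:>7.1f} us -> overhead share {fraction:.1%}")
```

Under these assumed numbers, an operator doing about a microsecond of work spends most of each invocation on overhead, while one doing milliseconds of work barely notices it.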
The following measurements were obtained using the scheduler_overhead_benchmark.py script (installed to /opt/nvidia/holoscan/bin/ or available at scripts/ in the source tree). These values are illustrative and will vary depending on your hardware, OS, and Python environment—run the benchmark on your target system to obtain accurate figures:
The measurements above were obtained using the following configuration:
Script: scripts/scheduler_overhead_benchmark.py (installed to /opt/nvidia/holoscan/bin/scheduler_overhead_benchmark.py)
Command-line flags:
Timing methodology: The benchmark wraps time.perf_counter() around the entire run_app() call, which includes application initialization (constructor, compose()) and teardown—not just compute() execution. At high iteration counts (e.g., 100,000), this overhead is amortized, but for lower iteration counts the per-iteration time will appear inflated.
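The amortization effect can be sketched with simple arithmetic. The example below assumes a fixed initialization and teardown cost plus a constant per-iteration cost; both figures and the function name are hypothetical and chosen only to show the shape of the effect.

```python
def apparent_per_iteration_s(fixed_s: float, per_iter_s: float, iterations: int) -> float:
    """Per-iteration time as it appears when the whole run is timed."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return (fixed_s + per_iter_s * iterations) / iterations


# Hypothetical costs, for illustration only.
ASSUMED_FIXED_S = 0.5        # initialization + teardown
ASSUMED_PER_ITER_S = 20e-6   # true cost of one iteration

for n in (100, 10_000, 100_000):
    apparent = apparent_per_iteration_s(ASSUMED_FIXED_S, ASSUMED_PER_ITER_S, n)
    print(f"{n:>7} iterations -> {apparent * 1e6:.1f} us per iteration")
```

With these assumed values, the apparent per-iteration time approaches the true cost only as the iteration count grows, which is why low iteration counts look inflated.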
Warmup behavior: The warmup phase always runs with num_workers=2 for the event-based scheduler, regardless of the --workers argument. This ensures consistent JIT/cache warm-up but means warmup conditions may differ from the actual benchmark when testing with --workers 1 or --workers 4.
Python-specific: The event-based scheduler shows increased overhead with multiple worker threads in Python applications due to the Global Interpreter Lock (GIL) and thread synchronization. C++ pipelines do not incur GIL contention. We will update this guidance as more measurements become available.
With EventBasedScheduler, you can isolate the scheduler’s dispatcher thread with GXF_EBS_DISPATCHER_CPU_CORE=<core-id>. This is separate from the scheduler pin_cores parameter, which only affects worker threads, and can reduce jitter when the dispatcher competes with time-critical work.
To reproduce the measurements on your system, run the benchmark script:
Use --help for options such as iteration count and worker thread settings.
Pinning Holoscan worker threads to specific cores via pin_cores (or the dispatcher via GXF_EBS_DISPATCHER_CPU_CORE) only controls where those threads run. The kernel is still free to schedule unrelated user and kernel work onto the same cores, which reintroduces the jitter that real-time configuration is meant to eliminate. Removing the chosen cores from the kernel’s general scheduling pool at boot closes that gap.
For an end-to-end worked example of CPU core isolation on a real platform, see the Holohub high-performance networking tutorial, section 3.5 “Isolate CPU Cores”.
The following kernel command-line parameters work together:
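isolcpus=<list>
Removes the listed cores from the kernel’s general scheduling pool, so that only tasks explicitly placed on them run there (via pin_cores, taskset, or sched_setaffinity).
nohz_full=<list>
Stops the periodic scheduler tick on the listed cores while only one task is runnable on them. Use the same list as isolcpus, but ensure each nohz_full core hosts only one runnable RT task. Skip it if you have not observed tick-driven jitter in profiling traces; the benefit is small for workloads already dominated by GPU or I/O latency.
rcu_nocbs=<list>
Offloads RCU callback processing from the listed cores. Use the same list as isolcpus / nohz_full. This is generally cheap to enable and worth doing whenever you isolate cores.
These three parameters are typically used with the same core list.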
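As a sketch of how the three parameters share one core list, the helper below compresses a set of core IDs into the kernel’s list syntax and emits the matching arguments. The function names are hypothetical, and cores 1 through 6 are example values only; the output is a string to paste into your bootloader configuration, not something the script applies.

```python
from typing import Iterable


def format_cpu_list(cores: Iterable[int]) -> str:
    """Compress core IDs into kernel list syntax, e.g. [1, 2, 3, 5] -> '1-3,5'."""
    ordered = sorted(set(cores))
    if not ordered:
        raise ValueError("at least one core is required")
    if ordered[0] < 0:
        raise ValueError("core IDs must be non-negative")
    ranges = []
    start = prev = ordered[0]
    for core in ordered[1:]:
        if core == prev + 1:
            prev = core
            continue
        ranges.append((start, prev))
        start = prev = core
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def isolation_kernel_args(cores: Iterable[int]) -> str:
    """Build the three isolation parameters with one shared core list."""
    cpu_list = format_cpu_list(cores)
    names = ("isolcpus", "nohz_full", "rcu_nocbs")
    return " ".join(f"{name}={cpu_list}" for name in names)


# Example core set; adjust to the cores your application actually pins.
print(isolation_kernel_args(range(1, 7)))
```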
The performance governor pins each selected core to its maximum operating frequency, eliminating frequency-scaling and wake-from-idle latency that the default ondemand / schedutil governors introduce.
performance is unnecessary.
To target only specific cores:
The Scheduler Recipe Multi Branch Low Latency recipe pins the dispatcher to core 1, the priority branch worker to core 2, the other branch workers to cores 3 and 4, and the default pool to cores 5 and 6. Isolate the full set of cores the recipe uses by adding the following to the kernel command line:
On x86 hosts, this is set in /etc/default/grub (the GRUB_CMDLINE_LINUX line) followed by update-grub. On IGX and Jetson, it is set in /boot/extlinux/extlinux.conf under the APPEND line of the active boot entry. Refer to your platform’s documentation for the exact procedure and reboot requirements.
After rebooting, set the governor and launch the application with the dispatcher pinned to core 1:
Confirm the kernel sees the isolated set:
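The same check can be scripted. The sketch below reads the kernel’s isolated-core list from /sys/devices/system/cpu/isolated and compares it with the set you expect. It assumes the file contains only plain IDs and ranges (such as 1-6 or 0,2-3); the function names and the expected set are illustrative.

```python
from pathlib import Path
from typing import Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a kernel CPU list such as '0,2-3' into a set of core IDs."""
    cores: Set[int] = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return cores


def read_isolated_cores(path: str = "/sys/devices/system/cpu/isolated") -> Set[int]:
    """Return the isolated cores reported by the kernel (empty if unavailable)."""
    sysfs = Path(path)
    if not sysfs.exists():
        return set()
    return parse_cpu_list(sysfs.read_text())


# Example expectation; replace with the cores you put on the kernel command line.
expected = set(range(1, 7))
isolated = read_isolated_cores()
print("isolated cores:", sorted(isolated))
print("expected but not isolated:", sorted(expected - isolated))
```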
Confirm a running thread is pinned where you expect (replace <pid> with a worker thread ID from ps -eLo pid,tid,comm | grep my_app):
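On Linux, a thread’s affinity can also be queried from Python with os.sched_getaffinity, which accepts a thread ID in place of a process ID because the underlying system call works per thread. The helper name below is hypothetical, and the example inspects the calling thread (ID 0) rather than a Holoscan worker.

```python
import os
from typing import Iterable


def affinity_within(tid: int, allowed_cores: Iterable[int]) -> bool:
    """True if thread `tid` may only run on cores in `allowed_cores` (Linux only)."""
    return os.sched_getaffinity(tid) <= set(allowed_cores)


# 0 means "the calling thread"; pass a worker thread ID to inspect another thread.
current = os.sched_getaffinity(0)
print("current thread affinity:", sorted(current))
print("restricted to core 2 only:", affinity_within(0, {2}))
```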
Leave at least one core unisolated. The kernel, userspace shell, container runtime, and unpinned Holoscan threads all need somewhere to run. Isolating every core (e.g. isolcpus=0-N on an N+1-core system) starves the kernel of a general-purpose CPU and can hang the system. Reserve core 0 (and ideally one more) for the OS.
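A small pre-flight check can catch an isolation plan that would starve the OS before it reaches the bootloader. This is a sketch under the assumption that core 0 is the one reserved for housekeeping, as recommended above; the function name and the 8-core example are hypothetical.

```python
from typing import Iterable, Set


def housekeeping_cores(isolated: Iterable[int], total_cpus: int) -> Set[int]:
    """Return the cores left for the OS; raise if the plan starves the kernel."""
    isolated_set = set(isolated)
    all_cores = set(range(total_cpus))
    unknown = isolated_set - all_cores
    if unknown:
        raise ValueError(f"cores not present on this system: {sorted(unknown)}")
    remaining = all_cores - isolated_set
    if 0 not in remaining:
        raise ValueError("core 0 must stay unisolated for the OS")
    return remaining


# Example: isolate cores 1-6 on an 8-core system, leaving 0 and 7 for the OS.
print(sorted(housekeeping_cores(range(1, 7), total_cpus=8)))
```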
nohz_full requires CONFIG_NO_HZ_FULL=y in the kernel, which is set on most stock distributions but not all. If isolation is configured but jitter persists, check cat /sys/devices/system/cpu/nohz_full — an empty result means the option was ignored.