Post Installation Configuration#
To avoid issues when copying configuration file content from the documentation, download the archive from the Holoscan for Media Platform Setup resource in the NGC catalog, linked from the Holoscan for Media collection.
Worker Node Tuning#
Create MachineConfigPool#
Skip this section if you are on a compact (3-node) cluster or SNO. This step is specific to 5-node clusters because a new MCP is created, which inherits from the worker MCP. In a 3-node cluster or SNO, all configuration is applied to the master MCP itself.
Label all worker nodes with a custom label, for example `node-role.kubernetes.io/holoscanmedia`:

```shell
oc label node <worker-node> node-role.kubernetes.io/holoscanmedia=
```
Create `holoscanmedia_mcp.yaml` with the following content:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  labels:
    holoscanmedia: ''
    machineconfiguration.openshift.io/role: holoscanmedia
  name: holoscanmedia
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values:
          - worker
          - holoscanmedia
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/holoscanmedia: ''
```
Apply the machine config pool:
```shell
oc create -f holoscanmedia_mcp.yaml
```
Create Performance Profile#
A performance profile is used to tune the cluster nodes for high-performance, low-latency workloads.
Create `holoscanmedia_profile.yaml` based on the following template:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: irq-performanceprofile
spec:
  globallyDisableIrqLoadBalancing: true
  additionalKernelArgs:
    - nmi_watchdog=0
    - audit=0
    - mce=ignore_ce
    - processor.max_cstate=0
    - idle=poll
    - intel_idle.max_cstate=0
    - intel_pstate=disable
    - nosmt
    - nosoftlockup
    - cpufreq.default_governor=performance
    - module_blacklist=irdma
    - nokaslr
  cpu:
    isolated: 8-63
    reserved: 0-7
  hugepages:
    defaultHugepagesSize: 2M
    pages:
      - count: 32768
        size: 2M
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/<machine-config-pool>: ''
  realTimeKernel:
    enabled: false
```
The additional kernel arguments are derived from the Rivermax Linux Performance Tuning Guide (or find the latest version at Rivermax Getting Started).
Replace `<machine-config-pool>` with `holoscanmedia` for a 5-node cluster and `master` for a 3-node cluster or SNO.

Reserve 4–8 CPU cores for the system and isolate the remaining CPUs for application pods. Modify the `isolated` and `reserved` fields in the above file accordingly.

NUMA-aware scheduling is required for maximum performance. However, for servers with fewer network adapters or GPUs, or an unbalanced topology, add `numa=off` to the `additionalKernelArgs` list and remove the `numa` section to enable the utilization of available resources.
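The `reserved` and `isolated` ranges must together cover every CPU on the node exactly once. As a rough sketch (the `cpu_ranges` helper is mine, not part of the product docs), the two ranges can be derived from the node's total CPU count and the number of reserved cores, assuming reserved cores are taken from the low end as in the template above:

```shell
#!/bin/sh
# Hypothetical helper: derive contiguous reserved/isolated CPU ranges
# for the PerformanceProfile. Assumes reserved cores are 0..reserved-1,
# matching the reserved: 0-7 / isolated: 8-63 example in the template.
cpu_ranges() {
  total="$1"     # total CPUs on the node, e.g. from `nproc`
  reserved="$2"  # keep 4-8 cores for the system
  echo "reserved: 0-$((reserved - 1))"
  echo "isolated: ${reserved}-$((total - 1))"
}

cpu_ranges 64 8
# prints:
#   reserved: 0-7
#   isolated: 8-63
```

Paste the resulting ranges into the `reserved` and `isolated` fields of the profile.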
Important
After configuring the Topology Manager `single-numa-node` policy, using the `default-scheduler` for pods that request NUMA-aligned resources can cause runaway pod creation errors (`ContainerStatusUnknown`). Follow the steps to install and use the NUMA-aware scheduler.

Apply the performance profile:

```shell
oc create -f holoscanmedia_profile.yaml
```
This reboots the worker nodes one by one.
You can monitor the status using the following command. Wait for the `UPDATED` column to become `True`. This step can take 30–45 minutes to complete.

```shell
oc get mcp
```

```
NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
holoscanmedia   rendered-holoscanmedia-7e2a6af0d84770766c9246dffdf55a40   False     True       False      2              0                   0                     0                      100m
master          rendered-master-fd62216b1239cd90a6a3cd7c3b9e213d          True      False      False      3              3                   3                     0                      4h
worker          rendered-worker-7e2a6af0d84770766c9246dffdf55a40          True      False      False      0              0                   0                     0                      4h
```
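Rather than polling `oc get mcp` by hand, the wait can be scripted. The helper below is a generic sketch (the `wait_for_true` name and the stubbed usage are mine, not from the docs); in practice you would point it at the MCP's `Updated` condition:

```shell
#!/bin/sh
# Generic poll-until-True helper (hypothetical; not part of the product docs).
# Runs the given command repeatedly until it prints "True" or attempts run out.
wait_for_true() {
  cmd="$1"; tries="${2:-90}"; delay="${3:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$(eval "$cmd")" = "True" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example real usage from the jump node (assumption: this jsonpath matches
# your oc version's condition layout):
#   wait_for_true 'oc get mcp holoscanmedia -o jsonpath="{.status.conditions[?(@.type==\"Updated\")].status}"'
```

Recent `oc` versions should also support waiting directly, e.g. `oc wait mcp/holoscanmedia --for=condition=Updated --timeout=45m`, though the polling loop gives you more control over logging between attempts.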
After the servers come back online, check the hugepages allocation from the jump node using the following command:

```shell
oc get node <worker-node> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
```

```
64Gi
```
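As a quick sanity check (my arithmetic, not from the docs), the allocatable value should equal the page count times the page size requested in the performance profile:

```shell
#!/bin/sh
# Cross-check: 32768 pages x 2 MiB per page should match the 64Gi
# reported under .status.allocatable.hugepages-2Mi.
pages=32768        # count from the PerformanceProfile hugepages section
page_size_mib=2    # defaultHugepagesSize: 2M
total_gib=$(( pages * page_size_mib / 1024 ))
echo "${total_gib}Gi"
# prints: 64Gi
```

If the reported value is lower, a node may not have enough free contiguous memory at boot, or the profile has not finished rolling out.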
Hardware Tuning#
Create `holoscanmedia_tuned.yaml` based on the following template:

```yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: holoscanmedia-tuning-profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
    - data: |
        [main]
        summary=Boot time configuration for Holoscan for Media
        include=openshift-node-performance-irq-performanceprofile
        [sysctl]
        vm.swappiness=0
        vm.zone_reclaim_mode=0
        kernel.numa_balancing=0
        kernel.sched_rt_runtime_us=-1
        [service]
        service.irqbalance=stop,disable
        [vm]
        transparent_hugepages=never
      name: openshift-node-holoscanmedia
  recommend:
    - machineConfigLabels:
        machineconfiguration.openshift.io/role: "<machine-config-pool>"
      priority: 10
      profile: openshift-node-holoscanmedia
```
The additional settings in this profile are derived from the Rivermax Linux Performance Tuning Guide (or find the latest version at Rivermax Getting Started).
Replace `<machine-config-pool>` with `holoscanmedia` for a 5-node cluster and `master` for a 3-node cluster or SNO.
Apply the TuneD profile:
```shell
oc create -f holoscanmedia_tuned.yaml
```
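To confirm the TuneD profile took effect, you can compare a node's runtime sysctls against the `[sysctl]` values in the profile. The snippet below is a local sketch only: the `actual` variable stands in for captured node output, which in practice you would fetch from the node (for example via `oc debug node/<worker-node> -- chroot /host sysctl vm.swappiness`, an assumed invocation):

```shell
#!/bin/sh
# Sketch: verify the [sysctl] values from the TuneD profile.
# `actual` is a sample stand-in; replace with real captured sysctl output.
actual="vm.swappiness = 0
vm.zone_reclaim_mode = 0
kernel.numa_balancing = 0
kernel.sched_rt_runtime_us = -1"

mismatches=0
for kv in "vm.swappiness = 0" "vm.zone_reclaim_mode = 0" \
          "kernel.numa_balancing = 0" "kernel.sched_rt_runtime_us = -1"; do
  case "$actual" in
    *"$kv"*) ;;                                     # value present, as expected
    *) echo "MISMATCH: $kv"; mismatches=$((mismatches + 1)) ;;
  esac
done
echo "mismatches: $mismatches"
# prints: mismatches: 0
```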
Disable Chronyd Service#
Create `disable_chronyd.yaml` based on the following template. Replace `<role>` with `worker` for a 5-node cluster and `master` for a 3-node cluster or SNO:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: <role>
  name: disable-chronyd
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - contents: |
            [Unit]
            Description=NTP client/server
            Documentation=man:chronyd(8) man:chrony.conf(5)
            After=ntpdate.service sntp.service ntpd.service
            Conflicts=ntpd.service systemd-timesyncd.service
            ConditionCapability=CAP_SYS_TIME
            [Service]
            Type=forking
            PIDFile=/run/chrony/chronyd.pid
            EnvironmentFile=-/etc/sysconfig/chronyd
            ExecStart=/usr/sbin/chronyd $OPTIONS
            ExecStartPost=/usr/libexec/chrony-helper update-daemon
            PrivateTmp=yes
            ProtectHome=yes
            ProtectSystem=full
            [Install]
            WantedBy=multi-user.target
          enabled: false
          name: "chronyd.service"
```
Apply the machine config:
```shell
oc create -f disable_chronyd.yaml
```
This command reboots the worker nodes. Wait 15–20 minutes for the nodes to come back up before proceeding.
Ensure `READYMACHINECOUNT` shows the total count of worker nodes:

```shell
oc get mcp
```

```
NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
holoscanmedia   rendered-holoscanmedia-7e2a6af0d84770766c9246dffdf55a40   False     True       False      2              0                   0                     0                      100m
master          rendered-master-fd62216b1239cd90a6a3cd7c3b9e213d          True      False      False      3              3                   3                     0                      4h
worker          rendered-worker-7e2a6af0d84770766c9246dffdf55a40          True      False      False      0              0                   0                     0                      4h
```
Disable PCIe ACS#
PCIe Access Control Services (ACS) must be disabled to maximize achievable bandwidth.
Create `disable_acs.yaml` based on the following template. Replace `<role>` with `worker` for a 5-node cluster and `master` for a 3-node cluster or SNO:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: <role>
  name: disable-acs
spec:
  config:
    ignition:
      version: 3.4.0
    storage:
      files:
        - path: /usr/local/bin/disable-acs.sh
          contents:
            source: data:;charset=utf-8;base64,IyEvYmluL2Jhc2gKIyBtdXN0IGJlIHJvb3QgdG8gYWNjZXNzIGV4dGVuZGVkIFBDSSBjb25maWcgc3BhY2UKZWNobyAiU3RhcnRpbmcgdGhlIHNjcmlwdCIKaWYgWyAiJEVVSUQiIC1uZSAwIF07IHRoZW4KICBlY2hvICJFUlJPUjogJDAgbXVzdCBiZSBydW4gYXMgcm9vdCIKICBleGl0IDEKZmkKZm9yIEJERiBpbiBgbHNwY2kgLWQgIio6KjoqIiB8IGF3ayAne3ByaW50ICQxfSdgOyBkbwogICAgIyBza2lwIGlmIGl0IGRvZXNuJ3Qgc3VwcG9ydCBBQ1MKICAgIHNldHBjaSAtdiAtcyAke0JERn0gRUNBUF9BQ1MrMHg2LncgPiAvZGV2L251bGwgMj4mMQogICAgaWYgWyAkPyAtbmUgMCBdOyB0aGVuCiAgICAgICAgICAgICNlY2hvICIke0JERn0gZG9lcyBub3Qgc3VwcG9ydCBBQ1MsIHNraXBwaW5nIgogICAgICAgICAgICBjb250aW51ZQogICAgZmkKICAgIGxvZ2dlciAiRGlzYWJsaW5nIEFDUyBvbiBgbHNwY2kgLXMgJHtCREZ9YCIKICAgIHNldHBjaSAtdiAtcyAke0JERn0gRUNBUF9BQ1MrMHg2Lnc9MDAwMAogICAgaWYgWyAkPyAtbmUgMCBdOyB0aGVuCiAgICAgICAgbG9nZ2VyICJFcnJvciBkaXNhYmxpbmcgQUNTIG9uICR7QkRGfSIKICAgICAgICAgICAgY29udGludWUKICAgIGZpCiAgICBORVdfVkFMPWBzZXRwY2kgLXYgLXMgJHtCREZ9IEVDQVBfQUNTKzB4Ni53IHwgYXdrICd7cHJpbnQgJE5GfSdgCiAgICBpZiBbICIke05FV19WQUx9IiAhPSAiMDAwMCIgXTsgdGhlbgogICAgICAgIGxvZ2dlciAiRmFpbGVkIHRvIGRpc2FibGUgQUNTIG9uICR7QkRGfSIKICAgICAgICAgICAgY29udGludWUKICAgIGZpCmRvbmUKZXhpdCAwCg==
          filesystem: root
          mode: 0755
    systemd:
      units:
        - name: disable-acs.service
          enabled: true
          contents: |
            [Unit]
            Description=Run to disable pci acs
            After=network-online.target
            [Service]
            Type=oneshot
            ExecStart=/usr/local/bin/disable-acs.sh
            RemainAfterExit=yes
            [Install]
            WantedBy=multi-user.target
```
The Base64-encoded data in the YAML above decodes to the following script:

```bash
#!/bin/bash
# must be root to access extended PCI config space
echo "Starting the script"
if [ "$EUID" -ne 0 ]; then
  echo "ERROR: $0 must be run as root"
  exit 1
fi
for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
    # skip if it doesn't support ACS
    setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        #echo "${BDF} does not support ACS, skipping"
        continue
    fi
    logger "Disabling ACS on `lspci -s ${BDF}`"
    setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
    if [ $? -ne 0 ]; then
        logger "Error disabling ACS on ${BDF}"
        continue
    fi
    NEW_VAL=`setpci -v -s ${BDF} ECAP_ACS+0x6.w | awk '{print $NF}'`
    if [ "${NEW_VAL}" != "0000" ]; then
        logger "Failed to disable ACS on ${BDF}"
        continue
    fi
done
exit 0
```
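If you modify the script, the `source:` field in the MachineConfig must be regenerated to match. A minimal round-trip sketch, assuming GNU coreutils `base64` (the `demo.sh` file below is a tiny stand-in for `disable-acs.sh`):

```shell
#!/bin/sh
# Round-trip sketch: encode a script into an Ignition data URL and verify
# it decodes back unchanged. demo.sh is a stand-in for disable-acs.sh.
cat > /tmp/demo.sh <<'EOF'
#!/bin/bash
echo "Starting the script"
EOF

payload="$(base64 -w0 /tmp/demo.sh)"      # -w0: no line wrapping (GNU base64)
url="data:;charset=utf-8;base64,${payload}"

# Decode the payload part back and compare with the original file.
printf '%s' "${url#data:;charset=utf-8;base64,}" | base64 -d > /tmp/demo.decoded.sh
cmp -s /tmp/demo.sh /tmp/demo.decoded.sh && echo "round-trip OK"
```

Paste the resulting `data:;charset=utf-8;base64,...` string into the `contents.source` field of the MachineConfig.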
Apply the machine config:
```shell
oc create -f disable_acs.yaml
```
This command reboots the worker nodes. Wait 15–20 minutes for the nodes to come back up before proceeding.
Ensure `READYMACHINECOUNT` shows the total count of worker nodes:

```shell
oc get mcp
```

```
NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
holoscanmedia   rendered-holoscanmedia-7e2a6af0d84770766c9246dffdf55a40   False     True       False      2              0                   0                     0                      100m
master          rendered-master-fd62216b1239cd90a6a3cd7c3b9e213d          True      False      False      3              3                   3                     0                      4h
worker          rendered-worker-7e2a6af0d84770766c9246dffdf55a40          True      False      False      0              0                   0                     0                      4h
```