Post Installation Configuration#

To avoid copy-and-paste issues with the configuration files in this documentation, download the archive from the Holoscan for Media Platform Setup resource in the NGC catalog, linked from the Holoscan for Media collection.

Worker Node Tuning#

Create MachineConfigPool#

Skip this section if you are on a compact (3-node) cluster or single-node OpenShift (SNO). This step is specific to 5-node clusters, where a new MachineConfigPool (MCP) is created that inherits from the worker MCP. In a 3-node cluster or SNO, all configuration is applied directly to the master MCP.

  1. Label all worker nodes with a custom label. For example, node-role.kubernetes.io/holoscanmedia=

    oc label node <worker-node> node-role.kubernetes.io/holoscanmedia=
    
  2. Create holoscanmedia_mcp.yaml with the following content:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
        labels:
            holoscanmedia: ''
            machineconfiguration.openshift.io/role: holoscanmedia
        name: holoscanmedia
    spec:
        machineConfigSelector:
            matchExpressions:
            -   key: machineconfiguration.openshift.io/role
                operator: In
                values:
                - worker
                - holoscanmedia
        nodeSelector:
            matchLabels:
                node-role.kubernetes.io/holoscanmedia: ''
    
  3. Apply the machine config pool:

    oc create -f holoscanmedia_mcp.yaml
    

Create Performance Profile#

A PerformanceProfile tunes the cluster nodes for high-performance, low-latency workloads.

  1. Create holoscanmedia_profile.yaml based on the following template:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
        name: irq-performanceprofile
    spec:
        globallyDisableIrqLoadBalancing: true
        additionalKernelArgs:
        - nmi_watchdog=0
        - audit=0
        - mce=ignore_ce
        - processor.max_cstate=0
        - idle=poll
        - intel_idle.max_cstate=0
        - intel_pstate=disable
        - nosmt
        - nosoftlockup
        - cpufreq.default_governor=performance
        - module_blacklist=irdma
        - nokaslr
        cpu:
            isolated: 8-63
            reserved: 0-7
        hugepages:
            defaultHugepagesSize: 2M
            pages:
            - count: 32768
              size: 2M
        numa:
            topologyPolicy: single-numa-node
        nodeSelector:
            node-role.kubernetes.io/<machine-config-pool>: ''
        realTimeKernel:
            enabled: false
    
    • The additional kernel arguments are derived from the Rivermax Linux Performance Tuning Guide; the latest version is linked from the Rivermax Getting Started page.

    • Replace <machine-config-pool> with holoscanmedia for a 5-node cluster and master for a 3-node cluster or SNO.

    • Reserve 4–8 CPU cores for the system and isolate the rest of the CPUs for application pods. Modify the isolated and reserved fields in the above file appropriately.

    • NUMA-aware scheduling is required for maximum performance. However, on servers with fewer network adapters or GPUs, or with an unbalanced topology, add numa=off to additionalKernelArgs and remove the numa section so that all available resources can be used.

    Important

    After configuring the Topology Manager single-numa-node policy, using the default-scheduler for pods that request NUMA-aligned resources can cause runaway pod creation errors (ContainerStatusUnknown). Follow the steps to install and use the NUMA-aware scheduler.
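As a quick sanity check on the hugepages request in the template above: 32768 pages of 2M each come out to 64Gi per node, which is the allocatable value verified after the reboot. A minimal sketch of the arithmetic:

```shell
# Sanity check: total hugepages memory requested by the PerformanceProfile.
pages=32768   # .spec.hugepages.pages[0].count
page_mib=2    # 2M default hugepage size
total_gib=$(( pages * page_mib / 1024 ))
echo "${total_gib}Gi"   # prints 64Gi, the expected per-node allocatable
```

Adjust the count if you change the isolated/reserved split or the workload's memory needs.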

  2. Apply the performance profile:

    oc create -f holoscanmedia_profile.yaml
    

    This reboots the nodes in the targeted pool one by one.

  3. You can monitor the status using the following command. Wait for the UPDATED column to become True. This step can take 30–45 minutes to complete.

    oc get mcp
    NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    holoscanmedia   rendered-holoscanmedia-7e2a6af0d84770766c9246dffdf55a40   False     True       False      2              0                   0                     0                      100m
    master          rendered-master-fd62216b1239cd90a6a3cd7c3b9e213d          True      False      False      3              3                   3                     0                      4h
    worker          rendered-worker-7e2a6af0d84770766c9246dffdf55a40          True      False      False      0              0                   0                     0                      4h
    
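To check the UPDATED column non-interactively, the output can be filtered with awk. The sketch below uses a canned line mirroring the format above (pool name and rendered-config hash are illustrative); in practice, pipe live oc get mcp output through the same awk expression:

```shell
# Hedged sketch: extract the UPDATED column (3rd field) for one pool.
# Canned sample; in practice: oc get mcp | awk '$1=="holoscanmedia"{print $3}'
line='holoscanmedia   rendered-holoscanmedia-7e2a6af0   True   False   False   2   2   2   0   100m'
updated=$(echo "$line" | awk '{ print $3 }')
echo "UPDATED=$updated"
```

Alternatively, oc wait mcp/holoscanmedia --for=condition=Updated --timeout=45m should block until the pool reports Updated, since MachineConfigPools expose an Updated condition.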
  4. After the servers come back online, check the hugepages allocation from the jump node using the following command:

    oc get node <worker-node> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
    64Gi
    

Hardware Tuning#

  1. Create holoscanmedia_tuned.yaml based on the following template:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
        name: holoscanmedia-tuning-profile
        namespace: openshift-cluster-node-tuning-operator
    spec:
        profile:
        - data: |
            [main]
            summary=Boot time configuration for Holoscan for Media
            include=openshift-node-performance-irq-performanceprofile
            [sysctl]
            vm.swappiness=0
            vm.zone_reclaim_mode=0
            kernel.numa_balancing=0
            kernel.sched_rt_runtime_us=-1
            [service]
            service.irqbalance=stop,disable
            [vm]
            transparent_hugepages=never
          name: openshift-node-holoscanmedia
    recommend:
    - machineConfigLabels:
        machineconfiguration.openshift.io/role: "<machine-config-pool>"
      priority: 10
      profile: openshift-node-holoscanmedia
    
  2. Apply the TuneD profile:

    oc create -f holoscanmedia_tuned.yaml
    

Disable Chronyd Service#

  1. Create disable_chronyd.yaml based on the following template. Replace <role> with worker for a 5-node cluster and master for a 3-node cluster or SNO:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
        labels:
            machineconfiguration.openshift.io/role: <role>
        name: disable-chronyd
    spec:
        config:
            ignition:
                version: 3.2.0
            systemd:
                units:
                - contents: |
                    [Unit]
                    Description=NTP client/server
                    Documentation=man:chronyd(8) man:chrony.conf(5)
                    After=ntpdate.service sntp.service ntpd.service
                    Conflicts=ntpd.service systemd-timesyncd.service
                    ConditionCapability=CAP_SYS_TIME
                    [Service]
                    Type=forking
                    PIDFile=/run/chrony/chronyd.pid
                    EnvironmentFile=-/etc/sysconfig/chronyd
                    ExecStart=/usr/sbin/chronyd $OPTIONS
                    ExecStartPost=/usr/libexec/chrony-helper update-daemon
                    PrivateTmp=yes
                    ProtectHome=yes
                    ProtectSystem=full
                    [Install]
                    WantedBy=multi-user.target
                  enabled: false
                  name: "chronyd.service"
    
  2. Apply the machine config:

    oc create -f disable_chronyd.yaml
    

    This command reboots the nodes in the targeted pool. Wait 15–20 minutes for them to come back up before proceeding.

  3. Ensure that READYMACHINECOUNT for the pool matches the total number of nodes in it (the sample output below was captured while the update was still in progress):

    oc get mcp
    NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    holoscanmedia   rendered-holoscanmedia-7e2a6af0d84770766c9246dffdf55a40   False     True       False      2              0                   0                     0                      100m
    master          rendered-master-fd62216b1239cd90a6a3cd7c3b9e213d          True      False      False      3              3                   3                     0                      4h
    worker          rendered-worker-7e2a6af0d84770766c9246dffdf55a40          True      False      False      0              0                   0                     0                      4h
    
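This readiness check can also be scripted. The sketch below compares READYMACHINECOUNT against MACHINECOUNT using a canned line in the format shown above (values are illustrative); in practice, pipe live oc get mcp output through the same awk expressions:

```shell
# Hedged sketch: compare READYMACHINECOUNT (field 7) with MACHINECOUNT (field 6).
# Canned sample of a fully updated pool; in practice pipe `oc get mcp` in.
line='holoscanmedia   rendered-holoscanmedia-7e2a6af0   True   False   False   2   2   2   0   3h'
machines=$(echo "$line" | awk '{ print $6 }')
ready=$(echo "$line" | awk '{ print $7 }')
[ "$ready" = "$machines" ] && echo "all $machines machines ready"
```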

Disable PCIe ACS#

PCIe Access Control Services (ACS) must be disabled to maximize achievable bandwidth.

  1. Create disable_acs.yaml based on the following template. Replace <role> with worker for a 5-node cluster and master for a 3-node cluster or SNO:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
        labels:
            machineconfiguration.openshift.io/role: <role>
        name: disable-acs
    spec:
        config:
            ignition:
                version: 3.4.0
            storage:
                files:
                - path: /usr/local/bin/disable-acs.sh
                  contents:
                    source: data:;charset=utf-8;base64,IyEvYmluL2Jhc2gKIyBtdXN0IGJlIHJvb3QgdG8gYWNjZXNzIGV4dGVuZGVkIFBDSSBjb25maWcgc3BhY2UKZWNobyAiU3RhcnRpbmcgdGhlIHNjcmlwdCIKaWYgWyAiJEVVSUQiIC1uZSAwIF07IHRoZW4KICBlY2hvICJFUlJPUjogJDAgbXVzdCBiZSBydW4gYXMgcm9vdCIKICBleGl0IDEKZmkKZm9yIEJERiBpbiBgbHNwY2kgLWQgIio6KjoqIiB8IGF3ayAne3ByaW50ICQxfSdgOyBkbwogICAgIyBza2lwIGlmIGl0IGRvZXNuJ3Qgc3VwcG9ydCBBQ1MKICAgIHNldHBjaSAtdiAtcyAke0JERn0gRUNBUF9BQ1MrMHg2LncgPiAvZGV2L251bGwgMj4mMQogICAgaWYgWyAkPyAtbmUgMCBdOyB0aGVuCiAgICAgICAgICAgICNlY2hvICIke0JERn0gZG9lcyBub3Qgc3VwcG9ydCBBQ1MsIHNraXBwaW5nIgogICAgICAgICAgICBjb250aW51ZQogICAgZmkKICAgIGxvZ2dlciAiRGlzYWJsaW5nIEFDUyBvbiBgbHNwY2kgLXMgJHtCREZ9YCIKICAgIHNldHBjaSAtdiAtcyAke0JERn0gRUNBUF9BQ1MrMHg2Lnc9MDAwMAogICAgaWYgWyAkPyAtbmUgMCBdOyB0aGVuCiAgICAgICAgbG9nZ2VyICJFcnJvciBkaXNhYmxpbmcgQUNTIG9uICR7QkRGfSIKICAgICAgICAgICAgY29udGludWUKICAgIGZpCiAgICBORVdfVkFMPWBzZXRwY2kgLXYgLXMgJHtCREZ9IEVDQVBfQUNTKzB4Ni53IHwgYXdrICd7cHJpbnQgJE5GfSdgCiAgICBpZiBbICIke05FV19WQUx9IiAhPSAiMDAwMCIgXTsgdGhlbgogICAgICAgIGxvZ2dlciAiRmFpbGVkIHRvIGRpc2FibGUgQUNTIG9uICR7QkRGfSIKICAgICAgICAgICAgY29udGludWUKICAgIGZpCmRvbmUKZXhpdCAwCg==
                  filesystem: root
                  mode: 0755
            systemd:
                units:
                - name: disable-acs.service
                  enabled: true
                  contents: |
                    [Unit]
                    Description=Run to disable pci acs
                    After=network-online.target
                    [Service]
                    Type=oneshot
                    ExecStart=/usr/local/bin/disable-acs.sh
                    RemainAfterExit=yes
                    [Install]
                    WantedBy=multi-user.target
    

    The Base64-encoded data in the YAML above corresponds to the following plain text:

    #!/bin/bash
    # must be root to access extended PCI config space
    echo "Starting the script"
    if [ "$EUID" -ne 0 ]; then
        echo "ERROR: $0 must be run as root"
        exit 1
    fi
    for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
        # skip if it doesn't support ACS
        setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            #echo "${BDF} does not support ACS, skipping"
            continue
        fi
        logger "Disabling ACS on `lspci -s ${BDF}`"
        setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
        if [ $? -ne 0 ]; then
            logger "Error disabling ACS on ${BDF}"
            continue
        fi
        NEW_VAL=`setpci -v -s ${BDF} ECAP_ACS+0x6.w | awk '{print $NF}'`
        if [ "${NEW_VAL}" != "0000" ]; then
            logger "Failed to disable ACS on ${BDF}"
            continue
        fi
    done
    exit 0
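If you modify the script, the embedded payload must be regenerated with base64 -w0 (the -w0 flag disables line wrapping in GNU coreutils; macOS base64 does not wrap by default) and pasted back into the data URL. A minimal round-trip sketch, using just the shebang line as a stand-in for the full script:

```shell
# Round-trip sketch: decode a payload fragment and re-encode it.
# "IyEvYmluL2Jhc2gK" is the opening "#!/bin/bash" line of the payload above.
sample='IyEvYmluL2Jhc2gK'
decoded=$(echo "$sample" | base64 -d)
reencoded=$(printf '%s\n' "$decoded" | base64 -w0)
[ "$reencoded" = "$sample" ] && echo "payload round-trips: $decoded"
```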
    
  2. Apply the machine config:

    oc create -f disable_acs.yaml
    

    This command reboots the nodes in the targeted pool. Wait 15–20 minutes for them to come back up before proceeding.

  3. Ensure that READYMACHINECOUNT for the pool matches the total number of nodes in it (the sample output below was captured while the update was still in progress):

    oc get mcp
    NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    holoscanmedia   rendered-holoscanmedia-7e2a6af0d84770766c9246dffdf55a40   False     True       False      2              0                   0                     0                      100m
    master          rendered-master-fd62216b1239cd90a6a3cd7c3b9e213d          True      False      False      3              3                   3                     0                      4h
    worker          rendered-worker-7e2a6af0d84770766c9246dffdf55a40          True      False      False      0              0                   0                     0                      4h