Aerial System Scripts

System Configuration Validation Script

The release package includes a script that checks and displays the key software versions and system configuration settings required to run Aerial CUDA-Accelerated RAN:

$ pip3 install psutil packaging paramiko
$ cd $cuBB_SDK/cuPHY/util/cuBB_system_checks
$ sudo -E python3 ./cuBB_system_checks.py

The output of cuBB_system_checks.py may differ slightly between bare-metal, container, and Kubernetes-based platforms. The script retrieves software-component versions and hardware configurations; compare them against the Release Manifest in the cuBB Release Notes to confirm that the correct software-component versions are installed. Some software-component versions and hardware configurations cannot be retrieved from inside the Aerial container, so when the script runs in the container it can use SSH to gather that information from the host. Below is an example of using SSH with password authentication:

$ python3 cuBB_system_checks.py --host <hostname or IP address> --username <username on the host>
[+] Connecting to <hostname> with password auth.
Password for <username>@<hostname>:
[+] Caching sudo password...
[+] Sudo password cached successfully.

If you are using Red Hat OpenShift to manage Aerial, the script can retrieve information using the oc command:

$ oc get nodes  # check whether you are already logged in to an RHOCP cluster
NAME                          STATUS   ROLES                         AGE   VERSION
gh-smc-cg1-qs-06.nvidia.com   Ready    control-plane,master,worker   70d   v1.28.11+add48d0
$ python3 cuBB_system_checks.py --cli oc

Below is an example of the script’s output from a container with SSH access to the host:

# To get the system or ptp info, the command has to run on the host.
$ python3 cuBB_system_checks.py --host <hostname or IP address> --username <username on the host>
[+] Connecting to <hostname or IP address> with password auth.
Password for <username>@<hostname or IP address>:
[+] Caching sudo password...
[+] Sudo password cached successfully.
-----General--------------------------------------
Hostname                           : smc-gh-01
IP address                         : <IP address>
Linux distro                       : "Ubuntu 22.04.4 LTS"
Linux kernel version               : 6.5.0-1019-nvidia-64k
-----System---------------------------------------
FRU Device Description : Builtin FRU Device (ID 0)
Board Mfg Date        : Mon Jan  1 00:00:00 1996
Board Mfg             : Supermicro
Board Serial          :
Product Serial        :

FRU Device Description : BMC FRU (ID 2)
Board Mfg Date        : Mon Apr 17 10:40:00 2023
Board Mfg             : Supermicro
Board Product         : BMC Secure Control Module
Board Serial          :
Board Part Number     : AOM-SCM-NV
Product Manufacturer  : Supermicro
Product Name          : BMC Secure Control Module
Product Part Number   : AOM-SCM-NV
Product Version       : 1.00

FRU Device Description : AOC1 FRU (ID 4)
Board Mfg Date        : Wed Aug  2 20:41:00 2023
Board Mfg             : Nvidia
Board Product         : BlueField-3 SmartNIC Main Card
Board Serial          :
Board Part Number     : 900-9D3B6-00CV-AA0
Product Manufacturer  : Nvidia
Product Name          : BlueField-3 SmartNIC Main Card
Product Part Number   : 900-9D3B6-00CV-AA0
Product Version       : A9
Product Serial        :
Product Asset Tag     : 900-9D3B6-00CV-AA0

FRU Device Description : MB FRU (ID 1)
Invalid FRU size 0

FRU Device Description : CPU FRU (ID 3)
Board Mfg Date        : Wed Jul  5 21:53:00 2023
Board Mfg             : NVIDIA
Board Product         : PG530
Board Serial          :
Board Part Number     : 699-2G530-0206-QS1
Product Manufacturer  : NVIDIA
Product Name          : GH200 480GB
Product Part Number   : 900-2G530-0000-000
Product Version       : A-R00
Product Serial        :

FRU Device Description : AOC2 FRU (ID 5)
Board Mfg Date        : Thu Jul 27 02:16:00 2023
Board Mfg             : Nvidia
Board Product         : BlueField-3 SmartNIC Main Card
Board Serial          :
Board Part Number     : 900-9D3B6-00CV-AA0
Product Manufacturer  : Nvidia
Product Name          : BlueField-3 SmartNIC Main Card
Product Part Number   : 900-9D3B6-00CV-AA0
Product Version       : A9
Product Serial        :
Product Asset Tag     : 900-9D3B6-00CV-AA0
-----Kernel Command Line--------------------------
Audit subsystem                    : audit=0
Clock source                       : N/A
HugePage count                     : hugepages=48
HugePage size                      : hugepagesz=512M
CPU idle time management           : idle=poll
Max Intel C-state                  : N/A
Intel IOMMU                        : N/A
IOMMU                              : N/A
Isolated CPUs                      : isolcpus=managed_irq,domain,4-64
Corrected errors                   : N/A
Adaptive-tick CPUs                 : nohz_full=4-64
Soft-lockup detector disable       : nosoftlockup
Max processor C-state              : processor.max_cstate=0
RCU callback polling               : rcu_nocb_poll
No-RCU-callback CPUs               : rcu_nocbs=4-64
TSC stability checks               : tsc=reliable
IRQ affinity                       : irqaffinity=0
ACPI power meter cap forcely on    : acpi_power_meter.force_cap_on=y
NUMA balancing                     : numa_balancing=disable
Mem init on alloc                  : init_on_alloc=0
Preempt                            : preempt=none
Pressure Stall Information         : N/A  ("psi=0" is recommended)
-----CPU------------------------------------------
CPU cores                          : 72
Thread(s) per CPU core             : 1
CPU max MHz:                       : 3456.0000
CPU sockets                        : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS        : 8
cuBB_SDK                           : /opt/nvidia/cuBB
-----Memory---------------------------------------
HugePage count                     : 72
Free HugePages                     : 70
HugePage size                      : 524288 kB
Shared memory size                 : 240G
-----Nvidia GPUs----------------------------------
GPU driver version                 : 570.124.06
CUDA version                       : 12.8
GPU0
  GPU product name                 : NVIDIA GH200 480GB
  GPU persistence mode             : Enabled
  Current GPU temperature          : 34 C
  Max GPU clock frequency          : 1980 MHz
  GPU clock frequency              : 1980 MHz
  GPU PCIe bus id                  : 00000009:01:00.0
-----GPUDirect topology---------------------------
GPU0    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    NODE    0-71    0               1
NIC0    NODE     X      PIX     NODE    NODE
NIC1    NODE    PIX      X      NODE    NODE
NIC2    NODE    NODE    NODE     X      PIX
NIC3    NODE    NODE    NODE    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
-----Loaded Kernel Modules------------------------
GDRCopy                            : gdrdrv
GPUDirect RDMA                     : N/A
Nvidia                             : nvidia
-----Non-persistent settings----------------------
VM swappiness                      : vm.swappiness = 0
VM zone reclaim mode               : vm.zone_reclaim_mode = 0
-----Kernel Parameters----------------------------
Real-time throttling               : -1
Transparent hugepage               : [madvise]
-----Software Packages----------------------------
docker      /usr/bin               : 27.3.1
NVIDIA Container Toolkit           : 1.17.4
OFED version                       : OFED-internal-24.04-0.6.6
ptp4l       /usr/sbin              : 3.1.1-3
-----Software Packages in the Container-----------
-----Linux PTP------------------------------------
● ptp4l.service - Precision Time Protocol (PTP) service
    Loaded: loaded (/lib/systemd/system/ptp4l.service; enabled; vendor preset: enabled)
    Active: active (running) since Wed 2024-11-27 01:58:59 UTC; 2 months 14 days ago
      Docs: man:ptp4l
  Main PID: 3903 (ptp4l)
      Tasks: 1 (limit: 146899)
    Memory: 7.3M
        CPU: 58min 50.438s
    CGroup: /system.slice/ptp4l.service
            └─3903 /usr/sbin/ptp4l -f /etc/ptp.conf

Feb 10 06:27:41 smc-gh-01 ptp4l[3903]: [6496263.224] rms    2 max    4 freq  -4911 +/-  12 delay   -92 +/-   0
Feb 10 06:27:42 smc-gh-01 ptp4l[3903]: [6496264.224] rms    2 max    4 freq  -4908 +/-   9 delay   -93 +/-   0
Feb 10 06:27:43 smc-gh-01 ptp4l[3903]: [6496265.224] rms    3 max    7 freq  -4912 +/-  13 delay   -93 +/-   0
Feb 10 06:27:44 smc-gh-01 ptp4l[3903]: [6496266.224] rms    2 max    5 freq  -4919 +/-   8 delay   -93 +/-   0
Feb 10 06:27:45 smc-gh-01 ptp4l[3903]: [6496267.225] rms    2 max    5 freq  -4910 +/-   9 delay   -93 +/-   0
Feb 10 06:27:46 smc-gh-01 ptp4l[3903]: [6496268.225] rms    2 max    5 freq  -4911 +/-  11 delay   -93 +/-   0
Feb 10 06:27:47 smc-gh-01 ptp4l[3903]: [6496269.225] rms    3 max    7 freq  -4908 +/-  15 delay   -93 +/-   0
Feb 10 06:27:48 smc-gh-01 ptp4l[3903]: [6496270.225] rms    2 max    3 freq  -4911 +/-   9 delay   -93 +/-   0
Feb 10 06:27:49 smc-gh-01 ptp4l[3903]: [6496271.225] rms    2 max    5 freq  -4919 +/-   9 delay   -93 +/-   0
Feb 10 06:27:50 smc-gh-01 ptp4l[3903]: [6496272.225] rms    2 max    3 freq  -4912 +/-   9 delay   -93 +/-   0

● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
    Loaded: loaded (/lib/systemd/system/phc2sys.service; enabled; vendor preset: enabled)
    Active: active (running) since Wed 2024-11-27 01:59:01 UTC; 2 months 14 days ago
      Docs: man:phc2sys
  Main PID: 4304 (sh)
      Tasks: 2 (limit: 146899)
    Memory: 2.0M
        CPU: 5h 45min 34.886s
    CGroup: /system.slice/phc2sys.service
            ├─4304 /bin/sh -c "taskset -c 21 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T aerial01 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
            └─4309 /usr/sbin/phc2sys -s /dev/ptp1 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256

Feb 10 06:27:40 smc-gh-01 phc2sys[4309]: [6496262.994] CLOCK_REALTIME rms    7 max   19 freq   -934 +/-  14 delay   506 +/-  12
Feb 10 06:27:41 smc-gh-01 phc2sys[4309]: [6496264.010] CLOCK_REALTIME rms    8 max   19 freq   -934 +/-  18 delay   506 +/-  12
Feb 10 06:27:42 smc-gh-01 phc2sys[4309]: [6496265.026] CLOCK_REALTIME rms    7 max   19 freq   -942 +/-  19 delay   508 +/-  11
Feb 10 06:27:43 smc-gh-01 phc2sys[4309]: [6496266.042] CLOCK_REALTIME rms    8 max   19 freq   -935 +/-  30 delay   506 +/-  13
Feb 10 06:27:44 smc-gh-01 phc2sys[4309]: [6496267.058] CLOCK_REALTIME rms    7 max   17 freq   -933 +/-  11 delay   506 +/-  13
Feb 10 06:27:46 smc-gh-01 phc2sys[4309]: [6496268.074] CLOCK_REALTIME rms    7 max   17 freq   -929 +/-  10 delay   506 +/-  12
Feb 10 06:27:47 smc-gh-01 phc2sys[4309]: [6496269.091] CLOCK_REALTIME rms    7 max   18 freq   -941 +/-  15 delay   506 +/-  13
Feb 10 06:27:48 smc-gh-01 phc2sys[4309]: [6496270.107] CLOCK_REALTIME rms    8 max   18 freq   -938 +/-  10 delay   506 +/-  12
Feb 10 06:27:49 smc-gh-01 phc2sys[4309]: [6496271.123] CLOCK_REALTIME rms    8 max   19 freq   -937 +/-  21 delay   507 +/-  12
Feb 10 06:27:50 smc-gh-01 phc2sys[4309]: [6496272.139] CLOCK_REALTIME rms    7 max   18 freq   -932 +/-  16 delay   506 +/-  12
-----NTP------------------------------------------
NTP                                : inactive
-----Mellanox NIC Interfaces----------------------
Interface0
  Name                             : aerial00
  Network adapter                  : mlx5_0
  PCIe bus id                      : 0000:01:00.0
  Ethernet address                 : 94:6d:ae:f5:a9:12
  Operstate                        : up
  MTU                              : 1500
  RX flow control                  : off
  TX flow control                  : off
  PTP hardware clock               : 0
  QoS Priority trust state         : pcp
  PCIe MRRS                        : N/A
High-quality Tx timestamp          : on
Interface1
  Name                             : aerial01
  Network adapter                  : mlx5_0
  PCIe bus id                      : 0000:01:00.1
  Ethernet address                 : 94:6d:ae:f5:a9:13
  Operstate                        : up
  MTU                              : 1500
  RX flow control                  : off
  TX flow control                  : off
  PTP hardware clock               : 1
  QoS Priority trust state         : pcp
  PCIe MRRS                        : N/A
High-quality Tx timestamp          : on
Interface2
  Name                             : aerial02
  Network adapter                  : mlx5_1
  PCIe bus id                      : 0002:01:00.0
  Ethernet address                 : 94:6d:ae:f5:a0:e8
  Operstate                        : up
  MTU                              : 1500
  RX flow control                  : off
  TX flow control                  : off
  PTP hardware clock               : 2
  QoS Priority trust state         : pcp
  PCIe MRRS                        : N/A
High-quality Tx timestamp          : on
Interface3
  Name                             : aerial03
  Network adapter                  : mlx5_1
  PCIe bus id                      : 0002:01:00.1
  Ethernet address                 : 94:6d:ae:f5:a0:e9
  Operstate                        : down
  MTU                              : 1500
  RX flow control                  : off
  TX flow control                  : off
  PTP hardware clock               : 3
  QoS Priority trust state         : pcp
  PCIe MRRS                        : N/A
High-quality Tx timestamp          : on
-----Mellanox NICs--------------------------------
NIC1
  NIC product name                 : BlueField3
  NIC part number                  : 900-9D3B6-00CV-A_Ax
  NIC PCIe bus id                  : /dev/mst/mt41692_pciconf1
  NIC FW version                   : 32.41.1000
  INTERNAL_CPU_MODEL               : EMBEDDED_CPU(1)
  INTERNAL_CPU_PAGE_SUPPLIER       : EXT_HOST_PF(1)
  INTERNAL_CPU_ESWITCH_MANAGER     : EXT_HOST_PF(1)
  INTERNAL_CPU_IB_VPORT0           : EXT_HOST_PF(1)
  INTERNAL_CPU_OFFLOAD_ENGINE      : DISABLED(1)
  FLEX_PARSER_PROFILE_ENABLE       : 4
  PROG_PARSE_GRAPH                 : True(1)
  ACCURATE_TX_SCHEDULER            : True(1)
  CQE_COMPRESSION                  : AGGRESSIVE(1)
  REAL_TIME_CLOCK_ENABLE           : True(1)
  LINK_TYPE_P1                     : ETH(2)
  LINK_TYPE_P2                     : ETH(2)
NIC2
  NIC product name                 : BlueField3
  NIC part number                  : 900-9D3B6-00CV-A_Ax
  NIC PCIe bus id                  : /dev/mst/mt41692_pciconf0
  NIC FW version                   : 32.41.1000
  INTERNAL_CPU_MODEL               : EMBEDDED_CPU(1)
  INTERNAL_CPU_PAGE_SUPPLIER       : EXT_HOST_PF(1)
  INTERNAL_CPU_ESWITCH_MANAGER     : EXT_HOST_PF(1)
  INTERNAL_CPU_IB_VPORT0           : EXT_HOST_PF(1)
  INTERNAL_CPU_OFFLOAD_ENGINE      : DISABLED(1)
  FLEX_PARSER_PROFILE_ENABLE       : 4
  PROG_PARSE_GRAPH                 : True(1)
  ACCURATE_TX_SCHEDULER            : True(1)
  CQE_COMPRESSION                  : AGGRESSIVE(1)
  REAL_TIME_CLOCK_ENABLE           : True(1)
  LINK_TYPE_P1                     : ETH(2)
  LINK_TYPE_P2                     : ETH(2)
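
When the report has to be compared against a list of required versions, it can help to load it into a data structure first. The following is a minimal Python sketch, not part of the release package, that assumes the report was saved as text and keeps the layout shown above: section banners such as "-----Memory-----" followed by "label : value" lines. The function names parse_report and find_mismatches are hypothetical, and the expected values in the example are copied from the sample output rather than from a Release Manifest. Sections that repeat the same label (for example the FRU entries or the per-interface blocks) keep only the last value, so this sketch suits the flat sections best.

```python
import re

# Section banners in the sample report look like "-----General-----...".
SECTION_RE = re.compile(r"^-{5}(?P<name>[^-].*?)-+$")


def parse_report(text):
    """Group 'label : value' lines under the section banner that precedes them."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        banner = SECTION_RE.match(line)
        if banner:
            current = sections.setdefault(banner.group("name"), {})
            continue
        # Skip log lines, legends, and fields whose value is empty.
        if current is None or " : " not in line:
            continue
        label, value = line.split(" : ", 1)
        current[label.strip()] = value.strip()
    return sections


def find_mismatches(report, expected):
    """Return (section, label, expected, actual) for every field that differs."""
    mismatches = []
    for section, fields in expected.items():
        for label, want in fields.items():
            got = report.get(section, {}).get(label)
            if got != want:
                mismatches.append((section, label, want, got))
    return mismatches


if __name__ == "__main__":
    # Excerpt of the sample report shown above.
    sample = """\
-----Memory---------------------------------------
HugePage count                     : 72
HugePage size                      : 524288 kB
-----Nvidia GPUs----------------------------------
GPU driver version                 : 570.124.06
CUDA version                       : 12.8
"""
    report = parse_report(sample)
    # Placeholder expectations copied from the sample, NOT from a Release Manifest.
    expected = {
        "Nvidia GPUs": {"GPU driver version": "570.124.06", "CUDA version": "12.8"}
    }
    print(find_mismatches(report, expected))
```

In practice the expected dictionary would be filled in by hand from the Release Manifest for the installed release; an empty result then means every listed field matches.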

Checking the NIC Status

To query the Mellanox NIC firmware settings reported by the script above, use this command:

$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\
  \|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\
  \|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\
  \|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"

        INTERNAL_CPU_MODEL                          EMBEDDED_CPU(1)
        INTERNAL_CPU_PAGE_SUPPLIER                  EXT_HOST_PF(1)
        INTERNAL_CPU_ESWITCH_MANAGER                EXT_HOST_PF(1)
        INTERNAL_CPU_IB_VPORT0                      EXT_HOST_PF(1)
        INTERNAL_CPU_OFFLOAD_ENGINE                 DISABLED(1)
        FLEX_PARSER_PROFILE_ENABLE                  4
        PROG_PARSE_GRAPH                            True(1)
        ACCURATE_TX_SCHEDULER                       True(1)
        CQE_COMPRESSION                             AGGRESSIVE(1)
        REAL_TIME_CLOCK_ENABLE                      True(1)
        LINK_TYPE_P1                                ETH(2)
        LINK_TYPE_P2                                ETH(2)
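
The filtered query output can also be checked automatically. The sketch below is one possible approach, assuming the output has been captured as text in the two-column form shown above. The EXPECTED table simply repeats the values from this sample output and should be replaced with the settings your deployment requires; parse_mlxconfig and diff_settings are hypothetical helper names, not part of MFT or the Aerial SDK.

```python
import re

# Values copied from the sample query output above; replace them with the
# settings your deployment actually requires.
EXPECTED = {
    "INTERNAL_CPU_MODEL": "EMBEDDED_CPU(1)",
    "INTERNAL_CPU_PAGE_SUPPLIER": "EXT_HOST_PF(1)",
    "INTERNAL_CPU_ESWITCH_MANAGER": "EXT_HOST_PF(1)",
    "INTERNAL_CPU_IB_VPORT0": "EXT_HOST_PF(1)",
    "INTERNAL_CPU_OFFLOAD_ENGINE": "DISABLED(1)",
    "FLEX_PARSER_PROFILE_ENABLE": "4",
    "PROG_PARSE_GRAPH": "True(1)",
    "ACCURATE_TX_SCHEDULER": "True(1)",
    "CQE_COMPRESSION": "AGGRESSIVE(1)",
    "REAL_TIME_CLOCK_ENABLE": "True(1)",
    "LINK_TYPE_P1": "ETH(2)",
    "LINK_TYPE_P2": "ETH(2)",
}

# Parameter names in the sample are upper-case words joined by underscores.
NAME_RE = re.compile(r"[A-Z][A-Z0-9_]*")


def parse_mlxconfig(text):
    """Collect 'NAME  VALUE' pairs from captured query output."""
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only two-column lines whose first column looks like a parameter name.
        if len(parts) == 2 and NAME_RE.fullmatch(parts[0]):
            settings[parts[0]] = parts[1]
    return settings


def diff_settings(actual, expected=EXPECTED):
    """Map each deviating parameter to (expected value, actual value or None)."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


if __name__ == "__main__":
    # Two lines taken from the sample output above.
    captured = """\
        CQE_COMPRESSION                             AGGRESSIVE(1)
        REAL_TIME_CLOCK_ENABLE                      True(1)
"""
    for name, (want, got) in diff_settings(parse_mlxconfig(captured)).items():
        print(f"{name}: expected {want}, found {got}")
```

A parameter that is missing from the captured text is reported with None as its actual value, which makes an overly narrow grep filter easy to spot.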

To check the current status of a NIC port, use this command:

$ sudo mlxlink -d /dev/mst/mt41692_pciconf0

Operational Info
----------------
State                              : Active
Physical state                     : LinkUp
Speed                              : 200G
Width                              : 4x
FEC                                : Standard_RS-FEC - (544,514)
Loopback Mode                      : No Loopback
Auto Negotiation                   : ON

Supported Info
--------------
Enabled Link Speed (Ext.)          : 0x00003ff2 (200G_2X,200G_4X,100G_1X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.)       : 0x000017f2 (200G_4X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)

Troubleshooting Info
--------------------
Status Opcode                      : 0
Group Opcode                       : N/A
Recommendation                     : No issue was observed

Tool Information
----------------
Firmware Version                   : 32.41.1000
amBER Version                      : 3.2
MFT Version                        : mft 4.28.0-92
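
For unattended checks, the port status can be extracted from the captured mlxlink text. This is a rough Python sketch under the assumption that the output keeps the "label : value" layout shown above; parse_mlxlink and link_problems are hypothetical names, and the criteria (State is Active, Physical state is LinkUp, optional speed match) are taken from this healthy sample rather than from an official pass/fail definition.

```python
def parse_mlxlink(text):
    """Collect 'label : value' pairs from captured mlxlink output."""
    fields = {}
    for line in text.splitlines():
        # Section titles, rulers, and blank lines carry no ' : ' separator.
        if " : " not in line:
            continue
        name, value = line.split(" : ", 1)
        fields[name.strip()] = value.strip()
    return fields


def link_problems(fields, expected_speed=None):
    """Return a list of human-readable problems; an empty list means the link looks healthy."""
    problems = []
    if fields.get("State") != "Active":
        problems.append(f"State is {fields.get('State')!r}, expected 'Active'")
    if fields.get("Physical state") != "LinkUp":
        problems.append(
            f"Physical state is {fields.get('Physical state')!r}, expected 'LinkUp'"
        )
    if expected_speed is not None and fields.get("Speed") != expected_speed:
        problems.append(f"Speed is {fields.get('Speed')!r}, expected {expected_speed!r}")
    return problems


if __name__ == "__main__":
    # Excerpt of the sample output above.
    captured = """\
Operational Info
----------------
State                              : Active
Physical state                     : LinkUp
Speed                              : 200G
"""
    print(link_problems(parse_mlxlink(captured), expected_speed="200G"))
```

Passing expected_speed is optional; leaving it out restricts the check to the two state fields.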

Alternatively, you can use the System Configuration Validation Script to obtain a full list of configuration settings.