Aerial System Scripts
System Configuration Validation Script
A script is included in the release package to check and display key software versions and system configuration settings required for running Aerial CUDA-Accelerated RAN:
$ pip3 install psutil packaging paramiko
$ cd $cuBB_SDK/cuPHY/util/cuBB_system_checks
$ sudo -E python3 ./cuBB_system_checks.py
The output of cuBB_system_checks.py
may differ slightly between bare-metal, container, and Kubernetes-based platforms.
The script helps retrieve software-component versions and hardware configurations.
Refer to the Release Manifest in the cuBB Release Notes to ensure
the correct software-component versions are installed.
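The comparison against the Release Manifest can be automated. The following is a minimal sketch, not part of the release package: the helper names `version_key` and `compare_versions` are hypothetical, and the `expected` values are placeholders that you would copy from the Release Manifest for your release. It compares only the numeric fields of each version string, so it is suitable for plain dotted versions and not for strings that carry build hashes.

```python
import re


def version_key(text):
    """Reduce a version string to a tuple of its integer fields."""
    return tuple(int(n) for n in re.findall(r"\d+", text))


def compare_versions(installed, expected):
    """Return {component: (installed, expected)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)
        if have is None or version_key(have) != version_key(want):
            mismatches[name] = (have, want)
    return mismatches


# Installed values as reported by the script. The expected values are
# placeholders (assumption): copy the real ones from the Release Manifest.
installed = {"GPU driver version": "570.124.06", "CUDA version": "12.8"}
expected = {"GPU driver version": "570.124.06", "CUDA version": "12.8"}

print(compare_versions(installed, expected) or "All versions match")
```

A component that is missing from `installed` is reported with `None` as its installed version, which makes a failed lookup visible instead of silently passing.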
Some software-component versions and hardware configurations cannot be retrieved directly from the Aerial container.
When the script is run from within the container, it can use SSH to gather this information from the host.
Below is an example of using SSH with password authentication:
$ python3 cuBB_system_checks.py --host <hostname or IP address> --username <username on the host>
[+] Connecting to <hostname> with password auth.
Password for <username>@<hostname>:
[+] Caching sudo password...
[+] Sudo password cached successfully.
If you are using Red Hat OpenShift to manage Aerial, the script can retrieve information using the oc command:
$ oc get nodes # check whether you are already logged in to an RHOCP cluster
NAME STATUS ROLES AGE VERSION
gh-smc-cg1-qs-06.nvidia.com Ready control-plane,master,worker 70d v1.28.11+add48d0
$ python3 cuBB_system_checks.py --cli oc
Below is an example of the script’s output from a container with SSH access to the host:
# To get the system or ptp info, the command has to run on the host.
$ python3 cuBB_system_checks.py --host <hostname or IP address> --username <username on the host>
[+] Connecting to <hostname or IP address> with password auth.
Password for <username>@<hostname or IP address>:
[+] Caching sudo password...
[+] Sudo password cached successfully.
-----General--------------------------------------
Hostname : smc-gh-01
IP address : <IP address>
Linux distro : "Ubuntu 22.04.4 LTS"
Linux kernel version : 6.5.0-1019-nvidia-64k
-----System---------------------------------------
FRU Device Description : Builtin FRU Device (ID 0)
Board Mfg Date : Mon Jan 1 00:00:00 1996
Board Mfg : Supermicro
Board Serial :
Product Serial :
FRU Device Description : BMC FRU (ID 2)
Board Mfg Date : Mon Apr 17 10:40:00 2023
Board Mfg : Supermicro
Board Product : BMC Secure Control Module
Board Serial :
Board Part Number : AOM-SCM-NV
Product Manufacturer : Supermicro
Product Name : BMC Secure Control Module
Product Part Number : AOM-SCM-NV
Product Version : 1.00
FRU Device Description : AOC1 FRU (ID 4)
Board Mfg Date : Wed Aug 2 20:41:00 2023
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial :
Board Part Number : 900-9D3B6-00CV-AA0
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AA0
Product Version : A9
Product Serial :
Product Asset Tag : 900-9D3B6-00CV-AA0
FRU Device Description : MB FRU (ID 1)
Invalid FRU size 0
FRU Device Description : CPU FRU (ID 3)
Board Mfg Date : Wed Jul 5 21:53:00 2023
Board Mfg : NVIDIA
Board Product : PG530
Board Serial :
Board Part Number : 699-2G530-0206-QS1
Product Manufacturer : NVIDIA
Product Name : GH200 480GB
Product Part Number : 900-2G530-0000-000
Product Version : A-R00
Product Serial :
FRU Device Description : AOC2 FRU (ID 5)
Board Mfg Date : Thu Jul 27 02:16:00 2023
Board Mfg : Nvidia
Board Product : BlueField-3 SmartNIC Main Card
Board Serial :
Board Part Number : 900-9D3B6-00CV-AA0
Product Manufacturer : Nvidia
Product Name : BlueField-3 SmartNIC Main Card
Product Part Number : 900-9D3B6-00CV-AA0
Product Version : A9
Product Serial :
Product Asset Tag : 900-9D3B6-00CV-AA0
-----Kernel Command Line--------------------------
Audit subsystem : audit=0
Clock source : N/A
HugePage count : hugepages=48
HugePage size : hugepagesz=512M
CPU idle time management : idle=poll
Max Intel C-state : N/A
Intel IOMMU : N/A
IOMMU : N/A
Isolated CPUs : isolcpus=managed_irq,domain,4-64
Corrected errors : N/A
Adaptive-tick CPUs : nohz_full=4-64
Soft-lockup detector disable : nosoftlockup
Max processor C-state : processor.max_cstate=0
RCU callback polling : rcu_nocb_poll
No-RCU-callback CPUs : rcu_nocbs=4-64
TSC stability checks : tsc=reliable
IRQ affinity : irqaffinity=0
ACPI power meter cap forcely on : acpi_power_meter.force_cap_on=y
NUMA balancing : numa_balancing=disable
Mem init on alloc : init_on_alloc=0
Preempt : preempt=none
Pressure Stall Information : N/A ("psi=0" is recommended)
-----CPU------------------------------------------
CPU cores : 72
Thread(s) per CPU core : 1
CPU max MHz: : 3456.0000
CPU sockets : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS : 8
cuBB_SDK : /opt/nvidia/cuBB
-----Memory---------------------------------------
HugePage count : 72
Free HugePages : 70
HugePage size : 524288 kB
Shared memory size : 240G
-----Nvidia GPUs----------------------------------
GPU driver version : 570.124.06
CUDA version : 12.8
GPU0
GPU product name : NVIDIA GH200 480GB
GPU persistence mode : Enabled
Current GPU temperature : 34 C
Max GPU clock frequency : 1980 MHz
GPU clock frequency : 1980 MHz
GPU PCIe bus id : 00000009:01:00.0
-----GPUDirect topology---------------------------
GPU0 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE NODE 0-71 0 1
NIC0 NODE X PIX NODE NODE
NIC1 NODE PIX X NODE NODE
NIC2 NODE NODE NODE X PIX
NIC3 NODE NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
-----Loaded Kernel Modules------------------------
GDRCopy : gdrdrv
GPUDirect RDMA : N/A
Nvidia : nvidia
-----Non-persistent settings----------------------
VM swappiness : vm.swappiness = 0
VM zone reclaim mode : vm.zone_reclaim_mode = 0
-----Kernel Parameters----------------------------
Real-time throttling : -1
Transparent hugepage : [madvise]
-----Software Packages----------------------------
docker /usr/bin : 27.3.1
NVIDIA Container Toolkit : 1.17.4
OFED version : OFED-internal-24.04-0.6.6
ptp4l /usr/sbin : 3.1.1-3
-----Software Packages in the Container-----------
-----Linux PTP------------------------------------
● ptp4l.service - Precision Time Protocol (PTP) service
Loaded: loaded (/lib/systemd/system/ptp4l.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-11-27 01:58:59 UTC; 2 months 14 days ago
Docs: man:ptp4l
Main PID: 3903 (ptp4l)
Tasks: 1 (limit: 146899)
Memory: 7.3M
CPU: 58min 50.438s
CGroup: /system.slice/ptp4l.service
└─3903 /usr/sbin/ptp4l -f /etc/ptp.conf
Feb 10 06:27:41 smc-gh-01 ptp4l[3903]: [6496263.224] rms 2 max 4 freq -4911 +/- 12 delay -92 +/- 0
Feb 10 06:27:42 smc-gh-01 ptp4l[3903]: [6496264.224] rms 2 max 4 freq -4908 +/- 9 delay -93 +/- 0
Feb 10 06:27:43 smc-gh-01 ptp4l[3903]: [6496265.224] rms 3 max 7 freq -4912 +/- 13 delay -93 +/- 0
Feb 10 06:27:44 smc-gh-01 ptp4l[3903]: [6496266.224] rms 2 max 5 freq -4919 +/- 8 delay -93 +/- 0
Feb 10 06:27:45 smc-gh-01 ptp4l[3903]: [6496267.225] rms 2 max 5 freq -4910 +/- 9 delay -93 +/- 0
Feb 10 06:27:46 smc-gh-01 ptp4l[3903]: [6496268.225] rms 2 max 5 freq -4911 +/- 11 delay -93 +/- 0
Feb 10 06:27:47 smc-gh-01 ptp4l[3903]: [6496269.225] rms 3 max 7 freq -4908 +/- 15 delay -93 +/- 0
Feb 10 06:27:48 smc-gh-01 ptp4l[3903]: [6496270.225] rms 2 max 3 freq -4911 +/- 9 delay -93 +/- 0
Feb 10 06:27:49 smc-gh-01 ptp4l[3903]: [6496271.225] rms 2 max 5 freq -4919 +/- 9 delay -93 +/- 0
Feb 10 06:27:50 smc-gh-01 ptp4l[3903]: [6496272.225] rms 2 max 3 freq -4912 +/- 9 delay -93 +/- 0
● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
Loaded: loaded (/lib/systemd/system/phc2sys.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-11-27 01:59:01 UTC; 2 months 14 days ago
Docs: man:phc2sys
Main PID: 4304 (sh)
Tasks: 2 (limit: 146899)
Memory: 2.0M
CPU: 5h 45min 34.886s
CGroup: /system.slice/phc2sys.service
├─4304 /bin/sh -c "taskset -c 21 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T aerial01 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
└─4309 /usr/sbin/phc2sys -s /dev/ptp1 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256
Feb 10 06:27:40 smc-gh-01 phc2sys[4309]: [6496262.994] CLOCK_REALTIME rms 7 max 19 freq -934 +/- 14 delay 506 +/- 12
Feb 10 06:27:41 smc-gh-01 phc2sys[4309]: [6496264.010] CLOCK_REALTIME rms 8 max 19 freq -934 +/- 18 delay 506 +/- 12
Feb 10 06:27:42 smc-gh-01 phc2sys[4309]: [6496265.026] CLOCK_REALTIME rms 7 max 19 freq -942 +/- 19 delay 508 +/- 11
Feb 10 06:27:43 smc-gh-01 phc2sys[4309]: [6496266.042] CLOCK_REALTIME rms 8 max 19 freq -935 +/- 30 delay 506 +/- 13
Feb 10 06:27:44 smc-gh-01 phc2sys[4309]: [6496267.058] CLOCK_REALTIME rms 7 max 17 freq -933 +/- 11 delay 506 +/- 13
Feb 10 06:27:46 smc-gh-01 phc2sys[4309]: [6496268.074] CLOCK_REALTIME rms 7 max 17 freq -929 +/- 10 delay 506 +/- 12
Feb 10 06:27:47 smc-gh-01 phc2sys[4309]: [6496269.091] CLOCK_REALTIME rms 7 max 18 freq -941 +/- 15 delay 506 +/- 13
Feb 10 06:27:48 smc-gh-01 phc2sys[4309]: [6496270.107] CLOCK_REALTIME rms 8 max 18 freq -938 +/- 10 delay 506 +/- 12
Feb 10 06:27:49 smc-gh-01 phc2sys[4309]: [6496271.123] CLOCK_REALTIME rms 8 max 19 freq -937 +/- 21 delay 507 +/- 12
Feb 10 06:27:50 smc-gh-01 phc2sys[4309]: [6496272.139] CLOCK_REALTIME rms 7 max 18 freq -932 +/- 16 delay 506 +/- 12
-----NTP------------------------------------------
NTP : inactive
-----Mellanox NIC Interfaces----------------------
Interface0
Name : aerial00
Network adapter : mlx5_0
PCIe bus id : 0000:01:00.0
Ethernet address : 94:6d:ae:f5:a9:12
Operstate : up
MTU : 1500
RX flow control : off
TX flow control : off
PTP hardware clock : 0
QoS Priority trust state : pcp
PCIe MRRS : N/A
High-quality Tx timestamp : on
Interface1
Name : aerial01
Network adapter : mlx5_0
PCIe bus id : 0000:01:00.1
Ethernet address : 94:6d:ae:f5:a9:13
Operstate : up
MTU : 1500
RX flow control : off
TX flow control : off
PTP hardware clock : 1
QoS Priority trust state : pcp
PCIe MRRS : N/A
High-quality Tx timestamp : on
Interface2
Name : aerial02
Network adapter : mlx5_1
PCIe bus id : 0002:01:00.0
Ethernet address : 94:6d:ae:f5:a0:e8
Operstate : up
MTU : 1500
RX flow control : off
TX flow control : off
PTP hardware clock : 2
QoS Priority trust state : pcp
PCIe MRRS : N/A
High-quality Tx timestamp : on
Interface3
Name : aerial03
Network adapter : mlx5_1
PCIe bus id : 0002:01:00.1
Ethernet address : 94:6d:ae:f5:a0:e9
Operstate : down
MTU : 1500
RX flow control : off
TX flow control : off
PTP hardware clock : 3
QoS Priority trust state : pcp
PCIe MRRS : N/A
High-quality Tx timestamp : on
-----Mellanox NICs--------------------------------
NIC1
NIC product name : BlueField3
NIC part number : 900-9D3B6-00CV-A_Ax
NIC PCIe bus id : /dev/mst/mt41692_pciconf1
NIC FW version : 32.41.1000
INTERNAL_CPU_MODEL : EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER : EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER : EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 : EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE : DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE : 4
PROG_PARSE_GRAPH : True(1)
ACCURATE_TX_SCHEDULER : True(1)
CQE_COMPRESSION : AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE : True(1)
LINK_TYPE_P1 : ETH(2)
LINK_TYPE_P2 : ETH(2)
NIC2
NIC product name : BlueField3
NIC part number : 900-9D3B6-00CV-A_Ax
NIC PCIe bus id : /dev/mst/mt41692_pciconf0
NIC FW version : 32.41.1000
INTERNAL_CPU_MODEL : EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER : EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER : EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 : EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE : DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE : 4
PROG_PARSE_GRAPH : True(1)
ACCURATE_TX_SCHEDULER : True(1)
CQE_COMPRESSION : AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE : True(1)
LINK_TYPE_P1 : ETH(2)
LINK_TYPE_P2 : ETH(2)
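If you capture this output to a file, the `-----Section-----` headers and `key : value` lines are regular enough to load into a dictionary for automated checks. The sketch below is an illustration only and assumes the layout shown above; `parse_report` is a hypothetical helper, not something the script provides. Keys that repeat within a section (for example the FRU fields or the per-interface blocks) overwrite each other, and lines with an empty value are skipped, so treat it as a starting point.

```python
import re

# A section header is five dashes, the section name, then a run of dashes.
SECTION = re.compile(r"^-{5}(.+?)-+$")


def parse_report(text):
    """Group 'key : value' lines under their '-----Section-----' header."""
    report, current = {}, None
    for line in text.splitlines():
        header = SECTION.match(line.strip())
        if header:
            current = report.setdefault(header.group(1).strip(), {})
        elif current is not None and " : " in line:
            key, value = line.split(" : ", 1)
            current[key.strip()] = value.strip()
    return report


# A few lines copied from the example output above.
sample = """\
-----General--------------------------------------
Linux kernel version : 6.5.0-1019-nvidia-64k
-----Kernel Command Line--------------------------
HugePage count : hugepages=48
HugePage size : hugepagesz=512M
-----Non-persistent settings----------------------
VM swappiness : vm.swappiness = 0
"""

report = parse_report(sample)
print(report["Kernel Command Line"]["HugePage size"])
```

Splitting on the first ` : ` only keeps values that themselves contain colons or equals signs intact.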
Checking the NIC Status
To query the Mellanox NIC firmware settings reported by the script above, use this command:
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\
\|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\
\|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\
\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
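The query output above can be checked automatically against a set of expected values. This is a minimal sketch under stated assumptions: `check_nic_settings` is a hypothetical helper, the `EXPECTED` values are simply those shown in the example output above (confirm them for your release before relying on them), and the parser assumes each setting appears as a name followed by a single value on one line.

```python
# Expected values taken from the example output above (assumption:
# confirm the required values for your release).
EXPECTED = {
    "INTERNAL_CPU_MODEL": "EMBEDDED_CPU(1)",
    "INTERNAL_CPU_PAGE_SUPPLIER": "EXT_HOST_PF(1)",
    "INTERNAL_CPU_ESWITCH_MANAGER": "EXT_HOST_PF(1)",
    "INTERNAL_CPU_IB_VPORT0": "EXT_HOST_PF(1)",
    "INTERNAL_CPU_OFFLOAD_ENGINE": "DISABLED(1)",
    "FLEX_PARSER_PROFILE_ENABLE": "4",
    "PROG_PARSE_GRAPH": "True(1)",
    "ACCURATE_TX_SCHEDULER": "True(1)",
    "CQE_COMPRESSION": "AGGRESSIVE(1)",
    "REAL_TIME_CLOCK_ENABLE": "True(1)",
    "LINK_TYPE_P1": "ETH(2)",
    "LINK_TYPE_P2": "ETH(2)",
}


def check_nic_settings(query_output, expected=EXPECTED):
    """Return {name: (actual, expected)} for settings that differ or are missing."""
    actual = {}
    for line in query_output.splitlines():
        fields = line.split()
        if len(fields) == 2:  # "<NAME> <VALUE>"
            actual[fields[0]] = fields[1]
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Simulated query output that matches every expected value.
good = "\n".join(f"{name} {value}" for name, value in EXPECTED.items())
print(check_nic_settings(good))

# The same output with one setting changed, to show how a mismatch is reported.
bad = good.replace("AGGRESSIVE(1)", "BALANCED(0)")
print(check_nic_settings(bad))
```

An empty dictionary means every expected setting was found with the expected value. To use it on a real system, pass in the text captured from the mlxconfig command above.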
To check the current status of a NIC port, use this command:
$ sudo mlxlink -d /dev/mst/mt41692_pciconf0
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed (Ext.) : 0x00003ff2 (200G_2X,200G_4X,100G_1X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.) : 0x000017f2 (200G_4X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed
Tool Information
----------------
Firmware Version : 32.41.1000
amBER Version : 3.2
MFT Version : mft 4.28.0-92
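The link status above can also be evaluated in a script. The following sketch is illustrative only: `link_summary` and `link_is_up` are hypothetical helpers, and treating `State : Active` together with `Physical state : LinkUp` as the healthy condition is an assumption based on the example output.

```python
def link_summary(mlxlink_output):
    """Collect the 'key : value' lines of mlxlink output into a dict."""
    fields = {}
    for line in mlxlink_output.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            # Keep the first occurrence if a key appears more than once.
            fields.setdefault(key.strip(), value.strip())
    return fields


def link_is_up(fields):
    """Assumed health criterion: active state and physical link up."""
    return (
        fields.get("State") == "Active"
        and fields.get("Physical state") == "LinkUp"
    )


# Lines copied from the example output above.
sample = """\
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
"""

fields = link_summary(sample)
print(link_is_up(fields), fields["Speed"])
```

Because the keys are stripped of surrounding whitespace, the same parsing works when the tool pads the key column for alignment.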
Alternatively, you can use the System Configuration Validation Script to obtain a full list of configuration settings.