Aerial System Scripts

Aerial CUDA-Accelerated RAN 24-2

The release package includes a script that checks and displays the key system configuration settings needed to run the Aerial cuBB SDK.

$ pip3 install psutil
$ cd $cuBB_SDK/cuPHY/util/cuBB_system_checks
$ sudo -E python3 ./cuBB_system_checks.py

The output of cuBB_system_checks.py may differ slightly between bare-metal and container environments. The script reports the software-component versions and the hardware configuration. Refer to the Release Manifest in the cuBB Release Notes to ensure that the correct software-component versions are installed. Below is an example output on a bare-metal platform:

# To get the system or ptp info, the command has to run on the host.
$ sudo -E python3 ./cuBB_system_checks.py --sys
-----General--------------------------------------
Hostname : smc-gh-01
IP address : 192.168.1.100
Linux distro : "Ubuntu 22.04.3 LTS"
Linux kernel version : 6.5.0-1019-nvidia
-----System---------------------------------------
Manufacturer : Supermicro
Product Name : ARS-111GL-NHR
Base Board Manufacturer : Supermicro
Base Board Product Name : G1SMH-G
Chassis Manufacturer : Supermicro
Chassis Type : Other
Chassis Height : 1 U
Processor : Grace A02
Max Speed : Unknown
Current Speed : 3402 MHz

$ sudo -E python3 ./cuBB_system_checks.py
-----General--------------------------------------
Hostname : smc-gh-01
IP address : 192.168.1.100
Linux distro : "Ubuntu 22.04.3 LTS"
Linux kernel version : 6.5.0-1019-nvidia
-----Kernel Command Line--------------------------
Audit subsystem : audit=0
Clock source : N/A
HugePage count : hugepages=32
HugePage size : hugepagesz=512M
CPU idle time management : idle=poll
Max Intel C-state : N/A
Intel IOMMU : N/A
IOMMU : N/A
Isolated CPUs : N/A
Corrected errors : N/A
Adaptive-tick CPUs : nohz_full=4-47
Soft-lockup detector disable : nosoftlockup
Max processor C-state : processor.max_cstate=0
RCU callback polling : rcu_nocb_poll
No-RCU-callback CPUs : rcu_nocbs=4-47
TSC stability checks : tsc=reliable
-----CPU------------------------------------------
CPU cores : 72
Thread(s) per CPU core : 1
CPU MHz: : N/A
CPU sockets : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS : N/A
cuBB_SDK : N/A
-----Memory---------------------------------------
HugePage count : 32
Free HugePages : 31
HugePage size : 524288 kB
Shared memory size : 240G
-----Nvidia GPUs----------------------------------
GPU driver version : 555.42.02
CUDA version : 12.5
GPU0
  GPU product name : NVIDIA GH200 480GB
  GPU persistence mode : Enabled
  Current GPU temperature : 36 C
  GPU clock frequency : 1980 MHz
  Max GPU clock frequency : 1980 MHz
  GPU PCIe bus id : 00000009:01:00.0
-----GPUDirect topology---------------------------
        GPU0    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     SYS     0-71            0               1
NIC0    SYS      X      PIX     SYS     SYS
NIC1    SYS     PIX      X      SYS     SYS
NIC2    SYS     SYS     SYS      X      PIX
NIC3    SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3

-----Mellanox NICs--------------------------------
NIC0
  NIC product name : BlueField3
  NIC part number : 900-9D3B6-00CV-A_Ax
  NIC PCIe bus id : /dev/mst/mt41692_pciconf1
  NIC FW version : 32.41.1000
  FLEX_PARSER_PROFILE_ENABLE : 4
  PROG_PARSE_GRAPH : True(1)
  ACCURATE_TX_SCHEDULER : True(1)
  CQE_COMPRESSION : AGGRESSIVE(1)
  REAL_TIME_CLOCK_ENABLE : True(1)
NIC1
  NIC product name : BlueField3
  NIC part number : 900-9D3B6-00CV-A_Ax
  NIC PCIe bus id : /dev/mst/mt41692_pciconf0
  NIC FW version : 32.41.1000
  FLEX_PARSER_PROFILE_ENABLE : 4
  PROG_PARSE_GRAPH : True(1)
  ACCURATE_TX_SCHEDULER : True(1)
  CQE_COMPRESSION : AGGRESSIVE(1)
  REAL_TIME_CLOCK_ENABLE : True(1)
-----Mellanox NIC Interfaces----------------------
Interface0
  Name : aerial00
  Network adapter : mlx5_0
  PCIe bus id : 0000:01:00.0
  Ethernet address : 94:6d:ae:c7:62:00
  Operstate : up
  MTU : 1514
  RX flow control : off
  TX flow control : off
  PTP hardware clock : 0
  QoS Priority trust state : pcp
  PCIe MRRS : 4096 bytes
Interface1
  Name : aerial01
  Network adapter : mlx5_1
  PCIe bus id : 0000:01:00.1
  Ethernet address : 94:6d:ae:c7:62:01
  Operstate : up
  MTU : 1500
  RX flow control : off
  TX flow control : off
  PTP hardware clock : 1
  QoS Priority trust state : pcp
  PCIe MRRS : 512 bytes
Interface2
  Name : aerial02
  Network adapter : mlx5_2
  PCIe bus id : 0002:01:00.0
  Ethernet address : 94:6d:ae:c7:6b:80
  Operstate : down
  MTU : 1500
  RX flow control : on
  TX flow control : on
  PTP hardware clock : 2
  QoS Priority trust state : pcp
  PCIe MRRS : 512 bytes
Interface3
  Name : aerial03
  Network adapter : mlx5_3
  PCIe bus id : 0002:01:00.1
  Ethernet address : 94:6d:ae:c7:6b:81
  Operstate : down
  MTU : 1500
  RX flow control : on
  TX flow control : on
  PTP hardware clock : 3
  QoS Priority trust state : pcp
  PCIe MRRS : 512 bytes
-----Linux PTP------------------------------------
● ptp4l.service - Precision Time Protocol (PTP) service
     Loaded: loaded (/lib/systemd/system/ptp4l.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-06-05 21:42:18 UTC; 6h ago
       Docs: man:ptp4l
    Process: 4267 ExecStartPre=ethtool --set-priv-flags aerial01 tx_port_ts on (code=exited, status=0/SUCCESS)
    Process: 4386 ExecStartPre=ethtool -A aerial01 rx off tx off (code=exited, status=0/SUCCESS)
   Main PID: 4508 (ptp4l)
      Tasks: 1 (limit: 146899)
     Memory: 8.2M
        CPU: 17.936s
     CGroup: /system.slice/ptp4l.service
             └─4508 /usr/sbin/ptp4l -f /etc/ptp.conf

Jun 06 03:45:21 smc-gh-01 ptp4l[4508]: [21807.308] rms 2 max 5 freq -1855 +/- 11 delay -96 +/- 0
Jun 06 03:45:22 smc-gh-01 ptp4l[4508]: [21808.308] rms 3 max 6 freq -1848 +/- 10 delay -96 +/- 0
Jun 06 03:45:23 smc-gh-01 ptp4l[4508]: [21809.308] rms 2 max 4 freq -1851 +/- 9 delay -96 +/- 1
Jun 06 03:45:24 smc-gh-01 ptp4l[4508]: [21810.308] rms 2 max 4 freq -1851 +/- 8 delay -97 +/- 1
Jun 06 03:45:25 smc-gh-01 ptp4l[4508]: [21811.308] rms 3 max 6 freq -1864 +/- 13 delay -96 +/- 0
Jun 06 03:45:26 smc-gh-01 ptp4l[4508]: [21812.308] rms 2 max 5 freq -1860 +/- 10 delay -96 +/- 0
Jun 06 03:45:27 smc-gh-01 ptp4l[4508]: [21813.308] rms 2 max 5 freq -1852 +/- 10 delay -97 +/- 0
Jun 06 03:45:28 smc-gh-01 ptp4l[4508]: [21814.308] rms 3 max 5 freq -1858 +/- 12 delay -96 +/- 1
Jun 06 03:45:29 smc-gh-01 ptp4l[4508]: [21815.308] rms 3 max 5 freq -1849 +/- 10 delay -97 +/- 0
Jun 06 03:45:30 smc-gh-01 ptp4l[4508]: [21816.308] rms 3 max 5 freq -1850 +/- 13 delay -97 +/- 0

● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
     Loaded: loaded (/lib/systemd/system/phc2sys.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-06-05 21:42:20 UTC; 6h ago
       Docs: man:phc2sys
    Process: 4529 ExecStartPre=sleep 2 (code=exited, status=0/SUCCESS)
   Main PID: 4873 (sh)
      Tasks: 2 (limit: 146899)
     Memory: 2.1M
        CPU: 1min 14.399s
     CGroup: /system.slice/phc2sys.service
             ├─4873 /bin/sh -c "taskset -c 47 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T aerial00 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
             └─4878 /usr/sbin/phc2sys -s /dev/ptp0 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256

Jun 06 03:45:20 smc-gh-01 phc2sys[4878]: [21806.453] CLOCK_REALTIME rms 8 max 20 freq +8730 +/- 44 delay 512 +/- 0
Jun 06 03:45:21 smc-gh-01 phc2sys[4878]: [21807.469] CLOCK_REALTIME rms 8 max 20 freq +8758 +/- 36 delay 512 +/- 0
Jun 06 03:45:22 smc-gh-01 phc2sys[4878]: [21808.486] CLOCK_REALTIME rms 7 max 19 freq +8740 +/- 44 delay 512 +/- 3
Jun 06 03:45:23 smc-gh-01 phc2sys[4878]: [21809.502] CLOCK_REALTIME rms 7 max 18 freq +8749 +/- 35 delay 512 +/- 0
Jun 06 03:45:24 smc-gh-01 phc2sys[4878]: [21810.519] CLOCK_REALTIME rms 7 max 16 freq +8744 +/- 35 delay 512 +/- 0
Jun 06 03:45:25 smc-gh-01 phc2sys[4878]: [21811.535] CLOCK_REALTIME rms 8 max 21 freq +8722 +/- 55 delay 512 +/- 0
Jun 06 03:45:26 smc-gh-01 phc2sys[4878]: [21812.552] CLOCK_REALTIME rms 9 max 23 freq +8750 +/- 61 delay 512 +/- 2
Jun 06 03:45:28 smc-gh-01 phc2sys[4878]: [21813.570] CLOCK_REALTIME rms 8 max 20 freq +8749 +/- 49 delay 512 +/- 2
Jun 06 03:45:29 smc-gh-01 phc2sys[4878]: [21814.589] CLOCK_REALTIME rms 6 max 18 freq +8735 +/- 29 delay 512 +/- 2
Jun 06 03:45:30 smc-gh-01 phc2sys[4878]: [21815.608] CLOCK_REALTIME rms 7 max 18 freq +8762 +/- 40 delay 512 +/- 3
-----Software Packages----------------------------
cmake : N/A
docker /usr/bin : 26.1.3
gcc /usr/bin : 11.4.0
git-lfs /usr/bin : 3.0.2
MOFED : N/A
meson : N/A
ninja : N/A
ptp4l /usr/sbin : 3.1.1-3
-----Loaded Kernel Modules------------------------
GDRCopy : gdrdrv
GPUDirect RDMA : N/A
Nvidia : nvidia
-----Non-persistent settings----------------------
VM swappiness : vm.swappiness = 0
VM zone reclaim mode : vm.zone_reclaim_mode = 0
-----Docker images--------------------------------
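The report follows a regular layout: a dashed banner names each section, and most entries are "key : value" pairs. The following is a minimal Python sketch, assuming that layout holds for your release, that turns saved script output into a dictionary for scripted checks. The names parse_report and hugepage_bytes are illustrative and are not part of cuBB_system_checks.py. In this simple version, keys repeated per device (GPU0, NIC0, Interface0, and so on) overwrite each other, so only the last device's values are kept.

```python
import re

# A section banner looks like "-----Memory-----------------...".
SECTION_RE = re.compile(r"^-----(.+?)-{2,}\s*$")


def parse_report(text):
    """Group 'key : value' lines under their section banners."""
    report = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        banner = SECTION_RE.match(line)
        if banner:
            section = banner.group(1).strip()
            report[section] = {}
            continue
        # Skip lines outside a section and lines that are not key/value pairs
        # (topology table rows, systemd status, log lines).
        if section is None or " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        # "CPU MHz: : N/A" carries a stray colon in the key; drop it.
        report[section][key.strip().rstrip(":")] = value.strip()
    return report


def hugepage_bytes(report):
    """Total huge page memory in bytes, from the Memory section."""
    memory = report["Memory"]
    count = int(memory["HugePage count"])
    size_kb = int(memory["HugePage size"].split()[0])  # e.g. "524288 kB"
    return count * size_kb * 1024


# Excerpt copied from the example output on this page.
SAMPLE = """\
-----General--------------------------------------
Hostname : smc-gh-01
Linux kernel version : 6.5.0-1019-nvidia
-----CPU------------------------------------------
CPU cores : 72
CPU MHz: : N/A
-----Memory---------------------------------------
HugePage count : 32
Free HugePages : 31
HugePage size : 524288 kB
"""

if __name__ == "__main__":
    parsed = parse_report(SAMPLE)
    print(parsed["General"]["Linux kernel version"])
    print(hugepage_bytes(parsed) // 1024**3, "GiB of huge pages")
```

To use it on a real system, redirect the script output to a file and pass the file contents to parse_report.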

Checking the NIC Status

To read back the Mellanox NIC firmware settings shown in the script output above, use this command:

$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\
\|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\
\|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\
\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
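The filtered mlxconfig output is a list of "NAME VALUE" pairs, so it can be compared against a set of expected values in a few lines. The following Python sketch assumes that two-column layout. The EXPECTED table is copied from the example output on this page and is an assumption for illustration, not an authoritative list of required settings; the function names are hypothetical.

```python
import re

# One setting per line: an upper-case name followed by a single value token.
SETTING_RE = re.compile(r"^\s*([A-Z][A-Z0-9_]*)\s+(\S+)\s*$")

# Assumption: values taken from the example output on this page.
EXPECTED = {
    "FLEX_PARSER_PROFILE_ENABLE": "4",
    "PROG_PARSE_GRAPH": "True(1)",
    "ACCURATE_TX_SCHEDULER": "True(1)",
    "CQE_COMPRESSION": "AGGRESSIVE(1)",
    "REAL_TIME_CLOCK_ENABLE": "True(1)",
}


def parse_mlxconfig(text):
    """Map setting names to their values from filtered query output."""
    settings = {}
    for line in text.splitlines():
        match = SETTING_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def mismatches(settings, expected=EXPECTED):
    """Return {name: (expected, actual)} for settings that differ or are missing."""
    return {
        name: (want, settings.get(name))
        for name, want in expected.items()
        if settings.get(name) != want
    }


# Excerpt copied from the example output on this page.
SAMPLE = """\
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
"""

if __name__ == "__main__":
    problems = mismatches(parse_mlxconfig(SAMPLE))
    print("all settings match" if not problems else problems)
```

Returning the mismatches as a dictionary keeps the check easy to log or to turn into a nonzero exit code in a wrapper script.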

To check the current status of a NIC port, use this command:

$ sudo mlxlink -d /dev/mst/mt41692_pciconf0

Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info
--------------
Enabled Link Speed (Ext.) : 0x00003ff2 (200G_2X,200G_4X,100G_1X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.) : 0x000017f2 (200G_4X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)

Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information
----------------
Firmware Version : 32.41.1000
amBER Version : 3.2
MFT Version : mft 4.28.0-92
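For an automated port check, the fields of interest in the mlxlink output are State, Physical state, and Speed. The following Python sketch assumes the "key : value" layout shown in the example and treats a port as up when State is Active and Physical state is LinkUp, which are the values in that example. The function names are illustrative.

```python
def parse_mlxlink(text):
    """Collect 'key : value' fields from mlxlink output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(" : ")
        if separator:  # headings and dashed rules have no separator
            fields[key.strip()] = value.strip()
    return fields


def link_is_up(fields):
    """True when the port reports the healthy state shown in the example."""
    return (
        fields.get("State") == "Active"
        and fields.get("Physical state") == "LinkUp"
    )


# Excerpt copied from the example output on this page.
SAMPLE = """\
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
FEC : Standard_RS-FEC - (544,514)

Troubleshooting Info
--------------------
Status Opcode : 0
Recommendation : No issue was observed
"""

if __name__ == "__main__":
    port = parse_mlxlink(SAMPLE)
    print("link up:", link_is_up(port), "at", port.get("Speed"))
```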

Alternatively, you can use the System Configuration Validation Script to obtain a full list of configuration settings.

© Copyright 2024, NVIDIA. Last updated on Jul 15, 2024.