Running Aerial cuPHY#

Aerial cuPHY provides the cuPHY library and several examples that link with the library. Here we include instructions on using MATLAB to generate TVs. Please refer to Generating TV and Launch Pattern Files for using Aerial Python mcore Module to generate TVs.

Building Aerial cuPHY#

Prerequisites#

The following instructions assume the system configuration and Aerial cuBB installation are done. If not, see the cuBB Install Guide to complete the installation or upgrade process.

After powering on the system, use the following commands to verify that the GPU and NIC are in the correct state:

# Verify GPU is detected and CUDA driver version matches the release manifest.

$ nvidia-smi

Verify that the NIC is in the correct state on the host (this is only required to run cuBB end-to-end):

# Verify NIC is detected: Example CX6-DX

$ sudo lshw -c network -businfo

Bus info          Device     Class          Description
=======================================================
pci@0000:05:00.0  eno1       network        I210 Gigabit Network Connection
pci@0000:06:00.0  enp6s0     network        I210 Gigabit Network Connection
pci@0000:b5:00.0  ens6f0     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:b5:00.1  ens6f1     network        MT2892 Family [ConnectX-6 Dx]

# Verify the link state is right. Assuming NIC port 0 is connected.

$ sudo mlxlink -d b5:00.0

Operational Info
----------------
State                           : Active
Physical state                  : LinkUp
Speed                           : 100G
Width                           : 4x
FEC                             : Standard RS-FEC - RS(528,514)
Loopback Mode                   : No Loopback
Auto Negotiation                : ON

Supported Info
--------------
Enabled Link Speed (Ext.)       : 0x000007f2 (100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.)    : 0x000002f2 (100G_4X,50G_2X,40G,25G,10G,1G)

Troubleshooting Info
--------------------
Status Opcode                   : 0
Group Opcode                    : N/A
Recommendation                  : No issue was observed.

Set Up the Host Environment#

Set up the environment by following the cuBB Installation Guide for the server type you are using.

Launch the cuBB Container#

Use the following command to launch the cuBB container:

$ sudo docker exec -it cuBB /bin/bash

Build Aerial cuPHY in the Container#

Build cuPHY in the cuBB container using the following commands:

$ cd /opt/nvidia/cuBB/cuPHY
$ cmake -Bbuild -GNinja -DCMAKE_TOOLCHAIN_FILE=cmake/toolchains/native -DCMAKE_INSTALL_PREFIX=./install
$ cmake --build build

cuPHY is, by default, built in Release mode. The option BUILD_DOCS=ON is also enabled by default to allow the make to generate the Doxygen documentation for the cuPHY library API. To disable this option, pass -DBUILD_DOCS=OFF to the CMake command line. The output directory is cuPHY/install/docs.

To put the built cuPHY headers and libraries into an installation directory so that other applications using the cuPHY library can compile and link with cuPHY, use the commands from the current build directory:

$ cmake --install build

This creates the include and lib directories under the cuPHY/install directory.

Building and running on separate servers#

When building the source code on one server, and running the binaries on another server, it might be important to use the correct toolchain for the target.

The source code directory cuPHY/cmake/toolchains contains toolchains for the following targets:

x86-64: devkit, r750, x86-64

arm: grace-cross, bf3

A new toolchain file might need to be created if using a different target.

The toolchain file defines what compiler to use, and the value of AERIAL_ARCH_TUNE_FLAGS

One way to make sure that the flag is correct, is to do the following:

Run the aerial_sdk container on the target, inside the container run the following command:

$ gcc -march=native -Q --help=target

Run the aerial_sdk container on the build server, inside the container run the following command:

$ gcc -march=<march for target> -Q --help=target

Make sure the outputs from both commands are the same. Create a toolchain file and use it when building aerial_sdk:

$ cmake -Bbuild -GNinja -DCMAKE_TOOLCHAIN_FILE=cmake/toolchains/my-target

Running the cuPHY Examples#

This section describes how to run the Aerial cuPHY standalone example programs. They read test vector data files as input. Refer to the Supported Test Vector Configurations section of the cuPHY Release Notes to determine which test vectors to use for different configurations. Do not use old test vectors from previous cuBB releases with the example programs of this release.

Generating Test Vectors using Matlab 5GModel#

Run this Matlab command:

cd('nr_matlab'); startup; [nTC, errCnt] = runRegression({'TestVector'}, {'allChannels'}, 'compact', [0, 1] );

All the cuPHY test vectors are generated and stored under nr_matlab/GPU_test_input.

Instructions for Testing cuPHY Channels Manually#

PUSCH#

Test Vectors

Match test vector name with PUSCH_gNB_CUPHY_*.h5

How to Run

Streams mode: cuPHY/build/examples/pusch_rx_multi_pipe/cuphy_ex_pusch_rx_multi_pipe -i ~/<tv_name>.h5
Graphs mode: cuPHY/build/examples/pusch_rx_multi_pipe/cuphy_ex_pusch_rx_multi_pipe -i ~/<tv_name>.h5 -m 1

Expected Outcome

Test 1 (CRC test KPI): All test cases must have zero CRC errors (only CRC errors, not correct ones, are reported when the channel is run).

PUCCH#

Test Vectors

Match test vector name with PUCCH_F*_gNB_CUPHY_*.h5

How to Run

PUCCH format 0/1/2/3: cuPHY/build/examples/pucch_rx_pipeline/cuphy_ex_pucch_rx_pipeline -i <tv_name>

Expected Outcome

cuphy_ex_pucch_Fx_receiver checks if the test vector includes PFx UCI first.
If the test-vector UCI format is not expected, it displays “No PFx UCI received”.

If the test-vector UCI format is expected, it compares UCI output.xzsd.

PRACH#

Test Vectors

Match test vector name with PRACH_gNB_CUPHY_*.h5

How to Run

cuPHY/build/examples/prach_receiver_multi_cell/prach_receiver_multi_cell -i <tv_name> -r <num_iteration> -k

Expected Outcome

prach_receiver_multi_cell compares against the reference measurements in the test vector.
Measured values are displayed and if they are within tolerance the message is displayed:
```
========> Test PASS
```

PDSCH#

Test Vectors

Match test vector name with PDSCH_gNB_CUPHY_*.h5

How to Run

PDSCH in non-AAS mode, streams: cuPHY/build/examples/pdsch_tx/cuphy_ex_pdsch_tx ~/<tv_name>.h5 2 0 0
PDSCH in non-AAS mode, graphs: cuPHY/build/examples/pdsch_tx/cuphy_ex_pdsch_tx ~/<tv_name>.h5 2 0 1

Expected Outcome

Test 1 (correctness against reference model): Channel reports correct match with reference model

PDCCH#

Test Vectors

Match test vector name with PDCCH_gNB_CUPHY_*.h5

How to Run

Streams mode: cuPHY/build/examples/pdcch/embed_pdcch_tf_signal -i ~/<tv_name>.h5 -m 0
Graphs mode: cuPHY/build/examples/pdcch/embed_pdcch_tf_signal -i ~/<tv_name>.h5 -m 1

Expected Outcome

Test 1 (correctness against reference model): Test PASS

SSB#

Test Vectors

Match test vector name with SSB_gNB_CUPHY_*.h5

How to Run

Streams mode: cuPHY/build/examples/ss/testSS -i ~/<tv_name>.h5 -m 0
Graphs mode: cuPHY/build/examples/ss/testSS -i ~/<tv_name>.h5 -m 1

Expected Outcome

Test 1 (correctness against reference model): Test PASS

CSI-RS#

Test Vectors

Match test vector name with CSIRS_gNB_CUPHY_*.h5

How to Run

Streams mode: cuPHY/build/examples/csi_rs/nzp_csi_rs_test -i <tv_name> -m 0
Graphs mode: cuPHY/build/examples/csi_rs/nzp_csi_rs_test -i <tv_name> -m 1

Expected Outcome

Test 1 (correctness against reference model): Test PASS

SRS#

Test Vectors

Match test vector name with SRS_gNB_CUPHY_*.h5

How to Run

Streams mode: cuPHY/build/examples/srs_rx_pipeline/cuphy_ex_srs_rx_pipeline -i <tv_name> -r <num_iteration> -m 0
Graphs mode: cuPHY/build/examples/srs_rx_pipeline/cuphy_ex_srs_rx_pipeline -i <tv_name> -r <num_iteration> -m 1

Expected Outcome

Test 1 (correctness against reference model): SRS reference check: PASSED!; Timing results are provided

BFC#

Test Vectors

Match test vector name with BFW_gNB_CUPHY_*.h5

How to Run

Streams mode: cuPHY/build/examples/bfc/cuphy_ex_bfc -i <tv_name> -r <num_iteration> -m 0
Graphs mode: cuPHY/build/examples/bfc/cuphy_ex_bfc -i <tv_name> -r <num_iteration> -m 1
Add -c to enable reference check (default disabled)

Expected Outcome

Test 1 (measure latency without reference check): Timing results are provided

Test 2 (correctness against reference model using -c): Test PASS; Timing results are provided

Instructions for LDPC Performance Test#

The ldpc_perf_collect.py Python script from the cuPHY repository can be used to perform error rate tests for the cuPHY LDPC decoder. There are test input files defined for Z = [64, 128, 256, 384], BG = [1,2]. The current tests check whether the block error rate (BLER, also sometimes referred to as Frame Error Rate or FER) is less than 0.1.

From the build directory, the following commands run the tests:

../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG1_Z64_BLER0.1.txt  -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG1_Z128_BLER0.1.txt -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG1_Z256_BLER0.1.txt -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG1_Z384_BLER0.1.txt -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG2_Z64_BLER0.1.txt  -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG2_Z128_BLER0.1.txt -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG2_Z256_BLER0.1.txt -f -w 800 -P
../util/ldpc/ldpc_perf_collect.py --mode test -i ../util/ldpc/test/ldpc_decode_BG2_Z384_BLER0.1.txt -f -w 800 -P

Each test input file contains multiple tests for different code rates, as specified by the number of parity nodes.

Running cuPHY Performance Testing Scripts#

aerial_sdk/testBenches provides a multi-cell multi-channel test bench to test cuPHY standalone performance. It relies on NVIDIA Multi-Process Service (MPS) to share the GPU among multiple channels. Specifically, there are two folders and their relationship can be summarized as follows:

cubb_gpu_test_bench: a C test bench that runs the multi-cell multi-channel cuPHY standalone GPU workload (that is, without I/O to and from NIC or layer 2). The input of cubb_gpu_test_bench are test vectors, a Yaml file, and some command options to run the GPU workload. The output is a buffer-XX.txt file that has the logs, channel start/end times, debug info, etc. Here XX is the number of cells used in testing.
perf: a set of Python scripts to automate performance testing using cubb_gpu_test_bench. The Python scripts can help generate the Yaml file and command options, config GPU and MPS before running cubb_gpu_test_bench; collect the test results by reading the output buffer-XX.txt from cubb_gpu_test_bench.

Generating Test Vectors using Matlab 5GModel#

Run this Matlab command:

cd <5GModel root>/nr_matlab
startup
genCfgTV_perf_ss('performance-avg.xlsm');
genCfgTV_perf_ss_bwc('performance-avg.xlsm');
genCfgTV_perf_pucch();
genCfgTV_perf_pdcch();
genCfgTV_perf_prach();
genCfgTV_perf_csirs();
genCfgTV_perf_ssb();
genCfgTV_perf_srs();

All the cuPHY Performance test vectors are generated and stored under nr_matlab/GPU_test_input.

Measuring cuPHY Performance using cubb_gpu_test_bench#

Requirements:

The performance measurements can be run using a Linux environment making one of more GPU available. Such environment is here assumed to have:
- bash or zsh as default shell
- Python 3.8+ and the following packages: numpy, pyCUDA, pyYAML
- CUDA toolkit 11.4 or above properly configured so that nvidia-cuda-mps-control and nvidia-smi are in PATH
- The executable cubb_gpu_test_bench is located in the <testBenches>/build folder.

There are three steps when measuring cell capacity using cubb_gpu_test_bench. The perf folder provides some pre defined test cases. Below is an example of 4T4R (F08) using TDD pattern DDDSUUDDDD.

Generate the JSON file that defines the use case (e.g., 8~16 peak or average cells)

python3 generate_avg_TDD.py --peak 8 9 10 11 12 13 14 15 16 --avg 0 --exact --case F08

Measure the latency of all channels based on predefined patterns
```
python3 measure.py --cuphy <testBenches>/build --vectors <test_vectors> --config testcases_avg_F08.json --uc uc_avg_F08_TDD.json --delay 100000 --gpu <GPU_ID> --freq <GPU_freq> --start <cell_start> --cap <cell_cap> --iterations 1 --slots <nSlots> --power <budget> --target <sms_prach> <sms_pdcch> <sms_pucch> <sms_pdsch> <sms_pusch> <sms_ssb> --2cb_per_sm --save_buffer --priority --prach --prach_isolate --pdcch --pdcch_isolate --pucch --pucch_isolate --tdd_pattern dddsuudddd --pusch_cascaded --ssb --csirs --groups_dl --pack_pdsch --groups_pusch --ldpc_parallel <--graph>
```
- <GPU_ID>: The ID of the GPU on which the measurements are to be run (e.g. 0 for single GPU systems)
- <GPU_freq>: The GPU clock frequency in MHz
- <cell_start>: The minimum number of cells to be tested
- <cell_cap>: The maximum number of cells to be tested. The Python scripts will run cubb_gpu_test_bench for a range of [<cell_start>, <cell_cap>] cells and collect the latency results.
- <budget>: The power budget in Watts
- <sms_channelName>: The number of streaming multiprocessors used per MPS sub-context for each channel during the run, where channelName can be “PRACH”, “PDCCH”, “PUCCH”, “PDSCH”, “PUSCH”, or “SSB”
- <--graph> Runs the measurement in graph mode. If this parameter is not included, stream mode will be used.
Note

Use --test to see which YAML file and command options the Python scripts generated without running the tests on GPU.
Visualize the latency of each channel (this step requires Python library matplotlib). We generate a compare-<date>.png file showing the CDF of the latency for all tested channels:
- If run in stream mode:
```
python3 compare.py --filename <sms_prach>_<sms_pdcch>_<sms_pucch>_<sms_pdsch>_<sms_pusch>_<sms_ssb>_sweep_streams_avg_F08.json --cells <nCell>+0
```
- If run in graph mode:
```
python3 compare.py --filename <sms_prach>_<sms_pdcch>_<sms_pucch>_<sms_pdsch>_<sms_pusch>_<sms_ssb>_sweep_graphs_avg_F08.json --cells <nCell>+0
```
  Where <nCell> is the number of cells we would like to visualize the latency results
It is possible to compare latency results of different number of cells in one figure. For instance, we can compare the latency of 8 cells and 9 cells:
```
python3 compare.py --filename <sms_prach>_<sms_pdcch>_<sms_pucch>_<sms_pdsch>_<sms_pusch>_<sms_ssb>_sweep_graphs_avg_F08.json <sms_prach>_<sms_pdcch>_<sms_pucch>_<sms_pdsch>_<sms_pusch>_<sms_ssb>_sweep_graphs_avg_F08.json --cells 8+0 9+0
```
In all cases, Aerial CUDA-Accelerated RAN offers the possibility of measuring the latency of all workloads including:
- Dynamic and heterogeneous traffic (meaning that each cell is stimulated with different test vectors and every slot sees a different allocation of the test vectors to the considered cells)
- Specific traffic models