Aerial Data Lake#
6G will be artificial intelligence (AI) native. AI and machine learning (ML) will extend through all aspects of next-generation networks, from the radio and baseband processing to the network core, including system management, orchestration, and dynamic optimization processes. GPU hardware, together with programming frameworks, will be essential to realizing this vision of a software-defined, AI-native communication infrastructure.
The application of AI/ML to the physical layer has been a particularly active research topic.
There is no AI without data. While the synthetic data generation capabilities of the Aerial Omniverse Digital Twin (AODT) and Sionna/SionnaRT are essential to a research project, the availability of over-the-air (OTA) waveform data from real-time systems is equally important. This is the role of Aerial Data Lake: a platform for capturing OTA radio frequency (RF) data from virtual radio access networks (vRANs) built on the Aerial CUDA-Accelerated RAN. Aerial Data Lake consists of a data capture application (app) running on the base station (BS) distributed unit (DU), a database of samples collected by the app, and an application programming interface (API) for accessing the database.
Target Audience#
Industry and university researchers and developers looking to bring ML to the physical layer with the end goal of benchmarking on OTA testbeds like NVIDIA ARC-OTA or other GPU-based BSs.
Key Features#
Aerial Data Lake has the following features:
Real-time capture of RF data from OTA testbed
Aerial Data Lake is designed to operate with gNBs that are built on the Aerial CUDA-Accelerated RAN and that employ the Small Cell Forum FAPI interface between L2 and L1; one example system is the NVIDIA ARC-OTA network testbed. I/Q samples from O-RUs connected to the GPU platform via an O-RAN 7.2x split fronthaul interface are delivered to the host CPU and exported to the Aerial Data Lake database.
Aerial Data Lake APIs to access the RF database
The data passed to layer 2 via RX_Data.Indication, and the data received from layer 2 in UL_TTI.Request, are exported to the database. The fields in these data structures form the basis of the database access APIs (see the example query after this list).
Scalable and time-coherent over an arbitrary number of BSs
The data collection app runs on the same CPU that supports the DU: the collector occupies a single core, and the database runs on free cores. Because each BS is responsible for collecting its own uplink data, the collection process scales as more BSs are added to the network testbed. Database entries are time-stamped, so data collected across multiple BSs can be used in a training flow in a time-coherent manner.
Use in conjunction with pyAerial to generate training data for neural network physical layer designs
Aerial Data Lake can be used in conjunction with the NVIDIA pyAerial CUDA-Accelerated Python L1 library. Using the Data Lake database APIs, pyAerial can access RF samples in a Data Lake database and transform those samples into training data for all the signal processing functions in an uplink or downlink pipeline.
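As a minimal illustration of how these APIs are used, the sketch below queries per-UE PUSCH metadata from the fapi table described in the Design section. It assumes a ClickHouse server on localhost and uses clickhouse_connect, the open-source ClickHouse Python client; the access layer in your deployment may differ.

# Sketch: query per-UE PUSCH metadata from the Data Lake fapi table.
# Assumes a local ClickHouse server; clickhouse_connect is the open-source
# ClickHouse Python client, not part of Aerial itself.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
result = client.query(
    "SELECT rnti, SFN, Slot, rbStart, rbSize, tbCrcStatus "
    "FROM fapi ORDER BY TsTaiNs LIMIT 10"
)
for rnti, sfn, slot, rb_start, rb_size, crc in result.result_rows:
    print(f"RNTI {rnti}: SFN.slot {sfn}.{slot}, "
          f"PRBs {rb_start}..{rb_start + rb_size - 1}, CRC {crc}")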
Design#
Aerial Data Lake sits beside the Aerial L1 and copies data useful for machine learning into an external database.

Figure 1: The Aerial Data Lake data capture platform as part of the gNB.#
Uplink I/Q data from one or more O-RAN radio units (O-RUs) is delivered to GPU memory, where it is both processed by the Aerial L1 PUSCH baseband pipeline and delivered to host CPU memory. The Aerial Data Lake collector process writes the I/Q samples to the fh table of the Aerial Data Lake database. The fh table has columns for the SFN, the slot, the I/Q samples (fhData), and the start time of that SFN.slot (TsTaiNs).
The collector app saves the data that the L2 sends to the L1 to describe UL OTA transmissions (UL_TTI.Request messages), as well as the data returned to the L2 via RX_Data.Indication and CRC.Indication. This data is written to the fapi database table. These messages and their fields are described in the SCF 5G FAPI PHY specification, version 10.02, sections 3.4.3, 3.4.7, and 3.4.8.
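For example, the I/Q samples belonging to a given transmission can be located by matching the shared TsTaiNs timestamp between the two tables. The following is a sketch under the same assumptions as above (local server, clickhouse_connect); the element type of fhData depends on the capture configuration.

# Sketch: pair fh I/Q samples with the fapi metadata of the same slot by
# joining the two tables on the shared TsTaiNs timestamp column.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
result = client.query(
    "SELECT fapi.rnti, fapi.rbStart, fapi.rbSize, fh.SFN, fh.Slot, fh.fhData "
    "FROM fapi INNER JOIN fh ON fapi.TsTaiNs = fh.TsTaiNs "
    "ORDER BY fapi.TsTaiNs LIMIT 1"
)
rnti, rb_start, rb_size, sfn, slot, fh_data = result.result_rows[0]
print(f"RNTI {rnti}, SFN.slot {sfn}.{slot}: {len(fh_data)} I/Q entries")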
Each gNB in a network testbed collects data from all O-RUs associated with it. That is, data collection over the span of a network is performed in a distributed manner, with each gNB building its own local database. Training can be performed locally at each gNB, enabling site-specific optimizations. Because the data in each database is time-stamped, the local databases can also be consolidated at a centralized compute resource and training performed on the time-aligned aggregated data. In cases where the Aerial PUSCH pipeline was unable to decode a transmission due to channel conditions, retransmissions can serve as ground truth as long as one of them succeeds, allowing the user to evaluate algorithms that outperform the originals.
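A minimal sketch of that consolidation step, assuming two gNB-local ClickHouse instances reachable under the hypothetical hostnames gnb-a and gnb-b, and using pandas to time-align the aggregated rows:

# Sketch: aggregate fapi rows from two gNB-local databases and time-align
# them on TsTaiNs (hostnames are placeholders for the testbed machines).
import clickhouse_connect
import pandas as pd

frames = []
for host in ("gnb-a", "gnb-b"):
    client = clickhouse_connect.get_client(host=host)
    df = client.query_df("SELECT TsTaiNs, rnti, tbCrcStatus FROM fapi")
    df["gnb"] = host
    frames.append(df)

# A global sort on the TAI timestamps yields one time-coherent data set.
merged = pd.concat(frames).sort_values("TsTaiNs").reset_index(drop=True)
print(merged.head())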
The Aerial Data Lake database storage requirements depend on the number of O-RUs, the antenna configuration of the O-RUs, the carrier bandwidth, the TDD pattern, and the number of samples to be collected. Collecting I/Q samples of 1 million transmissions from a single 4T4R O-RU employing a single 100 MHz carrier consumes approximately 660 GB of storage.
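A back-of-the-envelope check of that figure, assuming one half-precision I/Q pair (4 bytes) is stored per resource element; the exact on-disk format may differ, so this is illustrative only:

# Rough storage estimate: 100 MHz NR carrier (273 PRBs), 14 OFDM symbols per
# slot, 4 RX antennas, 4 bytes per complex sample (fp16 I + fp16 Q).
PRBS, SUBCARRIERS, SYMBOLS, ANTENNAS, BYTES_PER_SAMPLE = 273, 12, 14, 4, 4

bytes_per_slot = PRBS * SUBCARRIERS * SYMBOLS * ANTENNAS * BYTES_PER_SAMPLE
total_gb = 1_000_000 * bytes_per_slot / 1e9
print(f"{bytes_per_slot / 1e6:.2f} MB per slot, ~{total_gb:.0f} GB per million")
# ~0.73 MB per slot and ~734 GB per million transmissions, the same order as
# the ~660 GB quoted above (storing fewer symbols reduces the footprint).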
The Aerial Data Lake database holds the fronthaul RF data. For many training applications, however, access to data at other nodes of the receive pipeline is required. A pyAerial pipeline, together with the Data Lake database APIs, can access samples from an Aerial Data Lake database and transform that data into training data for any function in the pipeline.
Figure 2 illustrates data ingress from a Data Lake database into a pyAerial pipeline, which uses standard Python file I/O to generate training data for a soft de-mapper.

Figure 2: pyAerial is used in conjunction with the NVIDIA data collection platform, namely, Aerial Data Lake to build training data sets for any node in the layer-1 downlink or uplink signal processing pipeline. The example shows a Data Lake database of over-the-air samples transformed into training data for a neural network soft de-mapper.#
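As a sketch of the Figure 2 flow, the code below writes (feature, label) training pairs for a neural soft de-mapper to file. In the real flow, the equalized symbols and noise estimates come from a pyAerial PUSCH pipeline fed by Data Lake samples; here they are synthesized as QPSK symbols so the example is self-contained.

# Sketch: build (noisy symbol, bit label) pairs for a soft de-mapper.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bits = rng.integers(0, 2, size=(n, 2))              # 2 bits per QPSK symbol
symbols = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)
noise_var = 0.1
noise = rng.normal(scale=np.sqrt(noise_var / 2), size=n) \
    + 1j * rng.normal(scale=np.sqrt(noise_var / 2), size=n)
rx = symbols + noise

# Features: received I/Q plus noise variance; labels: transmitted bits.
features = np.column_stack([rx.real, rx.imag, np.full(n, noise_var)])
np.save("demapper_features.npy", features.astype(np.float32))
np.save("demapper_labels.npy", bits.astype(np.float32))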
Installation#
Aerial Data Lake is compiled by default as part of cuphycontroller. If you would like to record fresh data every time cuphycontroller is started, see the section on Fresh Data.
Start by installing the ClickHouse database on the server collecting the data. The command below downloads and runs an instance of the ClickHouse server in a Docker container.
docker run -d \
--network=host \
-v $(realpath ./ch_data):/var/lib/clickhouse/ \
-v $(realpath ./ch_logs):/var/log/clickhouse-server/ \
--cap-add=SYS_NICE --cap-add=NET_ADMIN --cap-add=IPC_LOCK \
--name my-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server
By default, ClickHouse will not drop large tables and returns an error if this is attempted. The clickhouse-cpp library does not surface exceptions, so to avoid what looks like a cuphycontroller crash, we recommend allowing ClickHouse to drop large tables using the following command:
sudo touch './ch_data/flags/force_drop_table' && sudo chmod 666 './ch_data/flags/force_drop_table'
Usage#
In the cuphycontroller adapter YAML configuration file, enable data collection by specifying a core, then start cuphycontroller as usual. The core should be on the same NUMA node as the rest of cuphycontroller, i.e., it should follow the same pattern as the other cores. An example can be found commented out in cuphycontroller_P5G_FXN_R750.yaml.
cuphydriver_config:
# Fields added for data collection
datalake_core: 19 # Core on which data collection runs, e.g., an isolated odd core on R750 or any isolated core on Gigabyte
datalake_address: localhost
datalake_samples: 1000000 # Number of samples to collect for each UE/RNTI. Defaults to 1M
When data collection is enabled, the DataLake object is created and DataLake::dbInit() initializes the two tables in the database. After cuphycontroller runs the PUSCH pipeline, cuphycontroller calls DataLake::notify() with the addresses of the data to be saved, which DataLake then saves. When DataLake::waitForLakeData() wakes up, it calls DataLake::dbInsert(), which appends the data to the respective ClickHouse columns, then sleeps waiting for more data. Once 50 PUSCH transmissions have been stored, or a total of datalake_samples have been received, the columns are appended to a Clickhouse::Block and inserted into the respective table.
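The batching behavior can be pictured with a short sketch (Python for brevity; the actual implementation is the C++ DataLake class described above):

# Sketch of the collector's batching pattern: buffer per-transmission rows
# and insert them into ClickHouse in blocks of 50, the threshold mentioned
# above. Column names follow the fapi table; the values are illustrative.
import clickhouse_connect

BATCH = 50
client = clickhouse_connect.get_client(host="localhost")
buffer = []

def notify(sfn: int, slot: int, rnti: int) -> None:
    """Called once per decoded PUSCH transmission."""
    buffer.append((sfn, slot, rnti))
    if len(buffer) >= BATCH:
        client.insert("fapi", buffer, column_names=["SFN", "Slot", "rnti"])
        buffer.clear()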
Multi-Cell#
Data Lake can be configured to capture data from multiple cells controlled by the same L1. The Jupyter notebook datalake_pusch_multicell.ipynb shows an example using data captured from multiple cells. To capture the data for this example, cell 41 was controlled by testmac and cell 51 by a real L2. To do this, the cuphycontroller L2 interface must be configured to work with two cells and two L2s, and testmac must be configured to use /dev/shm/nvipc1 rather than /dev/shm/nvipc. The L2 should use the slot pattern DDDSU. Core allocations will need to be adjusted to suit the server being used.
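Once captured, per-cell data can be separated or compared via the CellId column of the fh table; for example (a sketch against the example capture, in which the two cells are 41 and 51):

# Sketch: list the same slots as received by the two cells, using the
# CellId column of the fh table.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
df = client.query_df(
    "SELECT TsTaiNs, CellId, SFN, Slot, nUEs FROM fh "
    "WHERE CellId IN (41, 51) ORDER BY TsTaiNs, CellId"
)
print(df)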
Using Data Lake in Notebooks#
Follow the pyAerial instructions to build and launch the pyAerial container. It must be run on a server with a GPU.
See the pyAerial examples section for details.
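Inside a notebook, the database can also be queried directly. A minimal example cell, assuming the ClickHouse server from the Installation section is reachable from the container:

# Pull the fapi table into a pandas DataFrame for exploration alongside
# the pyAerial example notebooks.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")
fapi = client.query_df("SELECT * FROM fapi ORDER BY TsTaiNs")
print(fapi.head())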
Database Administration#
Note

In the commands below, this denotes a shell prompt:

$

and this denotes a ClickHouse client prompt:

aerial-gnb :)
Database Import#
Example fapi and fh tables are included in the Aerial CUDA-Accelerated RAN container. These tables can be imported into the ClickHouse database by copying them from the container to the ClickHouse user_files folder and then using the client to import them:
$ docker cp cuBB:/opt/nvidia/cuBB/pyaerial/notebooks/data/fh.parquet .
$ docker cp cuBB:/opt/nvidia/cuBB/pyaerial/notebooks/data/fapi.parquet .
$ sudo cp *.parquet ./ch_data/user_files/
A ClickHouse client is needed to interact with the server. To download and run it, do the following:
curl https://clickhouse.com/ | sh
./clickhouse client
aerial@aerial-gnb:~$ ./clickhouse client
ClickHouse client version 24.3.1.1159 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 24.3.1.
aerial-gnb :)
This is the ClickHouse client prompt. Use the client to import the sample data into the ClickHouse server with these commands:
aerial-gnb :) create table fh ENGINE = MergeTree primary key TsTaiNs settings allow_nullable_key=1 as select * from file('fh.parquet',Parquet)
aerial-gnb :) create table fapi ENGINE = MergeTree primary key TsTaiNs settings allow_nullable_key=1 as select * from file('fapi.parquet',Parquet)
Now check that they have been imported:
aerial-gnb :) select table, formatReadableSize(sum(bytes)) as size from system.parts group by table
The output will look similar to this:
SELECT
`table`,
formatReadableSize(sum(bytes)) AS size
FROM system.parts
GROUP BY `table`
Query id: 95451ea7-6ea9-4eec-b297-15de78036ada
┌─table─┬─size─────┐
│ fh    │ 5.55 MiB │
│ fapi  │ 3.88 KiB │
└───────┴──────────┘
You now have three slots of PUSCH transmissions from five to six real UEs, received by two cells, loaded in the database, and you can run the example notebooks.
Database Queries#
To show some information about the entries (rows), you can run the following at the ClickHouse client prompt:
Show counts of transmissions for all RNTIs
aerial-gnb :) select rnti, count(*) from fapi group by rnti
Output:
SELECT
rnti,
count(*)
FROM fapi
GROUP BY rnti
Query id: 603141a2-bc02-4950-8e9e-1d3f366263c6
┌──rnti─┬─count()─┐
│  1624 │       3 │
│ 20000 │       3 │
│ 20216 │       3 │
│ 47905 │       2 │
│ 53137 │       2 │
│ 57375 │       3 │
│ 62290 │       3 │
└───────┴─────────┘
Show selected fields from all rows of the fapi table
aerial-rf-gnb :) from fapi select TsTaiNs,SFN,Slot,nUEs,rbStart,rbSize,tbCrcStatus,CQI order by TsTaiNs,rbStart
Output:
SELECT
TsTaiNs,
SFN,
Slot,
nUEs,
rbStart,
rbSize,
tbCrcStatus,
CQI
FROM fapi
ORDER BY
TsTaiNs ASC,
rbStart ASC
Query id: f42d9192-1de1-4cc6-b3eb-932b22ecab3e
┌───────────────────────TsTaiNs─┬─SFN─┬─Slot─┬─nUEs─┬─rbStart─┬─rbSize─┬─tbCrcStatus─┬───────CQI─┐
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │       0 │      8 │           1 │ -7.352562 │
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │       0 │      5 │           0 │  31.75534 │
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │       5 │      5 │           0 │ 30.275444 │
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │      10 │      5 │           0 │ 31.334328 │
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │      15 │      5 │           0 │ 30.117304 │
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │      20 │      5 │           0 │ 29.439499 │
│ 2024-07-19 10:42:46.272000000 │ 391 │    4 │    7 │      25 │    248 │           0 │ 25.331459 │
│ 2024-07-19 10:42:47.292000000 │ 493 │    4 │    6 │       0 │      8 │           1 │ -7.845479 │
│ 2024-07-19 10:42:47.292000000 │ 493 │    4 │    6 │       0 │      5 │           0 │ 29.412682 │
│ 2024-07-19 10:42:47.292000000 │ 493 │    4 │    6 │       5 │      5 │           0 │ 30.186537 │
│ 2024-07-19 10:42:47.292000000 │ 493 │    4 │    6 │      10 │      5 │           0 │ 30.366463 │
│ 2024-07-19 10:42:47.292000000 │ 493 │    4 │    6 │      15 │      5 │           0 │ 29.590645 │
│ 2024-07-19 10:42:47.292000000 │ 493 │    4 │    6 │      20 │    253 │           0 │ 28.494812 │
│ 2024-07-19 10:42:48.212000000 │ 585 │    4 │    6 │       0 │      8 │           1 │ -8.030928 │
│ 2024-07-19 10:42:48.212000000 │ 585 │    4 │    6 │       0 │      5 │           0 │ 31.359173 │
│ 2024-07-19 10:42:48.212000000 │ 585 │    4 │    6 │       5 │      5 │           0 │ 30.353489 │
│ 2024-07-19 10:42:48.212000000 │ 585 │    4 │    6 │      10 │      5 │           0 │   29.3033 │
│ 2024-07-19 10:42:48.212000000 │ 585 │    4 │    6 │      15 │      5 │           0 │ 28.298597 │
│ 2024-07-19 10:42:48.212000000 │ 585 │    4 │    6 │      20 │    253 │           0 │ 26.621593 │
└───────────────────────────────┴─────┴──────┴──────┴─────────┴────────┴─────────────┴───────────┘
19 rows in set. Elapsed: 0.002 sec.
Show start times of fh table
aerial-rf-gnb :) from fh select TsTaiNs,TsSwNs,SFN,Slot,CellId,nUEs
Output:
SELECT
TsTaiNs,
TsSwNs,
SFN,
Slot,
CellId,
nUEs
FROM fh
Query id: 6926d88e-6e9c-4818-b127-aef96913cfc0
┌───────────────────────TsTaiNs─┬────────────────────────TsSwNs─┬─SFN─┬─Slot─┬─CellId─┬─nUEs─┐
│ 2024-07-19 10:42:46.272000000 │ 2024-07-19 10:42:46.273113183 │ 391 │    4 │     41 │    7 │
│ 2024-07-19 10:42:46.272000000 │ 2024-07-19 10:42:46.273113183 │ 391 │    4 │     51 │    7 │
│ 2024-07-19 10:42:47.292000000 │ 2024-07-19 10:42:47.293139202 │ 493 │    4 │     41 │    6 │
│ 2024-07-19 10:42:47.292000000 │ 2024-07-19 10:42:47.293139202 │ 493 │    4 │     51 │    6 │
│ 2024-07-19 10:42:48.212000000 │ 2024-07-19 10:42:48.213139622 │ 585 │    4 │     41 │    6 │
│ 2024-07-19 10:42:48.212000000 │ 2024-07-19 10:42:48.213139622 │ 585 │    4 │     51 │    6 │
└───────────────────────────────┴───────────────────────────────┴─────┴──────┴────────┴──────┘
6 rows in set. Elapsed: 0.002 sec.
Fresh Data#
The database of I/Q samples grows quickly. If you want fresh data on every run, you can have the tables removed automatically by uncommenting the following lines in cuPHY-CP/data_lakes/data_lakes.cpp:
//dbClient->Execute("DROP TABLE IF EXISTS fapi");
//dbClient->Execute("DROP TABLE IF EXISTS fh");
Dropping Data#
You can manually drop all of the data from the database with these commands:
aerial-gnb :) drop table fh
aerial-gnb :) drop table fapi
Notes and Known Limitations#
Currently, Data Lake converts complex half-precision floating point values to floats in C++, which takes ~2 ms per cell. During that time, while samples are being inserted into the database, PUSCH notifications can be missed, and a note is printed in the PHY log:
[CTL.DATA_LAKE] Notify not called for 39.4 dbInsert busy