NVIDIA Tegra
DRIVE 5.0 Linux Open Source Software

Development Guide
5.0.10.3 Release


 
Performance: Achieving Improvements
 
File System Type and Mounting Options
USB Storage Drive
Enumerate the USB Storage as a Super-speed Device
Additional USB Devices Sharing USB bandwidth
Estimate Throughput Achievement
Effects of Page-cache
Latency Sensitive Applications that Run in Bursts
Running check_file_cache.sh
This topic reviews write performance issues with a USB drive and provides realistic approaches to address them. It focuses on input/output side bottlenecks that limit performance when using a USB drive.
To tackle performance issues, first evaluate whether they are due to input/output side bottlenecks or CPU-side bottlenecks. CPU-side bottlenecks are not addressed in this topic.
File System Type and Mounting Options
The file system used on the USB storage and the options with which it is mounted can affect performance. Carefully review both to see if improvements can be made.
In the SDK, the default mount mode is asynchronous. When using the provided rootfs, the configuration file for modifying the mount options is available at:
/etc/usbmount/usbmount.conf
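For example, to switch between synchronous and asynchronous mounting, edit the mount options in that file. The excerpt below is a minimal sketch assuming the stock Debian-style usbmount configuration format; verify the variable name against the file shipped in your rootfs:
# In /etc/usbmount/usbmount.conf (assumed format), omit "sync" from the
# mount options to mount asynchronously:
MOUNTOPTIONS="noexec,nodev,noatime,nodiratime"
# Adding "sync" here would force synchronous mounting instead:
# MOUNTOPTIONS="sync,noexec,nodev,noatime,nodiratime"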
To check the current file system and mounting options used, execute the following mount command:
mount | grep /dev/sd
/dev/sda on /media/usb0 type ext4 (rw,nodev,noexec,noatime,nodiratime,data=ordered)
The following can be seen from the output:
1. ext4 is the file system.
ext4 offers a good mix of performance and journaling capabilities.
Other file systems, such as btrfs or ReiserFS, can be benchmarked to see if they offer benefits for your use case.
Asynchronous mounting is used, as indicated by the absence of the sync keyword.
2. If synchronous mount is performed, the following mount command output is displayed.
mount | grep /dev/sd
/dev/sda on /media/usb0 type ext4 (rw,nodev,sync,noexec,noatime,nodiratime,data=ordered)
When media is mounted using the sync option:
All read/writes performed on the media are blocked until the data is fetched/written from/to media.
When writes are performed, write-combining is not executed.
As a result, poor performance numbers are exhibited. Avoid synchronous mount to alleviate performance issues. Use asynchronous mount to enable software cache and provide better performance.
3. The file system independent options noatime,nodiratime indicate that file access time bookkeeping, which hampers read/write performance, is disabled.
4. The file system dependent option data=ordered (or data=writeback) on ext4, which offers data/metadata journaling capability, disables data journaling to enhance performance. A combined mount example is sketched below.
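Putting these options together, the drive can be remounted manually for a quick experiment. The following is a minimal sketch assuming the device and mount point from the example above (/dev/sda on /media/usb0); because ext4 does not permit changing the data= mode on a live remount, the drive is unmounted first:
umount /media/usb0
mount -t ext4 -o async,noexec,nodev,noatime,nodiratime,data=writeback /dev/sda /media/usb0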
USB Storage Drive
The capability of USB storage drives varies widely. Even with high-speed (USB 2.0) or super-speed (USB 3.0) drives, performance varies considerably depending on the vendor and the storage medium used (hard drive or solid-state drive). Whatever storage solution is used, benchmark it to determine whether it meets your throughput requirements. For best results, use an SSD-based USB storage device with USB 3.0 or better capability.
For example, the following has been seen to offer good performance numbers:
Siig USB 3.1 to SATA 2.5 External Hard Drive Enclosure USB 3.1 (Gen 2) + Samsung 850 PRO - 256GB - 2.5-Inch SATA III Internal SSD (MZ-7KE256BW)
Note:
Tegra codename Parker systems support up to USB 3.0 speeds for USB interfaces.
Perform the following benchmarking checks.
Enumerate the USB Storage as a Super-speed Device
USB topology or incompatibility between the USB drive and the Tegra system affects performance numbers. Ensure the USB storage drive is correctly enumerated in super-speed mode as follows:
1. Connect the USB 3.0 drive to the USB 3.0 port on the device.
Consult the Hardware Connectors in Setting Up Your DRIVE PX 2 Platform to locate the USB port on your device.
2. Use the following command to verify the enumeration status.
lsusb -t
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-tegra/3p, 5000M
|__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=usb-storage, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-tegra/4p, 480M
Where:
5000M indicates that the USB storage device is enumerated as a super-speed device.
If the speed is 480M or lower, ensure that a certified USB 3.0 cable is used to connect the drive to a USB 3.0 port. If a hub is used, ensure that the hub supports super-speed.
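Alternatively, the negotiated speed of an individual device can be read from sysfs. The device path below (bus 2, port 1) is an assumption based on the lsusb output above; substitute the path of your device:
# Prints 5000 for a super-speed device, 480 for a high-speed device.
cat /sys/bus/usb/devices/2-1/speed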
Additional USB Devices Sharing USB bandwidth
Super-speed devices on a single USB bus share USB bandwidth. Some root ports can also share bandwidth with other root ports; for example, on Tegra codename Parker systems, ports 2 and 3 share super-speed bandwidth. Consequently, to ensure the best performance numbers, restrict the USB devices used to the minimum number needed. Additionally, benchmark with and without the additional USB devices to examine how performance is affected.
Estimate Throughput Achievement
USB 3.0 devices offer, at best, 5 Gbps of throughput. However, for write speeds, most products offer about 2.5 Gbps. Additionally, system load affects USB write performance. For example, consider a use case where camera data is captured and a media file is played on the display while USB writes are performed. For the best write performance, benchmark the USB drive while replicating the different use cases that apply to your product.
The following example demonstrates how best to benchmark. The example assumes the USB storage is mounted at /media/usb0 and has more than 10 GB of free space.
To benchmark without special applications in the background
Execute the following command.
rm -rf /media/usb0/* ; sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/media/usb0/4GB bs=1G count=10 oflag=direct,sync
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 27.1229 s, 396 MB/s
Where: the direct,sync options are used with the dd command to bypass page caching so that raw USB throughput can be determined. These options correspond directly to the O_DIRECT/O_SYNC flags that can be used with the open() system call.
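To confirm this correspondence on the target, trace the open call that dd issues. This is a quick sketch that assumes strace is installed on the target; the openat() call for the output file should carry the O_DIRECT and O_SYNC flags:
strace -e trace=openat dd if=/dev/zero of=/media/usb0/tmp bs=1M count=1 oflag=direct,sync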
To benchmark with expected usecases
USB storage devices must be benchmarked in an environment that resembles the actual system and application load under which the USB read/write application runs. Run applications that emulate the typical expected use cases while USB writes are performed.
For example, run the provided sample camera capture application along with the gears application to test bandwidth usage while multiple applications access the display.
On console 1:
startx &
export DISPLAY=:0
./x11/gears -1 &
./nvipp_ssc -cf ./e2379b_c01.conf -c dvp-ar0231-rccb-raw12-1920x1208-ab -v 1 -d 0 -w 1 --aggregate 2
On console 2:
rm -rf /media/usb0/* ; sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/media/usb0/4GB bs=1G count=10 oflag=direct,sync
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 28.14 s, 382 MB/s
As expected, the USB write performance dropped. Repeat the exercise with the typical use cases expected in the system to determine the lowest USB write performance that can be observed. These results represent the best-case performance numbers possible in the system. Applications performing USB writes must be designed to expect an average write performance (over a long period of time) below these numbers.
Effects of Page-cache
Page cache is used by applications to help guard against performance loss due to arbitrary write sizes and the bursty nature of writes. By bypassing page caching, as demonstrated in the benchmarking steps above, the average raw USB throughput that can be expected over a long period of time, under the expected system load, is determined.
However, applications do not resemble the dd command in the way writes are performed. Typical applications tend to write data in bursts with some time gap between individual writes. For example, if camera images are captured and stored to USB at 15 FPS, a write of one frame every ~66 ms is expected. Also, the amount of data written per write call is typically on the order of a few megabytes or less. USB throughput at the bus level also tends to be bursty.
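Such a bursty writer can be emulated with a simple loop. The sketch below assumes a 4 MB frame size at roughly 15 FPS for about 30 seconds; substitute the frame size and rate of your actual application:
# Append one ~4 MB "frame" every ~66 ms (roughly 15 FPS), 450 times.
for i in $(seq 1 450); do
    dd if=/dev/zero of=/media/usb0/frames.bin bs=4M count=1 \
        oflag=append conv=notrunc status=none
    sleep 0.066
done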
Page caches decouple application writes from the actual writes to the USB storage device. Writes coming in from application space are buffered and written out later by the virtual memory (VM) subsystem.
To benchmark with page cache enabled with a realistic application level write size
Execute the following command.
#rm -rf /media/usb0/* ; sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/media/usb0/4GB bs=9M count=455
455+0 records in
455+0 records out
4293918720 bytes (4.3 GB) copied, 8.89803 s, 483 MB/s
The resulting write performance is misleading. The 483 MB per second result indicates that, for the current workload of 455 writes of 9 MB each, the write performance is on average 483 MB per second. However, if the application scenario involves writing to the USB drive for a long period of time, this write performance is expected to converge toward the number obtained previously with page caches disabled and large writes (382 MBps). The following experiment confirms this:
To benchmark with page cache over a longer period of time
Execute the following command.
#rm -rf /media/usb0/* ; sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/media/usb0/4GB bs=9M count=10000
10000+0 records in
10000+0 records out
94371840000 bytes (94 GB) copied, 244.447 s, 386 MB/s
In essence, page caches must be used by applications to avoid latency issues. However, if page caches are enabled during a benchmarking exercise, the benchmark must run over a sufficiently long period of time to avoid artificially inflated numbers.
Latency Sensitive Applications that Run in Bursts
For applications that are latency sensitive and tend to run in bursts, adjusting the virtual memory configuration parameters and the type of read/write calls used can avoid the I/O throttling that causes latency issues. Use the following guidelines in such cases:
Asynchronous writes—guard against latencies due to system call delays by using asynchronous I/O calls (aio_write() system call) over synchronous I/O calls (write() system call).
Page cache size—Page caches are limited on the basis of the available RAM at any given point in time. The maximum size to which page caches can grow is determined based on these Virtual Memory system configuration numbers:
#sysctl -a | grep -i 'dirty_background\|dirty_ratio\|dirty_bytes'
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_ratio = 20
These numbers can be configured as an absolute number (dirty_background_bytes / dirty_bytes) or as a percentage of available RAM at a given moment (dirty_background_ratio / dirty_ratio).
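These parameters can be adjusted at runtime with sysctl. The values below are illustrative assumptions, not recommendations; evaluate appropriate values for your system:
sysctl -w vm.dirty_background_ratio=15
sysctl -w vm.dirty_ratio=40
# To persist the settings across reboots, add the same keys to /etc/sysctl.conf.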
For purposes of this document, ratios are used instead of absolute numbers. Kernel documentation of these parameters is available at:
Kernel/Documentation/sysctl/vm.txt
The relevant section extractions include:
dirty_background_ratio—contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.
dirty_ratio—contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.
Even though page caching decouples an application's write() calls from the actual transfer of the data to the storage device, when the page cache grows beyond ((vm.dirty_background_ratio + vm.dirty_ratio)/2) percent of available RAM, write calls are throttled for several milliseconds until the data reaches the storage device. This leads to unpredictable latencies on the application side.
The only way to guard against this is to evaluate the appropriate values of vm.dirty_background_ratio and vm.dirty_ratio for the given system and increase them if necessary. However, increasing the size of the page cache also means that data is buffered in RAM for longer durations; in case of abrupt power cuts or system hangs, data can be lost.
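As a worked example of the threshold (the available-memory figure is borrowed from the check_file_cache.sh message shown below): with the default ratios of 10 and 20 and 7270196 kB of available memory, throttling starts once the file cache exceeds roughly 1 GB:
# threshold = ((dirty_background_ratio + dirty_ratio) / 200) * available_kB
echo $(( (10 + 20) * 7270196 / 200 ))    # prints 1090529 (kB), about 1.04 GB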
Running check_file_cache.sh in the background while running the USB write application shows whether I/O throttling occurs, by printing the following message:
>> file cache (1465880 kB) is exceeding maximum threshold (40/2 percent of 7270196 kB)
If check_file_cache.sh is terminated with CTRL+C after the USB write application ends, it also displays the corrective action, adjusting dirty_background_ratio / dirty_ratio, that can help avoid I/O throttling:
>> Least amount of memory available during this run: 7259748 kB
>> Maximum file cache used during this run : 1631580 kB
>> Please adjust either dirty_background_ratio (or) dirty_ratio such that:
>> 1631580 <= ((dirty_background_ratio+dirty_ratio)/200)*7259748
Apply the same process to the sample dd application using the following procedure:
1. Start check_file_cache.sh before starting the dd application.
2. After the dd application ends, end check_file_cache.sh by pressing CTRL+C. Output similar to the following is displayed.
For console 1:
MemAvailable: 7274592 kB Dirty: 0 kB Writeback: 0 kB
MemAvailable: 7272428 kB Dirty: 621704 kB Writeback: 581784 kB
file cache (1154192 kB) is exceeding maximum threshold (30/2 percent of 7272668 kB)
MemAvailable: 7272668 kB Dirty: 621576 kB Writeback: 532616 kB
file cache (1105080 kB) is exceeding maximum threshold (30/2 percent of 7272668 kB)
MemAvailable: 7272668 kB Dirty: 621576 kB Writeback: 483504 kB
^C****
Least amount of memory available during this run: 7260508 kB
Maximum file cache used during this run: 1222872 kB
Please adjust either dirty_background_ratio (or) dirty_ratio such that:
1222872 <= ((dirty_background_ratio+dirty_ratio)/200)*7260508
****
For console 2:
#rm -rf /media/usb0/* ; sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/media/usb0/4GB bs=9M count=455
455+0 records in
455+0 records out
4293918720 bytes (4.3 GB) copied, 8.89803 s, 483 MB/s
As shown by the above output, I/O throttling occurs. Adjust the dirty_ratio setting to a higher number to see if it helps:
For console 1:
#echo 40 > /proc/sys/vm/dirty_ratio
#./check_file_cache.sh
50/2 percent of available memory can be used as file cache w/o IO throttling
MemAvailable: 7274592 kB Dirty: 0 kB Writeback: 0 kB
^C****
Least amount of memory available during this run: 7259396 kB
Maximum file cache used during this run : 1768280 kB
Please adjust either dirty_background_ratio (or) dirty_ratio such that:
1768280 <= ((dirty_background_ratio+dirty_ratio)/200)*7259396
****
For console 2:
#rm -rf /media/usb0/* ; sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/media/usb0/4GB bs=9M count=455
455+0 records in
455+0 records out
4293918720 bytes (4.3 GB) copied, 7.53168 s, 570 MB/s
Now I/O throttling does not occur, which solves the latency issue (as indicated by the output from check_file_cache.sh and also by the throughput numbers).
Tip:
1. Run check_file_cache.sh for the entire duration of the application that performs USB writes. If the file caches increase continuously, indicated by I/O throttling continuing to occur despite increasing vm.dirty_background_ratio and vm.dirty_ratio to the fullest extent, then I/O throttling cannot be avoided for the given application scenario and test setup.
2. Adjusting the VM configuration does not increase the average USB throughput over a long period of time. The average throughput that can be expected is still the one shown in To benchmark with page cache over a longer period of time.
Running check_file_cache.sh
The check_file_cache.sh script is as follows.
#cat check_file_cache.sh
#!/bin/bash

function finish() {
    echo "****"
    echo "Least amount of memory available during this run: $worst_available kB"
    echo "Maximum file cache used during this run : $worst_cache kB"
    echo " Please adjust either dirty_background_ratio (or) dirty_ratio such that:"
    echo " $worst_cache <= ((dirty_background_ratio+dirty_ratio)/200)*$worst_available"
    echo "****"
}

trap finish EXIT

dirty_br=$(cat /proc/sys/vm/dirty_background_ratio)
dirty_r=$(cat /proc/sys/vm/dirty_ratio)

# IO throttling starts once the file cache exceeds
# ((dirty_background_ratio + dirty_ratio)/2) percent of available memory.
(( percentage = dirty_r + dirty_br ))

echo $percentage/2 percent of available memory can be used as file cache w/o IO throttling
meminfo=$(cat /proc/meminfo | grep -i 'Available:\|dirty\|writeback:')
echo $meminfo
worst_available=$(echo $meminfo | awk '{print $2}')
worst_cache=0

while true; do
    meminfo=$(cat /proc/meminfo | grep -i 'Available:\|dirty\|writeback:')
    #echo $meminfo
    available=$(echo $meminfo | awk '{print $2}')
    dirty=$(echo $meminfo | awk '{print $5}')
    writeback=$(echo $meminfo | awk '{print $8}')

    # Dirty + Writeback pages together approximate the file cache
    # that is pending write-out.
    (( cache = writeback + dirty ))
    (( max_cache = percentage * available ))
    (( cachex = cache * 200 ))

    # Track the lowest available memory seen during the run.
    if [ $available -le $worst_available ]
    then
        (( worst_available = available ))
    fi

    # Track the largest file cache seen during the run.
    if [ $cache -ge $worst_cache ]
    then
        (( worst_cache = cache ))
    fi

    # check if cache >= ((0.5*(dirty_r+dirty_br))/100)*available
    if [ $cachex -ge $max_cache ]
    then
        echo file cache '('$cache kB')' is exceeding maximum threshold '('$percentage/2 percent of $available kB')'
        echo $meminfo
    fi
    sleep 0.1
done