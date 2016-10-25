Various factors affect the performance of hardware accelerated transcoding on the GPU. Getting the highest performance for your workload requires some tuning. This section provides some tips for measuring and optimizing end-to-end transcode performance.

The NVIDIA Video Codec SDK documentation publishes the performance of the GPU hardware-accelerated encoder and decoder as stand-alone numbers, measured using the high-performance encode and decode applications included in the SDK. Although FFmpeg is highly optimized, its performance is slightly lower than the figures reported in the SDK documentation, mainly because of software overhead and the additional setup and initialization time within the FFmpeg code. To get high transcoding throughput with FFmpeg, it is therefore essential to saturate the hardware encoder and decoder engines so that the initialization overhead of one session is hidden behind the transcoding time of other sessions. This can be achieved by running multiple parallel encode/decode sessions on the hardware (see Section 1:N HWACCEL encode from YUV or RAW Data). In that case, the aggregate transcode performance with FFmpeg closely matches the theoretically expected hardware performance.

Ensure that the inputs contain a large number of frames (more than 15 seconds of video is recommended) so that the initialization time overhead becomes negligible.
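As a rough illustration of the 15-second recommendation above, the following sketch checks a clip's duration before it is used for benchmarking. The helper names `clip_duration` and `long_enough` and the file name `input.mp4` are hypothetical, and the sketch assumes `ffprobe` (shipped with FFmpeg) is on the PATH.

```shell
# Minimum clip length in seconds, taken from the recommendation above.
MIN_SECONDS=15

# Print the container duration of file $1 in seconds (requires ffprobe).
clip_duration() {
    ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed if the duration in $1 (seconds, may be fractional) exceeds MIN_SECONDS.
long_enough() {
    awk -v d="$1" -v min="$MIN_SECONDS" 'BEGIN { exit !(d > min) }'
}

# Usage (input.mp4 is a placeholder):
#   long_enough "$(clip_duration input.mp4)" || echo "clip too short for benchmarking" >&2
```

The duration test is kept separate from the `ffprobe` call so that it can also be applied to durations obtained by other means.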

6.2. Settings for Reduced Initialization Time

To prepare longer videos for streamed distribution, they are typically split into smaller chunks, and each chunk is encoded separately. Such chunk-based encoding avoids error propagation, provides clean boundaries for streaming bandwidth adaptation, and helps parallelize transcoding workloads on the servers. Transcoding small video chunks with GPU hardware acceleration poses a challenge, however, because the initialization time overhead of each FFmpeg process becomes significant.

To minimize the overhead when transcoding M input files into M×N output files (that is, when each of the M inputs is transcoded into N outputs), minimize the number of FFmpeg processes launched (see Section 1:N HWACCEL encode from YUV or RAW Data for example command lines).
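The idea above can be sketched as a small helper that builds one FFmpeg command per input, with one output per target bitrate, instead of one process per output. The helper name `build_1_to_n`, the output naming scheme, and the bitrates are assumptions for illustration. The flags follow the common `h264_nvenc` 1:N pattern and may need adjusting for your FFmpeg build (for example, newer releases deprecate `-vsync` in favor of `-fps_mode`). The helper only prints the command and does not handle file names containing spaces.

```shell
# Print (do not run) one FFmpeg command that transcodes a single input into
# one output per bitrate, so decoding and CUDA setup happen once per input.
# Usage: build_1_to_n INPUT BITRATE [BITRATE...]
build_1_to_n() {
    input=$1; shift
    base=${input%.*}    # strip the file extension for output naming
    cmd="ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i $input"
    for rate in "$@"; do
        cmd="$cmd -c:a copy -c:v h264_nvenc -b:v $rate ${base}_${rate}.mp4"
    done
    printf '%s\n' "$cmd"
}

# Dry run for M = 2 inputs and N = 2 outputs each: 2 processes instead of 4.
for f in a.mp4 b.mp4; do
    build_1_to_n "$f" 5M 8M
done
```

Printing the command rather than executing it makes it easy to inspect the result before launching it on a GPU.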

Additionally, follow these tips to reduce the FFmpeg initialization time overhead:

Set the following environment variables:

export CUDA_VISIBLE_DEVICES=0   # Use the ID of the GPU device you plan to use for transcoding
export CUDA_DEVICE_MAX_CONNECTIONS=2

Use FFmpeg command lines such as those in Sections 1:N HWACCEL Transcode with Scaling and 1:N HWACCEL encode from YUV or RAW Data. These command lines share the CUDA context across multiple transcode sessions, thereby reducing the CUDA context initialization time overhead significantly.
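If exporting the variables for the whole shell session is not desirable, the two tips above can be combined in a small wrapper that sets them only for the launched process. The function name `with_gpu` is hypothetical; the variable names and the value 2 are taken from the list above.

```shell
# Run a command with the CUDA variables set only for that process,
# leaving the calling shell's environment untouched.
# Usage: with_gpu GPU_ID COMMAND [ARGS...]
with_gpu() {
    gpu=$1; shift
    env CUDA_VISIBLE_DEVICES="$gpu" CUDA_DEVICE_MAX_CONNECTIONS=2 "$@"
}

# Usage (the FFmpeg arguments are placeholders for a 1:N command line):
#   with_gpu 0 ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 ...
```

Scoping the variables per process makes it straightforward to direct different FFmpeg processes to different GPUs from the same script.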

