NVIDIA Performance Primitives (NPP) Version 10.0

Note: The static NPP libraries depend on a common thread abstraction layer library called cuLIBOS (libculibos.a), which is now distributed as part of the toolkit. Consequently, cuLIBOS must be passed to the linker whenever an application links against the static libraries. To minimize library loading and CUDA runtime startup times, use the static libraries whenever possible.

To improve loading and runtime performance when using dynamic libraries, NPP recently replaced the single nppi library with a full set of nppi sub-libraries. Linking to only the sub-libraries that contain the functions your application uses can significantly improve load time and runtime startup performance. Some nppi functions call other nppi and/or npps functions internally, so you may need to link to a few extra libraries depending on which functions your application calls. The nppi sub-libraries are split into sections corresponding to the way the nppi header files are split. The sub-libraries are as follows:

* nppial: arithmetic and logical operation functions in nppi_arithmetic_and_logical_operations.h
* nppicc: color conversion and sampling functions in nppi_color_conversion.h
* nppicom: JPEG compression and decompression functions in nppi_compression_functions.h
* nppidei: data exchange and initialization functions in nppi_data_exchange_and_initialization.h
* nppif: filtering and computer vision functions in nppi_filter_functions.h
* nppig: geometry transformation functions in nppi_geometry_transforms.h
* nppim: morphological operation functions in nppi_morphological_operations.h
* nppist: statistics and linear transform functions in nppi_statistics_functions.h and nppi_linear_transforms.h
* nppisu: memory support functions in nppi_support_functions.h
* nppitc: threshold and compare operation functions in nppi_threshold_and_compare_operations.h
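Because each sub-library corresponds to one or two nppi headers, the headers an application includes are a reasonable first guide to which sub-libraries it must link. The following is a minimal sketch of that mapping as a C lookup table. The helper nppi_sublib_for_header() is hypothetical and not part of NPP; the table rows are copied from the list above.

```c
#include <stddef.h>
#include <string.h>

/* One row per nppi header, taken from the sub-library list above. */
struct nppi_sublib {
    const char *header;   /* nppi header file name */
    const char *library;  /* sub-library that holds its functions */
};

static const struct nppi_sublib NPPI_SUBLIBS[] = {
    {"nppi_arithmetic_and_logical_operations.h", "nppial"},
    {"nppi_color_conversion.h", "nppicc"},
    {"nppi_compression_functions.h", "nppicom"},
    {"nppi_data_exchange_and_initialization.h", "nppidei"},
    {"nppi_filter_functions.h", "nppif"},
    {"nppi_geometry_transforms.h", "nppig"},
    {"nppi_morphological_operations.h", "nppim"},
    {"nppi_statistics_functions.h", "nppist"},
    {"nppi_linear_transforms.h", "nppist"},
    {"nppi_support_functions.h", "nppisu"},
    {"nppi_threshold_and_compare_operations.h", "nppitc"},
};

/* Hypothetical helper (not an NPP function): returns the sub-library name
 * for an nppi header, or NULL if the header is not in the table. */
const char *nppi_sublib_for_header(const char *header)
{
    size_t n = sizeof NPPI_SUBLIBS / sizeof NPPI_SUBLIBS[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(NPPI_SUBLIBS[i].header, header) == 0)
            return NPPI_SUBLIBS[i].library;
    }
    return NULL;
}
```

For instance, an application that includes only nppi_filter_functions.h would start from nppif. Since some nppi functions call into other sub-libraries internally, the result is a starting point rather than a complete link line.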

For example, on Linux, to compile a small application foo using NPP against the dynamic library, the following command can be used:

* nvcc foo.c -lnppi -o foo

Whereas to compile against the static NPP library, the following command has to be used:

* nvcc foo.c -lnppi_static -lculibos -o foo

It is also possible to use the native host C++ compiler. Depending on the host operating system, some additional libraries such as pthread or dl might be needed on the link line. The following command is suggested on Linux:

* g++ foo.c -lnppi_static -lculibos -lcudart_static -lpthread -ldl -I <cuda-toolkit-path>/include -L <cuda-toolkit-path>/lib64 -o foo

NPP is a stateless API. As of NPP 6.5, the ONLY state that NPP remembers between function calls is the current stream ID, i.e. the stream ID that was set in the most recent nppSetStream() call, and a few bits of device-specific information about that stream. The default stream ID is 0. If an application intends to use NPP with multiple streams, it is the application's responsibility to call nppSetStream() whenever it wishes to change stream IDs. For best performance, the application should first call nppGetStream() and only call nppSetStream() if the stream ID needs to change. nppSetStream() will internally call cudaStreamSynchronize() if necessary before changing stream IDs.

Several NPP functions may call other NPP functions internally to complete their functionality. For this reason, it is recommended that cudaDeviceSynchronize() (or at least cudaStreamSynchronize()) be called before making an nppSetStream() call to change to a new stream ID. This ensures that any internal function calls that have not yet occurred are completed using the current stream ID before it changes to a new ID. Calling cudaDeviceSynchronize() frequently can severely degrade performance, so minimizing the frequency of these calls is critical. It is not necessary to call cudaDeviceSynchronize() for stream management while the same stream ID is used for multiple NPP calls.

All NPP functions should be thread safe except for the following functions:
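Returning to the stream management advice above, the pattern of checking nppGetStream() first, synchronizing, and only then calling nppSetStream() can be sketched as a small helper. This is an illustration under stated assumptions, not NPP's own code: switch_npp_stream() is a hypothetical name, and unless USE_REAL_NPP is defined the NPP and CUDA calls are replaced by local stand-ins so that the sketch builds without the CUDA toolkit.

```c
#ifdef USE_REAL_NPP
#include <cuda_runtime.h>
#include <npp.h>
#else
/* Stand-ins so the sketch builds without the CUDA toolkit. They only mimic
 * the "current stream" state described above and count calls; they are NOT
 * the real NPP or CUDA implementations. */
typedef void *cudaStream_t;
typedef int cudaError_t;
typedef int NppStatus;
#define cudaSuccess 0
#define NPP_SUCCESS 0
static cudaStream_t g_current_stream = 0;  /* default stream ID is 0 */
static int g_set_stream_calls = 0;
static int g_sync_calls = 0;
static cudaStream_t nppGetStream(void) { return g_current_stream; }
static NppStatus nppSetStream(cudaStream_t s)
{
    g_current_stream = s;
    ++g_set_stream_calls;
    return NPP_SUCCESS;
}
static cudaError_t cudaStreamSynchronize(cudaStream_t s)
{
    (void)s;
    ++g_sync_calls;
    return cudaSuccess;
}
#endif

/* Hypothetical helper (not an NPP function): make `wanted` the current NPP
 * stream. Returns 0 on success and -1 on failure. */
int switch_npp_stream(cudaStream_t wanted)
{
    cudaStream_t current = nppGetStream();

    /* Same stream: no nppSetStream() call and no synchronization needed. */
    if (current == wanted)
        return 0;

    /* Let NPP work already queued on the current stream finish first. */
    if (cudaStreamSynchronize(current) != cudaSuccess)
        return -1;

    return nppSetStream(wanted) == NPP_SUCCESS ? 0 : -1;
}
```

The helper uses cudaStreamSynchronize(), the minimum the text recommends; an application that wants the stronger guarantee could call cudaDeviceSynchronize() at that point instead, at the performance cost noted above. Repeated calls with the same stream return early, so they trigger neither a synchronization nor an nppSetStream() call.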

What is NPP?

NVIDIA NPP is a library of functions for performing CUDA-accelerated processing. The initial set of functionality in the library focuses on imaging and video processing and is widely applicable for developers in these areas. NPP will evolve over time to encompass more of the compute-heavy tasks in a variety of problem domains. The NPP library is written to maximize flexibility while maintaining high performance.

NPP can be used in one of two ways:

Either route allows developers to harness the massive compute resources of NVIDIA GPUs, while simultaneously reducing development times.

Documentation

Technical Specifications

Supported Platforms:
    - Microsoft Windows 7, 8, and 10 (64-bit and 32-bit)
    - Microsoft Windows Vista (64-bit and 32-bit)
    - Linux (CentOS, Ubuntu, Red Hat, and several others) (64-bit and 32-bit)
    - Mac OS X (64-bit)
    - Android on Arm (32-bit and 64-bit)

Files

NPP comprises the following files:

Header Files

The NPP header files are located in the CUDA Toolkit's /include/ directory.

Library Files

Starting with version 5.5, NPP's functionality is split into 3 distinct library groups:

On the Windows platform the NPP stub libraries are found in the CUDA Toolkit's library directory:

/lib/nppc.lib 
/lib/nppial.lib 
/lib/nppicc.lib 
/lib/nppicom.lib 
/lib/nppidei.lib 
/lib/nppif.lib 
/lib/nppig.lib 
/lib/nppim.lib 
/lib/nppist.lib 
/lib/nppisu.lib 
/lib/nppitc.lib 
/lib/npps.lib 

The matching DLLs are located in the CUDA Toolkit's binary directory. For example:

/bin/nppial64_100_<build_no>.dll      // Dynamic image-processing library for 64-bit Windows. 

On Linux and Mac platforms, the dynamic libraries are located in the lib directory. For example:

/lib/libnppc.so.10.0.<build_no>   // NPP dynamic core library for Linux

/lib/libnpps.10.0.dylib  // NPP dynamic signal processing library for Mac   

Supported NVIDIA Hardware

NPP runs on all CUDA-capable NVIDIA hardware. For details, please see http://www.nvidia.com/object/cuda_learn_products.html