Advanced Usage#
This section provides a detailed breakdown of the inference script for more advanced users.
The Background Noise Removal NIM exposes gRPC endpoints. Start by importing the compiled gRPC protos used to invoke the NIM.
import os
import sys
import grpc
sys.path.append(os.path.join(os.getcwd(), "../interfaces/bnr"))
# Importing gRPC compiler auto-generated BNR library
import bnr_pb2, bnr_pb2_grpc # noqa: E402
The NIM invocation uses bidirectional gRPC streaming. To produce the request data stream, define a Python generator function: a simple function that yields a chunk each time it is iterated, so the client can stream the chunks as they are produced.
from typing import Iterator

def generate_request_for_inference(input_filepath: os.PathLike) -> Iterator[bnr_pb2.EnhanceAudioRequest]:
    """Generator to produce the request data stream.

    Args:
        input_filepath: Path to input file
    """
    DATA_CHUNKS = 64 * 1024  # bytes; we send the wav file in 64-KB chunks
    with open(input_filepath, "rb") as fd:
        while True:
            buffer = fd.read(DATA_CHUNKS)
            if buffer == b"":
                break
            yield bnr_pb2.EnhanceAudioRequest(audio_stream_data=buffer)
Note
For NIM in streaming mode, the audio_stream_data in a request should be PCM 32-bit float audio data, with each data chunk containing 10 milliseconds of audio for any BNR model.
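As a rough illustration of what a 10-millisecond chunk amounts to, the sketch below computes the chunk size and slices a float32 buffer accordingly. The 48 kHz mono stream and the use of NumPy are assumptions for the example, not requirements of the NIM.

import numpy as np

SAMPLE_RATE_HZ = 48_000    # assumed sample rate of the input audio
CHUNK_DURATION_S = 0.010   # 10 milliseconds per chunk in streaming mode
BYTES_PER_SAMPLE = 4       # PCM 32-bit float

samples_per_chunk = int(SAMPLE_RATE_HZ * CHUNK_DURATION_S)  # 480 samples
bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE      # 1920 bytes

def chunk_float32_audio(audio: np.ndarray):
    """Yield 10-ms chunks from a mono float32 buffer (illustrative only)."""
    for start in range(0, len(audio), samples_per_chunk):
        yield audio[start:start + samples_per_chunk].tobytes()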
Before invoking the NIM, define a function that handles the incoming stream and writes it to an output file.
from typing import Iterator

def write_output_file_from_response(
    response_iter: Iterator[bnr_pb2.EnhanceAudioResponse],
    output_filepath: os.PathLike,
) -> None:
    """Function to write the output file from the incoming gRPC data stream.

    Args:
        response_iter: Responses from the server to write into output file
        output_filepath: Path to output file
    """
    with open(output_filepath, "wb") as fd:
        for response in response_iter:
            if response.HasField("audio_stream_data"):
                fd.write(response.audio_stream_data)
Note
For NIM in streaming mode, the output audio_stream_data in the response is PCM 32-bit float audio data of the same length as the audio data in the request.
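If you consume the streaming output directly rather than writing it straight to disk, you may want to wrap the raw float32 samples in a WAV container yourself. The following is a minimal sketch assuming a 48 kHz mono stream and the third-party numpy and soundfile packages; none of these are required by the NIM itself.

import numpy as np
import soundfile as sf

def save_float32_pcm_as_wav(raw_pcm: bytes, wav_path: str, sample_rate: int = 48_000) -> None:
    """Wrap raw PCM 32-bit float samples in a WAV container (illustrative only)."""
    samples = np.frombuffer(raw_pcm, dtype=np.float32)
    sf.write(wav_path, samples, sample_rate, subtype="FLOAT")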
Now that the request generator and the response handler are set up, connect to the NIM and invoke it.
The input file path is stored in the variable input_filepath, and the output file is written to the location specified in the variable output_filepath.
Wait for the message confirming that the function invocation has completed before checking the output file.
Fill in the correct host and port for your target in the following code snippet:
import time

input_filepath = "../assets/bnr_48k_input.wav"
output_filepath = "bnr_48k_output.wav"

# For connecting to a NIM without SSL, open an `insecure_channel(...)`.
# For connecting to a NIM with TLS/mTLS, open a `secure_channel(...)`
# with the required root certificate, client private key, and certificate.
with grpc.insecure_channel(target="localhost:8001") as channel:
    try:
        stub = bnr_pb2_grpc.MaxineBNRStub(channel)
        start_time = time.time()
        responses = stub.EnhanceAudio(
            generate_request_for_inference(input_filepath=input_filepath),
            metadata=None,
        )
        write_output_file_from_response(response_iter=responses, output_filepath=output_filepath)
        end_time = time.time()
        print(
            f"Function invocation completed in {end_time - start_time:.2f}s. The output file is generated."
        )
    except Exception as e:
        print(e)
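For a TLS/mTLS-protected deployment, the insecure channel above is replaced with a secure one, as mentioned in the comments. The snippet below is a minimal sketch; the certificate file names and the server address are placeholders, not values shipped with the NIM.

# Load the credentials referenced in the comments above (file names are examples).
with open("ca.crt", "rb") as f:
    root_certificates = f.read()
with open("client.key", "rb") as f:
    private_key = f.read()
with open("client.crt", "rb") as f:
    certificate_chain = f.read()

credentials = grpc.ssl_channel_credentials(
    root_certificates=root_certificates,
    private_key=private_key,              # omit for TLS without client authentication
    certificate_chain=certificate_chain,  # omit for TLS without client authentication
)

with grpc.secure_channel(target="localhost:8001", credentials=credentials) as channel:
    stub = bnr_pb2_grpc.MaxineBNRStub(channel)
    # Invoke stub.EnhanceAudio(...) exactly as in the insecure example above.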
Compile the Protos (Optional)#
The NVIDIA Maxine NIM Clients package ships with pre-compiled protos. To compile the protos locally instead, install the required dependencies first.
With the compilation script, you can generate compiled protos that match your programming language (such as Python or C++) and version requirements, ensuring compatibility between your client applications and the compiled protos in custom implementations.
Linux#
To compile protos on Linux, run:
# Go to bnr/protos folder
cd bnr/protos
chmod +x compile_protos.sh
./compile_protos.sh
Windows#
To compile protos on Windows, run:
# Go to bnr\protos folder
cd bnr\protos
compile_protos.bat
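On either platform, if you prefer to compile the Python protos without the platform scripts, the grpcio-tools package can be invoked directly. The sketch below is illustrative only: the proto file name bnr.proto and the output directory are assumptions; check the bnr/protos folder for the actual file names.

# pip install grpcio-tools
from grpc_tools import protoc

# Compile the BNR proto into bnr_pb2.py and bnr_pb2_grpc.py (proto file name assumed).
exit_code = protoc.main([
    "grpc_tools.protoc",
    "-I.",                 # directory containing the .proto file
    "--python_out=.",      # generated message classes
    "--grpc_python_out=.", # generated gRPC stubs
    "bnr.proto",
])
if exit_code != 0:
    raise RuntimeError(f"protoc failed with exit code {exit_code}")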
Model Caching#
When the container starts for the first time, it downloads the required models from NGC. To avoid downloading the models on subsequent runs, you can cache them locally by using a cache directory:
# Create the cache directory on the host machine
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"
# Run the container with the cache directory mounted in the appropriate location
docker run -it --rm --name=bnr \
--runtime=nvidia \
--gpus all \
--shm-size=8GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MODEL_PROFILE=<nim_model_profile> \
-e MAXINE_MAX_CONCURRENCY_PER_GPU=1 \
-p 8000:8000 \
-p 8001:8001 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
nvcr.io/nim/nvidia/maxine-bnr:latest
Ensure that the nim_model_profile is compatible with your GPU. For more information about nim_model_profile, refer to the NIM Model Profile Table.
Multiple Concurrent Inputs#
To run the server in multi-input concurrent mode, set the environment variable MAXINE_MAX_CONCURRENCY_PER_GPU to an integer greater than 1 in the server container. The server then accepts as many concurrent inputs per GPU as specified by MAXINE_MAX_CONCURRENCY_PER_GPU.
Because Triton distributes the workload equally across all GPUs, if the number of available GPUs is NUM_GPUS, the total number of concurrent inputs supported by the server is NUM_GPUS * MAXINE_MAX_CONCURRENCY_PER_GPU.
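On the client side, concurrent inputs simply mean multiple EnhanceAudio calls in flight at once. The sketch below fans several files out over a thread pool; the file list and the two-worker count are illustrative, and it assumes the imports and helper functions from the invocation example above are already in scope.

from concurrent.futures import ThreadPoolExecutor

def enhance_one_file(input_path: str, output_path: str) -> str:
    """Run one EnhanceAudio call; each worker opens its own channel."""
    with grpc.insecure_channel(target="localhost:8001") as channel:
        stub = bnr_pb2_grpc.MaxineBNRStub(channel)
        responses = stub.EnhanceAudio(generate_request_for_inference(input_filepath=input_path))
        write_output_file_from_response(response_iter=responses, output_filepath=output_path)
    return output_path

# Example input/output pairs; replace with your own files.
jobs = [
    ("../assets/bnr_48k_input.wav", "bnr_48k_output_0.wav"),
    ("../assets/bnr_48k_input.wav", "bnr_48k_output_1.wav"),
]

# Keep the worker count at or below NUM_GPUS * MAXINE_MAX_CONCURRENCY_PER_GPU.
with ThreadPoolExecutor(max_workers=2) as pool:
    for done in pool.map(lambda job: enhance_one_file(*job), jobs):
        print(f"Finished: {done}")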
Note
The number of concurrent streams supported by the NIM varies with the amount of resources available on the system, including CPUs, system memory, and GPU VRAM. If the system does not have enough resources to load the server, the server launch fails with error logs similar to the following:
OpenBLAS blas_thread_init: pthread_create failed for thread 14 of 16: Resource temporarily unavailable
If such an error is encountered, reduce the number of streams (MAXINE_MAX_CONCURRENCY_PER_GPU) when launching the container.