Environment Variables for NVIDIA NeMo Retriever Reranking NIM

Use this documentation to learn about the environment variables for NVIDIA NeMo Retriever Reranking NIM.

Binary Environment Variables

The following table contains the binary environment variables.

| Name | Default | Description |
| --- | --- | --- |
| LOG_FORMAT | pretty | The log output format. One of: pretty, json, compact. |
| RUST_LOG | info | The log filter, in tracing EnvFilter syntax. |
| SHOW_CONFIG | false | Set to true to print environment variable help and exit. |

Server Environment Variables

The following table contains the server environment variables.

| Name | Default | Description |
| --- | --- | --- |
| NIM_BIND_ADDR | 0.0.0.0:8000 | The HTTP listen address (host:port). |
| NIM_GRPC_BIND_ADDR | - | The optional KServe V2 gRPC listen address (host:port). Empty disables gRPC. |
| NIM_GRPC_MAX_DECODING_MESSAGE_BYTES | - | The maximum inbound KServe gRPC message size, in bytes. Empty uses the effective HTTP body limit. |
| NIM_MAX_QUEUE_SIZE | 1024 | The batcher request queue depth. |
| NIM_REQUEST_TIMEOUT_S | 120 | The timeout, in seconds, for a single rerank request. |
| NIM_TLS_CERT_PATH | - | The path to the PEM certificate chain for HTTPS. Empty means no TLS. |
| NIM_TLS_KEY_PATH | - | The path to the PEM private key for HTTPS. Empty means no TLS. |
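
As an illustration of how the server variables above fit together, the following Python sketch reads them from a plain dictionary and applies the documented defaults. It is not the NIM's own configuration code. The helper names split_bind_addr and read_server_settings are hypothetical, and the rule that TLS is active only when both paths are set is an assumption; the table states only that an empty value means no TLS.

```python
def split_bind_addr(addr: str) -> tuple[str, int]:
    """Split a host:port string on the last colon and validate the port."""
    host, sep, port_text = addr.rpartition(":")
    if not sep or not host or not (port_text.isascii() and port_text.isdigit()):
        raise ValueError(f"expected host:port, got {addr!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


def read_server_settings(env: dict[str, str]) -> dict:
    """Collect the server settings, falling back to the documented defaults."""
    grpc_addr = env.get("NIM_GRPC_BIND_ADDR", "")
    cert = env.get("NIM_TLS_CERT_PATH", "")
    key = env.get("NIM_TLS_KEY_PATH", "")
    return {
        "http": split_bind_addr(env.get("NIM_BIND_ADDR", "0.0.0.0:8000")),
        # An empty or unset value disables gRPC, as documented.
        "grpc": split_bind_addr(grpc_addr) if grpc_addr else None,
        "request_timeout_s": int(env.get("NIM_REQUEST_TIMEOUT_S", "120")),
        "max_queue_size": int(env.get("NIM_MAX_QUEUE_SIZE", "1024")),
        # Assumption: TLS requires both the certificate chain and the key.
        "tls": bool(cert and key),
    }
```

To inspect the settings of the current process, call read_server_settings(dict(os.environ)) after importing os.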

Pipeline Environment Variables

The following table contains the pipeline environment variables.

| Name | Default | Description |
| --- | --- | --- |
| NIM_ADMISSION_SIZE | 0 | The override for the admission semaphore size, in passages. 0 auto-derives the size as 2 × engine_count × max_batch_size. |
| NIM_ENGINE_COUNT | 1 | The number of parallel inference engines. Ignored when NIM_ENGINE_DEVICES is set. For details, refer to Engine Count. |
| NIM_ENGINE_DEVICES | - | The comma-separated CUDA device ordinals for explicit engine placement (for example, NIM_ENGINE_DEVICES=0,1,2,3). When set, one engine is started per listed device and NIM_ENGINE_COUNT is ignored. Duplicate ordinals are rejected. When unset, NIM_ENGINE_COUNT engines run on NIM_CUDA_DEVICE. |
| NIM_MAX_BATCH_SIZE | 32 | The maximum number of passages per engine forward pass. |
| NIM_MAX_CHUNK_SIZE | 0 | The override for the chunk size on the chunked admission path, in passages per chunk. 0 auto-derives the size from max_batch_size. |
| NIM_MAX_SEQ_LEN | 512 | The maximum sequence length for tokenized (query, passage) pairs. |
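
The interaction between NIM_ENGINE_DEVICES, NIM_ENGINE_COUNT, and the auto-derived admission size can be sketched in Python as follows. This is a minimal illustration of the rules stated above, not the NIM's implementation. The function names plan_engines and admission_size are hypothetical, and rejecting negative ordinals is an assumption.

```python
def plan_engines(env: dict[str, str]) -> list[int]:
    """Return one CUDA device ordinal per engine that would be started."""
    raw = env.get("NIM_ENGINE_DEVICES", "").strip()
    if raw:
        # Explicit placement: one engine per listed device.
        # NIM_ENGINE_COUNT is ignored in this branch.
        ordinals = [int(part) for part in raw.split(",")]
        if any(o < 0 for o in ordinals):
            raise ValueError("device ordinals must be non-negative")  # assumption
        if len(set(ordinals)) != len(ordinals):
            raise ValueError(f"duplicate device ordinals in {raw!r}")
        return ordinals
    # Unset: NIM_ENGINE_COUNT engines, all on NIM_CUDA_DEVICE.
    count = int(env.get("NIM_ENGINE_COUNT", "1"))
    device = int(env.get("NIM_CUDA_DEVICE", "0"))
    return [device] * count


def admission_size(env: dict[str, str], engine_count: int) -> int:
    """Return the admission semaphore size, in passages."""
    override = int(env.get("NIM_ADMISSION_SIZE", "0"))
    if override > 0:
        return override
    # 0 means auto-derive: 2 x engine_count x max_batch_size.
    max_batch_size = int(env.get("NIM_MAX_BATCH_SIZE", "32"))
    return 2 * engine_count * max_batch_size
```

With the documented defaults (one engine, a batch size of 32), the auto-derived admission size is 2 × 1 × 32 = 64 passages. With NIM_ENGINE_DEVICES=0,1,2,3 and the default batch size, it is 2 × 4 × 32 = 256 passages.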

Engine Environment Variables

The following table contains the engine environment variables.

| Name | Default | Description |
| --- | --- | --- |
| HF_TOKEN | - | The Hugging Face token for model download. Passed through to the model downloader. For details, refer to Get Started With NVIDIA NeMo Retriever Reranking NIM. |
| NGC_API_KEY | - | The NGC API key for model download when NIM_MODEL_DOWNLOAD_PROVIDER=ngc. Passed through to the model downloader. For details, refer to Get Started With NVIDIA NeMo Retriever Reranking NIM. |
| NIM_CUDA_DEVICE | 0 | The GPU device ID (0-indexed). |
| NIM_MODEL_DOWNLOAD_PROVIDER | hf | The model download provider. Use hf for Hugging Face; use ngc for the NVIDIA NGC Catalog. |
| NIM_MODEL_NAME | nvidia/llama-nemotron-rerank-vl-1b-v2 | The model name returned in API responses. |
| NIM_MODEL_PATH | /model/rerank | The in-container path for staged model artifacts. The directory must contain artifacts for a supported model. For details, refer to Custom Model Artifact Support in NVIDIA NeMo Retriever Reranking NIM. |
| NIM_PRECISION | fp16 | The LLM trunk precision. One of: fp16 (default), fp8, fp8-e4m3. |
| NIM_PRECOMPILE_ONLY | - | When set, compiles all CUDA artifacts and then exits with status 0 (passthrough). |
| NIM_SERVED_MODEL_NAME | - | The optional served API model alias. If both NIM_MODEL_NAME and NIM_SERVED_MODEL_NAME are set, NIM_SERVED_MODEL_NAME currently takes precedence. |
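
The following Python sketch shows one way to reason about the engine variables above before starting a container: which credential variable goes with each download provider, which precision values are accepted, and which model name ends up in API responses. It is an illustration only. The function resolve_engine_settings is hypothetical, and pairing the hf provider with HF_TOKEN is an assumption inferred from the variable descriptions.

```python
SUPPORTED_PRECISIONS = ("fp16", "fp8", "fp8-e4m3")

# Assumption: hf downloads use HF_TOKEN. The table confirms only that
# NGC_API_KEY is used when NIM_MODEL_DOWNLOAD_PROVIDER=ngc.
CREDENTIAL_FOR_PROVIDER = {"hf": "HF_TOKEN", "ngc": "NGC_API_KEY"}


def resolve_engine_settings(env: dict[str, str]) -> dict:
    """Resolve the engine settings, falling back to the documented defaults."""
    provider = env.get("NIM_MODEL_DOWNLOAD_PROVIDER", "hf")
    if provider not in CREDENTIAL_FOR_PROVIDER:
        raise ValueError(f"unknown download provider: {provider!r}")

    precision = env.get("NIM_PRECISION", "fp16")
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")

    model_name = env.get("NIM_MODEL_NAME", "nvidia/llama-nemotron-rerank-vl-1b-v2")
    served_name = env.get("NIM_SERVED_MODEL_NAME", "")
    credential_var = CREDENTIAL_FOR_PROVIDER[provider]
    return {
        "provider": provider,
        "credential_var": credential_var,
        "has_credential": bool(env.get(credential_var)),
        "precision": precision,
        "model_path": env.get("NIM_MODEL_PATH", "/model/rerank"),
        # The served alias currently takes precedence when both are set.
        "api_model_name": served_name or model_name,
    }
```

For example, setting both NIM_MODEL_NAME=base and NIM_SERVED_MODEL_NAME=alias resolves the API model name to alias.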