Is this page helpful?

Environment Variables for NVIDIA NIM for Object Detection#

Use this documentation to learn about the environment variables for NVIDIA NIM for Object Detection.

Environment Variables#

Note

The following NIMs do not support NIM_SERVED_MODEL_NAME:

nemotron-graphic-elements-v1
nemoretriever-page-elements-v2
nemotron-page-elements-v3
nemotron-table-structure-v1
nemotron-ocr-v1

The following table identifies the environment variables that are used in the container. Set environment variables with the -e command-line argument to the docker run command.

Name	Description	Default Value
`NGC_API_KEY`	Set this variable to the value of your personal NGC API key.	None
`NIM_CACHE_PATH`	Specifies the fully qualified path, in the container, for downloaded models.	`/opt/nim/.cache`
`NIM_HTTP_API_PORT`	Specifies the network port number, in the container, for HTTP access to the microservice. Refer to Publishing ports in the Docker documentation for more information about host and container network ports.	`8000`
`NIM_HTTP_MAX_WORKERS`	Specifies the number of worker threads to start for HTTP requests.	`1`
`NIM_HTTP_TRITON_PORT`	Specifies the network port number, in the container, for NVIDIA Triton Inference Server.	`8080`
`NIM_IGNORE_MODEL_DOWNLOAD_FAIL`	When set to `true` and the microservice fails to download a model from NGC, the microservice continues to run rather than exit. This environment variable can be useful in an air-gapped environment.	`false`
`NIM_LOGGING_JSONL`	When set to `true`, the microservice creates log records in the JSONL format.	`false`
`NIM_LOG_LEVEL`	Specifies the logging level. The microservice supports the following values: DEBUG, INFO, WARNING, ERROR, and CRITICAL.	`INFO`
`NIM_MANIFEST_ALLOW_UNSAFE`	Set to `1` to enable selection of a model profile that is not included in the original `model_manifest.yaml` or a profile that is not detected to be compatible with the deployed hardware.	`0`
`NIM_MANIFEST_PATH`	Specifies the fully qualified path, in the container, for the model manifest YAML file.	`/opt/nim/etc/default/model_manifest.yaml`
`NIM_MODEL_PROFILE`	Specifies the model profile ID to use with the container. By default, the container attempts to automatically match the host GPU model and GPU count with the optimal model profile.	None
`NIM_REPOSITORY_OVERRIDE`	If set to a non-empty string, the `NIM_REPOSITORY_OVERRIDE` value replaces the hard-coded location of the repository and the protocol for access to the repository. The structure of the value for this environment variable is as follows: `<repository type>://<repository location>`. Only the protocols `ngc://`, `s3://`, and `https://` are supported, and only the first component of the URI is replaced. For example: - If the URI in the manifest is `ngc://org/meta/llama3-8b-instruct:hf?file=config.json` and `NIM_REPOSITORY_OVERRIDE=ngc://myrepo.ai/`, the domain name for the API endpoint is set to `myrepo.ai`. - If `NIM_REPOSITORY_OVERRIDE=s3://mybucket/`, the result of the replacement will be `s3://mybucket/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json`. - If `NIM_REPOSITORY_OVERRIDE=https://mymodel.ai/some_path`, the result of the replacement will be `https://mymodel.ai/some_path/nim%2Fmeta%2Fllama3-8b-instruct%3Ahf%3Ffile%3Dconfig.json`. This repository override feature supports basic authentication mechanisms: - `https` assumes authorization using the Authorization header and the credential value in `NIM_HTTPS_CREDENTIAL`. - `ngc` requires a credential in the `NGC_API_KEY` environment variable. - `s3` requires the environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and (if using temporary credentials) `AWS_SESSION_TOKEN`.	None
`NIM_SERVED_MODEL_NAME`	Specifies the model names used in the API. Specify multiple names in a comma-separated list. If you specify multiple names, the server responds to any of the names. The name in the model field of a response matches the name in the request. By default, the model name is inferred from the `/opt/nim/etc/default/model_manifest.yaml` file. For Prometheus metrics, this value is used in the `model_name` label. If more than one value is specified, the first one is used for the label.	None
`NIM_TELEMETRY_MODE`	Controls telemetry collection. Set to `0` to disable telemetry (default), set to `1` to enable telemetry. Telemetry helps NVIDIA improve performance, compatibility, and reliability while maintaining strict privacy protections. For more information, refer to NVIDIA’s Privacy Policy.	`0`
`NIM_TELEMETRY_ENABLE_ON_RTX`	Set to `"true"` to allow telemetry collection on RTX GPUs. Takes effect only if `NIM_TELEMETRY_MODE` is not `"0"`.	`false`
`NIM_TELEMETRY_INTERVAL_MINUTES`	Interval in minutes for periodic telemetry collection.	`30`
`NIM_TRITON_CUDA_MEMORY_POOL_MB`	For the NVIDIA Triton Inference Server, specify the byte size for the CUDA memory pool for all GPUs visible to the container.	By default, Image Retriever NIMs automatically set the CUDA memory pool based on the maximum input data size for the loaded TensorRT engine. However, you might want to increase the CUDA memory pool size when you enable dynamic batching or run highly concurrent workloads. A typical error message that indicates that you should increase the CUDA memory pool is `RuntimeError: CUDA error: invalid argument`. When you run NIMs with `NIM_TRITON_EXTRA_ARGS`, any arguments to the underlying `tritonserver` command are overridden. If you specify `NIM_TRITON_EXTRA_ARGS`, you must include the options that are automatically configured for the underlying `tritonserver` command. For an up-to-date reference for what these arguments are, run the NIM without `NIM_TRITON_EXTRA_ARGS`, inspect the arguments used for the `tritonserver` process inside the container, and then run the NIM with `NIM_TRITON_EXTRA_ARGS` and include the extra arguments.
`NIM_TRITON_DATA_MAX_BATCH_SIZE`	The maximum tensor size used for pre-processing and inference. Increase this if pre-processing takes too long. Decrease this if worker VRAM consumption is too high.	`32` for object detection and `16` for OCR.
`NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS`	For the NVIDIA Triton Inference Server, sets the max queue delayed time to allow other requests to join the dynamic batch. It takes effect only when `NIM_TRITON_PERFORMANCE_MODE` is set to `throughput`. For more information, refer to the Triton User Guide.	`100us (microseconds)`
`NIM_TRITON_DYNAMIC_BATCHING_ENABLED`	True to enable the pipeline’s dynamic batching. We recommended that you only set this to False for debugging.	`True`
`NIM_TRITON_ENABLE_ASYNC_MODEL_EXECUTION`	True to enable concurrent model inference when the model’s max batch size is configured to be less than the data max batch size. If you enable this on a single-GPU deployment, you can experience GPU resource contention.	`False`
`NIM_TRITON_ENABLE_MODEL_CONTROL`	By default, Image Retriever NIMs start Triton with `--model-control-mode=none`. By setting this option to true, Triton starts with `--model-control-mode=explicit`. In explicit mode, models are not loaded as part of Triton startup, but instead by making an API call after the server has started. The API call is handled automatically by the NIM. When Triton raises an exception, Image Retriever NIMs search the error message for CUDA failures and reload the models to restore the NIM to a working state. This option is intended to improve stability of the NIM and to remove the need to restart the entire container when CUDA errors occur. For more information, refer to Model Management.	None
`NIM_TRITON_ENABLE_PIPELINE_TIMING`	True to enable detailed pipeline timing instrumentation. Adds ‘cuda.synchronize()’ calls for accurate GPU timing, which can degrade performance.	`False`
`NIM_TRITON_FLUSH_INTERVAL`	This option determines after how many requests the `NIM_TRITON_IDLE_BYTES_LIMIT` is checked.	1 (After every request)
`NIM_TRITON_GPU_DECODING_BATCH_THRESHOLD`	The batch size threshold for switching between CPU and GPU decoding.	`2`
`NIM_TRITON_GRPC_PORT`	Specifies the gRPC port number, for NVIDIA Triton Inference Server.	`8001`
`NIM_TRITON_IDLE_BYTES_LIMIT`	The threshold for idle VRAM memory (bytes) after which the Torch CUDA cache is emptied and all inter-process communication (IPC) files are closed.	1GB
`NIM_TRITON_LOG_VERBOSE`	When set to `1`, the container starts NVIDIA Triton Inference Server with verbose logging.	`0`
`NIM_TRITON_MAX_BATCH_SIZE`	Specify the maximum batch size that the underlying Triton instance can process. The value must be less than or equal to maximum batch size that was used to compile the engine. By default, the NIM uses the maximum possible batch size for a given model and GPU. To decrease the memory footprint of the server, choose a smaller maximum batch size. If the model uses the `tensorrt` backend, the value must exactly match a batch size in one of the engine’s profiles. Only discrete values are supported. For valid values, and their estimated memory footprint, refer to the support matrix page.	None
`NIM_TRITON_MAX_QUEUE_SIZE`	Sets the max queue size for the underlying Triton instance. For more information, refer to the Triton User Guide. Triton returns an InferenceServerException on new requests if you exceed the max queue size.	None
`NIM_TRITON_MAX_QUEUE_DELAY_MICROSECONDS`	The max queue delay for the BLS pipeline (Triton dynamic batching). Controls how many requests Triton aggregates before it calls the pipeline. Increase this value to improve throughput at the cost of increased latency. This is an alias for ‘NIM_TRITON_PIPELINE_MAX_QUEUE_DELAY_MICROSECONDS’ for backwards compatibility.	`10000`
`NIM_TRITON_MODEL_MAX_BATCH_SIZE`	The max batch size for the underlying model. For TensorRT, this value is validated against the engine profiles and is used to inform which TensorRT profile to load. This value is also used for inference chunking with the pipeline.	`32` for object detection and `16` for OCR.
`NIM_TRITON_MODEL_MAX_QUEUE_DELAY_MICROSECONDS`	The max queue delay for the underlying model’s dynamic batching. Controls how long Triton waits to batch inference calls to the model from the pipeline. For most scenarios, this should be kept at `0`.	`0`
`NIM_TRITON_MODEL_INSTANCE_COUNT`	The number of model instances to deploy.	`1`
`NIM_TRITON_PIPELINE_MAX_BATCH_SIZE`	The max batch size for the BLS pipeline (Triton dynamic batching). Controls how many requests Triton aggregates before it calls the pipeline.	`8192`
`NIM_TRITON_PIPELINE_MAX_QUEUE_DELAY_MICROSECONDS`	Max queue delay for the BLS pipeline (Triton dynamic batching). Controls how many requests Triton aggregates before calling the pipeline. Increase this value to improve throughput at the cost of increased latency.	‘10000’
`NIM_TRITON_PIPELINE_TIMING_INTERVAL`	The number of batches between pipeline timing summary prints.	`1`
`NIM_TRITON_PINNED_MEMORY_POOL_MB`	For the NVIDIA Triton Inference Server, specify the byte size for the pinned CPU memory pool that is used to transfer data from the host to the GPU device.	By default, `tritonserver` allocates 256MB of pinned system memory. However, you might want to increase the memory pool size to accelerate host-to-device data transfers at the cost of increased system memory usage.
`NIM_TRITON_RATE_LIMIT`	If set, this option configures Triton to rate limit the execution count throughout the ensemble model pipeline to the provided integer value. GPU-bound model inference is given priority, while other components of the ensemble model pipeline (pre-processors, post-processors, etc.) are given lower priority.	None
`NIM_TRITON_WORKER_INSTANCE_COUNT`	The number of pipeline worker instances to deploy. Assuming the pipeline’s dynamic batching parameters are set sufficiently high, this value should be increased to reduce request queuing at the cost of higher CPU, GPU, and VRAM resource use. For resource constrained deployments, keep this value set to `1`. The Pareto-optimal value is dependent on the workload, GPU, and available resources, but a value of `3` or `4` typically gives the best results.	`1`