Model Profiles#

A NIM model profile defines which model engines NIM can use. Each profile is identified by a unique string derived from a hash of the profile contents.

Users may select a profile at deployment time by following the Profile Selection steps. If the user does not manually select a profile at deployment time, NIM automatically chooses a generic, non-optimized profile. To understand how profiles and their corresponding engines are created, see How Profiles are Created.

Model profiles are embedded within the NIM container in a Model Manifest file, which is by default placed at /opt/nim/etc/config/default/model_manifest.yaml within the container filesystem.
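As a sketch, you can dump the embedded manifest without starting the service by overriding the container entrypoint. The image name and tag below are placeholders for your NIM image; only the manifest path is taken from this page:

```shell
# Print the embedded model manifest from a NIM container image.
# Replace <image>:<tag> with the NIM image you pulled from NGC.
docker run --rm --entrypoint cat nvcr.io/nim/<image>:<tag> \
  /opt/nim/etc/config/default/model_manifest.yaml
```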

Profile Selection#

To select a profile for deployment, set a specific profile ID with -e NIM_MANIFEST_PROFILE=<value>. Choose the profile ID for your GPU from the following list:

| GPU | GPU Memory (GB) | Precision | Profile ID |
| --- | --- | --- | --- |
| H100 SXM | 80 | FP16 | 420b5bb2-cd51-4dac-be21-759f3df4e441 |
| H100 PCIe | 80 | FP16 | 420b5bb2-cd51-4dac-be21-759f3df4e441 |
| A100 SXM | 80 | FP16 | 3f5c5926-add5-402d-8877-c0798ffbb9e9 |
| A100 PCIe | 80 | FP16 | 3f5c5926-add5-402d-8877-c0798ffbb9e9 |
| L40S | 48 | FP16 | 3c28d914-ebbd-418b-8c5a-2a0da64bf4e3 |
| A10G | 24 | FP16 | d892ff5f-a51e-417b-bda8-63a004f4c3d7 |
| A6000 Ada | 48 | FP16 | a19c7bf8-b6c4-47b3-b519-0b67840c9951 |
| RTX 4090 | 48 | FP16 | 38ce5361-fd45-4d48-94b0-1ca7eb3c5d0b |

If you run on an unsupported GPU, NIM falls back to a generic, non-optimized profile with profile ID afd81bb5-1b82-4816-a1bd-312dd380e4d1.
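Putting the pieces together, a deployment pinned to one of the profiles above looks like the following sketch. The image name, tag, and port mapping are placeholders for your environment; the NIM_MANIFEST_PROFILE variable and the profile ID (H100 FP16) come from the table above:

```shell
# Launch a NIM container pinned to the H100 SXM/PCIe FP16 profile.
# Replace <image>:<tag> with your NIM image from NGC.
docker run --rm --gpus all \
  -e NIM_MANIFEST_PROFILE=420b5bb2-cd51-4dac-be21-759f3df4e441 \
  -p 8000:8000 \
  nvcr.io/nim/<image>:<tag>
```

If NIM_MANIFEST_PROFILE is omitted, NIM selects a generic, non-optimized profile automatically, as described under Profile Selection.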

How Profiles are Created#

NIM microservices have two main categories of profiles: optimized and generic. Optimized profiles are created for a subset of GPUs and models and apply model- and hardware-specific optimizations intended to improve the performance of large language models. Over time, the breadth of models and GPUs for which optimized engines exist will increase. However, if an optimized engine does not exist for a particular combination of model and GPU configuration, NIM falls back to a generic profile.

Currently, optimized profiles leverage pre-compiled TensorRT engines, while generic profiles utilize ONNX.

Quantization#

For some models and GPU configurations, quantized engines with reduced numerical precision are available. Currently, NV-CLIP NIM supports FP16 quantization for the GPU profiles listed above.