Model Profiles#

A NIM model profile defines which model engines NIM can use. Each profile is identified by a unique string derived from a hash of the profile contents.

Users may select a profile at deployment time by following the Profile Selection steps. If the user does not manually select a profile at deployment time, NIM automatically chooses a generic, non-optimized profile. To understand how profiles and their corresponding engines are created, see How Profiles are Created.

Model profiles are embedded within the NIM container in a Model Manifest file, which is by default placed at /opt/nim/etc/config/default/model_manifest.yaml within the container filesystem.
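As a sketch, you can dump the embedded manifest without starting the service by overriding the container entrypoint. The image name and tag below are placeholders for your NIM image; only the manifest path is taken from this page:

```shell
# Print the embedded model manifest from a NIM container image.
# Replace <image>:<tag> with the NIM image you pulled from NGC.
docker run --rm --entrypoint cat nvcr.io/nim/<image>:<tag> \
  /opt/nim/etc/config/default/model_manifest.yaml
```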

Profile Selection#

To select a profile for deployment, set a specific profile ID with -e NIM_MANIFEST_PROFILE=<value>. Choose the profile ID for your GPU from the following list:

| GPU | GPU Memory (GB) | Precision | Profile ID |
| --- | --- | --- | --- |
| H100 SXM | 80 | FP16 | 420b5bb2-cd51-4dac-be21-759f3df4e441 |
| H100 PCIe | 80 | FP16 | 420b5bb2-cd51-4dac-be21-759f3df4e441 |
| A100 SXM | 80 | FP16 | 3f5c5926-add5-402d-8877-c0798ffbb9e9 |
| A100 PCIe | 80 | FP16 | 3f5c5926-add5-402d-8877-c0798ffbb9e9 |
| L40S | 48 | FP16 | 3c28d914-ebbd-418b-8c5a-2a0da64bf4e3 |
| A10G | 24 | FP16 | d892ff5f-a51e-417b-bda8-63a004f4c3d7 |
| A6000 Ada | 48 | FP16 | a19c7bf8-b6c4-47b3-b519-0b67840c9951 |
| RTX 4090 | 48 | FP16 | 38ce5361-fd45-4d48-94b0-1ca7eb3c5d0b |

If you run on an unsupported GPU, NIM falls back to a generic, non-optimized profile with profile ID afd81bb5-1b82-4816-a1bd-312dd380e4d1.
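Putting the pieces together, a deployment pinned to one of the profiles above looks like the following sketch. The image name, tag, and port mapping are placeholders for your environment; the NIM_MANIFEST_PROFILE variable and the profile ID (H100 FP16) come from the table above:

```shell
# Launch a NIM container pinned to the H100 SXM/PCIe FP16 profile.
# Replace <image>:<tag> with your NIM image from NGC.
docker run --rm --gpus all \
  -e NIM_MANIFEST_PROFILE=420b5bb2-cd51-4dac-be21-759f3df4e441 \
  -p 8000:8000 \
  nvcr.io/nim/<image>:<tag>
```

If NIM_MANIFEST_PROFILE is omitted, NIM selects a generic, non-optimized profile automatically, as described under Profile Selection.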

How Profiles are Created#

NIM microservices have two main categories of profiles: optimized and generic. Optimized profiles are created for a subset of GPUs and models and apply model- and hardware-specific optimizations intended to improve the performance of large language models. Over time, the breadth of models and GPUs for which optimized engines exist will increase. However, if an optimized engine does not exist for a particular combination of model and GPU configuration, NIM falls back to a generic profile.

Currently, optimized profiles leverage pre-compiled TensorRT engines, while generic profiles utilize ONNX.

Quantization#

For some models and GPU configurations, quantized engines with reduced numerical precision are available. Currently, NV-CLIP NIM supports FP16 quantization for the GPU profiles listed above.