NIM CLI#

The NIM CLI is a downloadable binary that makes it easy to experiment with NVIDIA NIM for LLMs.

Early release disclaimer#

The NIM CLI is in an early-release state. Known limitations include:

  • Does not work on Ubuntu < 20

  • Has not been tested thoroughly on Ubuntu < 22

  • Cannot use the chat subcommand with LoRAs

  • Is currently hard-wired to the 24.05.rc3 release

  • Can only serve traffic on port 8000 on the host machine

Installation and Setup#

You can download the CLI from the nvidian/nim-llm-dev Resource section in the browser or with this NGC CLI command:

ngc registry resource download-version nvidian/nim-llm-dev/nim-cli

It is a single binary built for the x86 architecture. After downloading, run chmod +x nim to make it executable and add it to your PATH.
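
For example (the install location below is only an illustration; any directory on your PATH works):

# make the downloaded binary executable
chmod +x nim

# move it onto your PATH
sudo mv nim /usr/local/bin/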

If you do not have the ngc CLI tool, refer to the NGC CLI documentation for information on downloading and configuring the tool.
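
If you have the tool but have not configured it yet, a minimal setup looks like this (see the NGC CLI documentation for the full set of options):

# interactively prompts for your NGC API key, org, and team
ngc config set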

Inspecting the available commands#

Inspect the available commands using the -h flag.

nim -h

Output:

NVIDIA LLM Inference Command Line Inference (CLI)

Usage: nim <COMMAND>

Commands:
  list       List models (see subcommands)
  run        Run a NIM
  status     Check the status of running NIMs
  logs       Tail the logs for a running NIM
  stop       Stop a NIM
  benchmark  Run benchmarks against a running NIM
  help       Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

List available NIMs - nim list all#

The command nim list all can be used to list the NIMs available to run.

nim list all

Output:

+--------------------------+
| Model                    |
+--------------------------+
| meta/llama3-70b-instruct |
+--------------------------+
| meta/llama3-8b-instruct  |
+--------------------------+

Running a NIM - nim run <model-id>#

To run a NIM, use the command nim run <model-id>. The following NIM CLI command downloads the meta-llama3-8b-instruct model if it is not already present and prepares it so you can run inference against it.

nim run meta-llama3-8b-instruct

Output:

[NIM] Docker Run Command:

#!/bin/bash

docker run --gpus='"device=1"' --rm -ti  --label name=nim__meta-llama3-8b-instruct__port-8000 \
  -v ~/.cache/ngc:/home/nvs/.cache \
  -e "MODEL_NAME=meta-llama3-8b-instruct" \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p "8000:8000" \
  nvcr.io/nvidian/nim-llm-dev/meta-llama3-8b-instruct:24.05.rc3

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 24.05
Model: /model-store/meta-llama3-8b-instruct

meta-llama3-8b-instruct
Downloading model from NGC...
Checking meta-llama3-8b-instruct versions...
Downloading hf version of meta-llama3-8b-instruct...
...
... <logging omitted>
...
INFO:     Started server process [115]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 05-14 01:14:20 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

Our NIM is now running! You can check the models deployed in that NIM by sending a GET request to the v1/models endpoint:

curl -X GET localhost:8000/v1/models

Output:

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama3-8b-instruct",
      "object": "model",
      "created": 1715649304,
      "owned_by": "vllm",
      "root": "meta-llama3-8b-instruct",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-2a8eb3aa92fb42f0b22525f2faec0b23",
          "object": "model_permission",
          "created": 1715649304,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
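
To pull out just the model IDs from that response, you can pipe it through jq (assuming jq is installed on your machine):

curl -s localhost:8000/v1/models | jq -r '.data[].id'

Output:

meta-llama3-8b-instruct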

Running a NIM with LoRAs#

First, download some meta-llama3-8b-instruct LoRAs to ~/loras on your local machine:

export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY && cd $LOCAL_PEFT_DIRECTORY

# downloading .nemo loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-squad-v1"

# downloading HF loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-squad-v1"

chmod -R 777 $LOCAL_PEFT_DIRECTORY

Then run the NIM:

nim run meta-llama3-8b-instruct --peft-path $LOCAL_PEFT_DIRECTORY

Note that as of 24.05.rc3, the LoRAs in the directory passed to --peft-path must be for the base model. If you have LoRAs for other base models, you will encounter errors.
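
As a sanity check once the NIM is up, you can query the v1/models endpoint again; the expectation (an assumption for this release, not shown in the output above) is that the loaded LoRAs appear as additional model entries alongside the base model:

# list model IDs; LoRA names are expected to show up next to the base model
curl -s localhost:8000/v1/models | jq -r '.data[].id'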

Sending a Completion to a NIM - nim completion ...#

You can send a Completion request to a deployed NIM with:

nim completion --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct

Output:

"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?"

Enable streaming using the --streaming flag.

nim completion --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct --streaming

Chatting with a NIM - nim chat ...#

You can chat with a NIM by sending a Chat Completion request with:

nim chat --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct

Output:

"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?"

Enable streaming using the --streaming flag.

nim chat --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct --streaming
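
Under the hood, the chat subcommand talks to the NIM's OpenAI-compatible v1/chat/completions endpoint (the same endpoint the benchmark command below targets), so you can send an equivalent request directly with curl. A minimal sketch, assuming the standard OpenAI request schema:

curl -X POST localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Hi!"}],
        "max_tokens": 1000
      }'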

Check the status of a running NIM - nim status#

The status of currently running NIMs can be checked with:

nim status

Output:

+--------------------------+-------------------+------------------------+
| NIM NAME                 | STATUS            | PORTS                  |
+--------------------------+-------------------+------------------------+
| /meta-llama3-8b-instruct | Up About a minute | 8000->8000,8000->8000, |
+--------------------------+-------------------+------------------------+

Inspecting the logs of the NIM - nim logs <model-id>#

The logs of a running NIM can be checked by using nim logs <model-id>. To inspect the logs of the NIM that was launched in the previous section, run the following command:

nim logs meta-llama3-8b-instruct

Benchmarking a NIM - nim benchmark <model-id>#

nim benchmark <model-id> runs a benchmark against a deployed NIM using the GenAI-Perf tool from the Triton Inference Server SDK. To benchmark the NIM that was launched in the previous section, run the following command:

nim benchmark meta-llama3-8b-instruct

Output:

Directories created successfully.

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 24.04 (build 90085241)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

genai-perf - INFO - Running Perf Analyzer : 'perf_analyzer -m nvidian/nim-llm-dev/meta-llama3-8b-instruct@hf --async --input-data llm_inputs.json --endpoint v1/chat/completions --service-kind openai -u http://localhost:8000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file profile_export.json --concurrency-range 5  -i http'