Large Language Models (1.1.0)

NIM CLI

The NIM CLI is a downloadable binary that makes it easy to experiment with NVIDIA NIM for LLMs.

This is in an early release state. Some known limitations:

  • Does not work on Ubuntu < 20

  • Has not been tested thoroughly on Ubuntu < 22

  • Cannot use chat subcommand with LoRAs

  • Is currently hard-wired to the 24.05-rc3 release

  • Can only serve traffic on port 8000 on the host machine

You can download the CLI from the nvidian/nim-llm-dev registry in your browser or with this NGC CLI command:


ngc registry resource download-version nvidian/nim-llm-dev/nim-cli

It is a single binary built for the x86 architecture. After downloading, run chmod +x nim to make it executable and add it to your PATH.
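For example, a minimal install sequence could look like the following (the /usr/local/bin destination is only an example; any directory on your PATH works):

chmod +x nim                      # make the downloaded binary executable
sudo mv nim /usr/local/bin/nim    # place it somewhere on your PATH (example location)
nim -V                            # confirm the install by printing the version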

If you do not have the ngc CLI tool, refer to the NGC CLI documentation for information on downloading and configuring the tool.

Inspect the available commands using the -h flag.


nim -h

Output:


NVIDIA LLM Inference Command Line Inference (CLI)

Usage: nim <COMMAND>

Commands:
  list       List models (see subcommands)
  run        Run a NIM
  status     Check the status of running NIMs
  logs       Tail the logs for a running NIM
  stop       Stop a NIM
  benchmark  Run benchmarks against a running NIM
  help       Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Use nim list all to list the NIMs available to run.


nim list all

Output:


+--------------------------+
| Model                    |
+--------------------------+
| meta/llama3-70b-instruct |
+--------------------------+
| meta/llama3-8b-instruct  |
+--------------------------+

To run a NIM, use the command nim run <model-id>. The following NIM CLI command downloads the meta-llama3-8b-instruct model if it isn’t already present and prepares it so you can run inference against it.


nim run meta-llama3-8b-instruct

Output:


[NIM] Docker Run Command:

#!/bin/bash
docker run --gpus='"device=1"' --rm -ti --label name=nim__meta-llama3-8b-instruct__port-8000 \
    -v ~/.cache/ngc:/home/nvs/.cache \
    -e "MODEL_NAME=meta-llama3-8b-instruct" \
    -e NGC_API_KEY=$NGC_API_KEY \
    -p "8000:8000" \
    nvcr.io/nvidian/nim-llm-dev/meta-llama3-8b-instruct:24.05.rc3

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 24.05
Model: /model-store/meta-llama3-8b-instruct

meta-llama3-8b-instruct
Downloading model from NGC...
Checking meta-llama3-8b-instruct versions...
Downloading hf version of meta-llama3-8b-instruct...
...
... <logging omitted>
...
INFO:     Started server process [115]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 05-14 01:14:20 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
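Model download and server startup can take several minutes. One way to wait for the server, assuming the container exposes a v1/health/ready endpoint on port 8000 (if it does not, polling v1/models as shown below works just as well):

# poll until the server reports ready (assumes a v1/health/ready endpoint exists)
until curl -sf localhost:8000/v1/health/ready > /dev/null; do
    sleep 5
done
echo "NIM is ready"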

Our NIM is now running! You can check the models deployed in that NIM by sending a GET request to the v1/models endpoint:


curl -X GET localhost:8000/v1/models

Output:


{ "object": "list", "data": [ { "id": "meta-llama3-8b-instruct", "object": "model", "created": 1715649304, "owned_by": "vllm", "root": "meta-llama3-8b-instruct", "parent": null, "permission": [ { "id": "modelperm-2a8eb3aa92fb42f0b22525f2faec0b23", "object": "model_permission", "created": 1715649304, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] }

Running a NIM with LoRAs

First, download some meta-llama3-8b-instruct LoRAs to ~/loras on your local machine:


export LOCAL_PEFT_DIRECTORY=~/loras
mkdir $LOCAL_PEFT_DIRECTORY && cd $LOCAL_PEFT_DIRECTORY

# downloading .nemo loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:nemo-squad-v1"

# downloading HF loras
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-math-v1"
ngc registry model download-version "nvidian/nim-llm-dev/llama3-8b-instruct-lora:hf-squad-v1"

chmod -R 777 $LOCAL_PEFT_DIRECTORY

Then run the NIM:


nim run meta-llama3-8b-instruct --peft-path $LOCAL_PEFT_DIRECTORY

Note that as of 24.05.rc3, the LoRAs in the directory passed to --peft-path must be for the base model. If you have LoRAs for other base models, you will encounter errors.
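Loaded LoRA adapters are expected to appear alongside the base model in the v1/models listing (an assumption about the adapter-serving behavior; jq is used here purely for readability), so a quick way to check which adapters were picked up is:

# list the model IDs served by the NIM; LoRA adapters should appear next to the base model
curl -s localhost:8000/v1/models | jq -r '.data[].id'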

You can send a Completion request to a deployed NIM:


nim completion --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct

Output:


"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?"

Enable streaming using the --streaming flag.


nim completion --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct --streaming

You can chat with a NIM by sending a Chat Completion request:


nim chat --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct

Output:


"Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?"

Enable streaming using the --streaming flag.


nim chat --prompt "Hi!" --max-tokens 1000 meta-llama3-8b-instruct --streaming

The status of currently running NIMs can be checked with:


nim status

Output:


+--------------------------+-------------------+------------------------+
| NIM NAME                 | STATUS            | PORTS                  |
+--------------------------+-------------------+------------------------+
| /meta-llama3-8b-instruct | Up About a minute | 8000->8000,8000->8000, |
+--------------------------+-------------------+------------------------+

The logs of a running NIM can be checked by using nim logs <model-id>. To inspect the logs of the NIM that was launched in the previous section, run the following command:


nim logs meta-llama3-8b-instruct
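Because the CLI launches the NIM as a Docker container labeled name=nim__meta-llama3-8b-instruct__port-8000 (see the docker run command echoed earlier), a roughly equivalent way to follow the logs is to query Docker directly:

# follow the container logs via the label set by the NIM CLI
docker logs -f $(docker ps -q --filter "label=name=nim__meta-llama3-8b-instruct__port-8000")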

nim benchmark <model-id> runs a benchmark against a deployed NIM using the GenAI-Perf tool from the Triton Inference Server SDK. To benchmark the NIM that was launched in the previous section, run the following command:


nim benchmark meta-llama3-8b-instruct

Output:


Directories created successfully.

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 24.04 (build 90085241)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

genai-perf - INFO - Running Perf Analyzer : 'perf_analyzer -m nvidian/nim-llm-dev/meta-llama3-8b-instruct@hf --async --input-data llm_inputs.json --endpoint v1/chat/completions --service-kind openai -u http://localhost:8000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file profile_export.json --concurrency-range 5 -i http'

© Copyright 2024, NVIDIA Corporation. Last updated on Sep 9, 2024.