Getting Started with TopicControl#
Prerequisites#
A host with Docker Engine. Refer to the instructions from Docker.
NVIDIA Container Toolkit installed and configured. Refer to the installation instructions in the toolkit documentation.
An active subscription to an NVIDIA AI Enterprise product or membership in the NVIDIA Developer Program. Access to the containers and models is restricted.
An NGC API key. The container uses the key to download the model from NGC. Refer to Generating Your NGC API Key in the NVIDIA NGC User Guide for more information.
When you create an NGC API personal key, select at least NGC Catalog from the Services Included menu. You can specify more services to use the key for additional purposes.
Starting the NIM Container#
Log in to NVIDIA NGC so you can pull the container.
Export your NGC API key as an environment variable:
$ export NGC_API_KEY="<nvapi-...>"
Log in to the registry:
$ docker login nvcr.io --username '$oauthtoken' --password-stdin <<< $NGC_API_KEY
Download the container:
$ docker pull nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-topic-control:1.0.0
Create a model cache directory on the host machine:
$ export LOCAL_NIM_CACHE=~/.cache/llama-nemoguard-topiccontrol
$ mkdir -p "${LOCAL_NIM_CACHE}"
$ chmod 700 "${LOCAL_NIM_CACHE}"
Run the container with the cache directory as a volume mount:
$ docker run -d \
    --name llama-nemoguard-topiccontrol \
    --gpus=all --runtime=nvidia \
    -e NGC_API_KEY \
    -e NIM_SERVED_MODEL_NAME="llama-3.1-nemoguard-8b-topic-control" \
    -e NIM_CUSTOM_MODEL_NAME="llama-3.1-nemoguard-8b-topic-control" \
    -u $(id -u) \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache/" \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-topic-control:1.0.0
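The --gpus=all argument exposes every GPU on the host to the container. If you want to restrict the container to a single device, you can pass a device selection instead. The following flag is a sketch that assumes you want the first GPU; adjust the device index for your system:

--gpus '"device=0"'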
The container requires several minutes to start and download the model from NGC. You can monitor the progress by running the docker logs llama-nemoguard-topiccontrol command.
Optional: Confirm the service is ready to respond to inference requests:
$ curl -X GET http://localhost:8000/v1/health/ready
Example Output
{"object":"health-response","message":"ready"}
Running Inference#
You can send requests to the v1/chat/completions endpoint to perform inference.
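Before writing a script, you can verify the endpoint with a single curl request. The following request is a sketch: the shortened system prompt and the user message are placeholders, and the response follows the OpenAI-compatible chat completions format.

$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama-3.1-nemoguard-8b-topic-control",
      "messages": [
        {"role": "system", "content": "You are an investor relations bot for ABC. Respond only to questions about publicly available ABC financial information. You must respond with \"on-topic\" or \"off-topic\"."},
        {"role": "user", "content": "Can you speculate on the potential impact of a recession on ABCs business?"}
      ],
      "max_tokens": 20,
      "temperature": 0.0
    }'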
The following steps demonstrate how to create a Python script that performs the following actions:
Connects to the microservice in the container that serves the topic control model.
Sends a prompt that provides topic control instructions to the model.
Create a development environment and install dependencies:
$ conda create -n evals python=3.10
$ conda activate evals
$ pip install requests
Create a file, such as topic_control_inference_example.py, with contents like the following example:

import argparse
from typing import List, Optional

import requests

TOPIC_SAFETY_OUTPUT_RESTRICTION = (
    'If any of the above conditions are violated, please respond with "off-topic". '
    'Otherwise, respond with "on-topic". '
    'You must respond with "on-topic" or "off-topic".'
)


class TopicGuard:
    """Minimal client for the topic control model served at the chat completions endpoint."""

    def __init__(
        self,
        host: str = "0.0.0.0",
        port: str = "8000",
        model_name: str = "llama-3.1-nemoguard-8b-topic-control",
    ):
        self.uri = f"http://{host}:{port}/v1/chat/completions"
        self.model_name = model_name

    def __call__(self, prompt: List[dict]) -> str:
        return self._call(prompt)

    def _call(self, prompt: List[dict], stop: Optional[List[str]] = None) -> str:
        try:
            response = requests.post(
                self.uri,
                headers={
                    "Content-Type": "application/json",
                    "Accept": "application/json",
                },
                json={
                    "model": self.model_name,
                    "messages": prompt,
                    "max_tokens": 20,
                    "top_p": 1,
                    "n": 1,
                    "temperature": 0.0,
                    "stream": False,
                    "frequency_penalty": 0.0,
                },
            )
            if response.status_code != 200:
                raise Exception(
                    f"Error response from the LLM. Status code: {response.status_code} {response.text}"
                )
            return response.json()["choices"][0]["message"]["content"].strip()
        except Exception as e:
            print(e)
            return "error"


def format_prompt(system_prompt: str, user_message: str) -> List[dict]:
    # Append the required on-topic/off-topic output instruction and build the chat messages.
    system_prompt = system_prompt.strip()
    if not system_prompt.endswith(TOPIC_SAFETY_OUTPUT_RESTRICTION):
        system_prompt = f"{system_prompt}\n\n{TOPIC_SAFETY_OUTPUT_RESTRICTION}"
    prompt = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    return prompt


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--nim_host", type=str, default="0.0.0.0")
    parser.add_argument("--nim_port", type=str, default="8000")
    parser.add_argument(
        "--nim_model_name", type=str, default="llama-3.1-nemoguard-8b-topic-control"
    )
    args = parser.parse_args()

    system_prompt = """You are to act as an investor relations bot for ABC, providing users with factual, publicly available information related to the company's financial performance and corporate updates.
Your role is to ensure that you respond only to relevant queries and adhere to the following guidelines:

1. Do not answer questions about future predictions, such as profit forecasts or future revenue outlook.
2. Do not provide any form of investment advice, including recommendations to buy, sell, or hold ABC stock or any other securities. Never recommend any stock or investment.
3. Do not engage in discussions that require personal opinions or subjective judgments. Never make any subjective statements about ABC, its stock or its products.
4. If a user asks about topics irrelevant to ABC's investor relations or financial performance, politely redirect the conversation or end the interaction.
5. Your responses should be professional, accurate, and compliant with investor relations guidelines, focusing solely on providing transparent, up-to-date information about ABC that is already publicly available."""

    user_message = (
        "Can you speculate on the potential impact of a recession on ABCs business?"
    )

    print(
        f"Using Nim inference mode with host: {args.nim_host} and port: {args.nim_port}"
    )

    topic_guard = TopicGuard(
        host=args.nim_host, port=args.nim_port, model_name=args.nim_model_name
    )
    prompt = format_prompt(system_prompt, user_message)

    response = topic_guard(prompt)

    print(f"For user message: {user_message}")
    print(f"\nResponse from TopicGuard model: {response}")
Run the script to perform inference:
$ python topic_control_inference_example.py
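You can also reuse the TopicGuard class from application code to gate user messages before they reach a downstream assistant. The following fragment is a sketch: it assumes the script above is saved as topic_control_inference_example.py in the same directory, that the container is running locally, and the shortened SYSTEM_PROMPT is a placeholder for the full investor relations prompt from the script.

from topic_control_inference_example import TopicGuard, format_prompt

# Placeholder for the full investor relations system prompt used in the script.
SYSTEM_PROMPT = (
    "You are to act as an investor relations bot for ABC, providing users with "
    "factual, publicly available information about the company's financial performance."
)

guard = TopicGuard(host="0.0.0.0", port="8000")

def is_on_topic(user_message: str) -> bool:
    # The model answers "on-topic" or "off-topic"; treat anything else as off-topic.
    verdict = guard(format_prompt(SYSTEM_PROMPT, user_message))
    return verdict.lower().startswith("on-topic")

if is_on_topic("What was ABC's revenue last quarter?"):
    pass  # Forward the question to the downstream assistant.
else:
    pass  # Redirect or decline, per the topic control policy.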
Stopping the Container#
The following commands stop and remove the running container.
$ docker stop llama-nemoguard-topiccontrol
$ docker rm llama-nemoguard-topiccontrol
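If you also want to reclaim the disk space used by the downloaded model, you can remove the cache directory that you created earlier. This step is optional and assumes the same LOCAL_NIM_CACHE value that you exported before starting the container:

$ rm -rf "${LOCAL_NIM_CACHE}"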