NeMo Deploy#

Deploy and connect to your own self-hosted model endpoints using NeMo’s Export and Deploy module.

Before You Start#

Model Name Specification#

When initializing NemoQueryLLM, specify the name of the model you want to query. Although NemoQueryLLM is designed to query a single model, NeMo Curator lets you change which model on your local server handles each request, as in the sketch below.
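
The following is a minimal, illustrative sketch of per-request model switching. The names "model-a" and "model-b" are placeholders, and it assumes both models are deployed on the same local server; conversation formatters are covered in the next section.

from nemo.deploy.nlp import NemoQueryLLM
from nemo_curator import NemoDeployClient
from nemo_curator.synthetic import Mixtral8x7BFormatter

# "model-a" and "model-b" are placeholder names; both are assumed to be
# deployed on the same local server.
nemo_client = NemoQueryLLM(url="localhost:8000", model_name="model-a")
client = NemoDeployClient(nemo_client)
prompt = [{"role": "user", "content": "Hello!"}]

# The model argument on each call selects which deployed model serves the
# request; pass the conversation formatter that matches that model.
first = client.query_model(
    model="model-a",
    messages=prompt,
    conversation_formatter=Mixtral8x7BFormatter(),
)
second = client.query_model(
    model="model-b",
    messages=prompt,
    conversation_formatter=Mixtral8x7BFormatter(),
)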

Conversation Formatting#

Under the hood, a large language model takes a single tokenized string as input, not a list of conversation turns. Each model is trained with a specific conversation format during alignment. For example, Mixtral-8x7B-Instruct-v0.1 uses:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

While OpenAI API services handle this formatting internally, with NeMo Deploy you must specify the format yourself. NeMo Curator provides formatters for common models (a sketch of a custom formatter follows the list):

  • Mixtral8x7BFormatter for Mixtral-8x7B-Instruct-v0.1

  • NemotronFormatter for Nemotron-4 340B
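
If your model is not covered by a built-in formatter, you can write your own. The following is a minimal sketch, assuming the ConversationFormatter base class is importable from nemo_curator.services and exposes a format_conversation method that returns the prompt string (check your installed version); the template tokens mirror the Mixtral example above and must be adapted to your model.

from nemo_curator.services import ConversationFormatter


class MyMixtralStyleFormatter(ConversationFormatter):
    # Hypothetical formatter: renders a list of conversation turns into a
    # single prompt string using Mixtral-style [INST] tokens. Replace the
    # tokens with the format your model was aligned with.
    def format_conversation(self, conv: list) -> str:
        prompt = "<s>"
        for turn in conv:
            if turn["role"] == "user":
                prompt += f" [INST] {turn['content']} [/INST]"
            elif turn["role"] == "assistant":
                prompt += f" {turn['content']}</s>"
        return prompt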


Usage#

After deploying a model following the NeMo Deploy Guide, you can query it like this:

from nemo.deploy.nlp import NemoQueryLLM
from nemo_curator import NemoDeployClient
from nemo_curator.synthetic import Mixtral8x7BFormatter

model = "mistralai/mixtral-8x7b-instruct-v0.1"

# Point NemoQueryLLM at the local server and wrap it in NeMo Curator's client.
nemo_client = NemoQueryLLM(url="localhost:8000", model_name=model)
client = NemoDeployClient(nemo_client)

# The conversation formatter renders the messages into Mixtral's [INST]
# template before the prompt is sent to the server.
responses = client.query_model(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Write a limerick about the wonders of GPU computing.",
        }
    ],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    conversation_formatter=Mixtral8x7BFormatter(),
)
print(responses[0])
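
Multi-turn conversations follow the same pattern: append earlier assistant turns to messages, and the formatter renders the full exchange into the model's template. A short sketch, reusing client, model, and responses from above:

# Follow-up request that includes the model's previous answer as context.
followup = client.query_model(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Write a limerick about the wonders of GPU computing.",
        },
        {"role": "assistant", "content": responses[0]},
        {"role": "user", "content": "Now rewrite it as a haiku."},
    ],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    conversation_formatter=Mixtral8x7BFormatter(),
)
print(followup[0])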