Self-hosting Llama Guard using vLLM
Detailed below are the steps to self-host Llama Guard using vLLM and Hugging Face. Alternatively, you can use your own custom inference code with the downloaded model weights.
Get access to the Llama Guard model from Meta on Hugging Face. See this page for more details.
Log in to Hugging Face with your account token:
huggingface-cli login
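If you prefer to authenticate from a script rather than the CLI, the huggingface_hub package provides an equivalent login helper. This is a minimal sketch; the token string is a placeholder, not a real value.

```python
# Sketch: programmatic alternative to `huggingface-cli login`.
# The token below is a placeholder; use your own Hugging Face access token.
from huggingface_hub import login

login(token="hf_...")  # placeholder token
```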
Here, we use vLLM to host a Llama Guard inference endpoint in OpenAI-compatible mode:
pip install vllm
python -m vllm.entrypoints.openai.api_server --port 5123 --model meta-llama/LlamaGuard-7b
This will serve up the vLLM inference server on http://localhost:5123/.
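To verify that the endpoint is reachable, you can send a request to the OpenAI-compatible completions API. The snippet below is only a quick smoke test with a placeholder prompt; the actual Llama Guard moderation prompt is assembled by your guardrails configuration.

```python
# Quick smoke test against the vLLM OpenAI-compatible server started above.
# The prompt here is a placeholder, not the Llama Guard task template.
import requests

response = requests.post(
    "http://localhost:5123/v1/completions",
    json={
        "model": "meta-llama/LlamaGuard-7b",
        "prompt": "Hello",
        "max_tokens": 16,
        "temperature": 0.0,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```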
Set the host and port in your bot's YAML configuration files (example config). If you're running the `nemoguardrails` app on another server, remember to replace `localhost` with your vLLM server's public IP address.
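For reference, the relevant part of a config.yml might look roughly like the snippet below. This is an illustrative sketch that assumes the llama_guard model type and the vllm_openai engine; consult the example config linked above for the exact schema your nemoguardrails version expects.

```yaml
models:
  # Main application LLM (shown only for context; engine and model are illustrative).
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct

  # Llama Guard served by the vLLM OpenAI-compatible endpoint started above.
  - type: llama_guard
    engine: vllm_openai
    parameters:
      openai_api_base: "http://localhost:5123/v1"
      model_name: "meta-llama/LlamaGuard-7b"
```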