Llama Stack API (Experimental)
Support for the Llama Stack API in NIMs is experimental!
The Llama Stack API is a comprehensive set of interfaces developed by Meta for ML developers building on top of Llama foundation models. This API aims to standardize interactions with Llama models, simplifying the developer experience and fostering innovation across the Llama ecosystem. The Llama Stack encompasses various components of the model lifecycle, including inference, fine-tuning, evaluations, and synthetic data generation.
With the Llama Stack API, developers can easily integrate Llama models into their applications, leverage tool-calling capabilities, and build sophisticated AI systems. This documentation provides an overview of how to use the Python bindings for the Llama Stack API, focusing on chat completions and tool use.
For the full API documentation and source code, please visit the Llama Stack GitHub repository.
To get started with the Llama Stack API, you’ll need to install the necessary packages. You can do this using pip:
pip install llama-toolchain llama-models llama-agentic-system
These packages provide the core functionality for working with the Llama Stack API.
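To confirm the installation before running the examples below, you can check that the core modules import cleanly; the imports here are the same ones used throughout this page:

# Quick sanity check that the Llama Stack packages are importable
from llama_toolchain.inference.client import InferenceClient
from llama_models.llama3_1.api.datatypes import SamplingParams

print("Llama Stack packages installed successfully")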
Here’s a simple example of how to use the Llama Stack API for a chat completion:
import asyncio
from llama_toolchain.inference.client import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_toolchain.inference.event_logger import EventLogger
from llama_models.llama3_1.api.datatypes import SamplingParams
async def main():
    # Point the client at the NIM's experimental Llama Stack endpoint
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Explain the concept of recursion in programming.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        stream=False,
        sampling_params=SamplingParams(max_tokens=1024),
    )

    # chat_completion returns an async iterator of events; EventLogger prints them
    iterator = client.chat_completion(request)
    async for log in EventLogger().log(iterator):
        log.print()

asyncio.run(main())
To receive streaming responses, set stream=True in the ChatCompletionRequest:
import asyncio
from llama_toolchain.inference.client import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_toolchain.inference.event_logger import EventLogger
from llama_models.llama3_1.api.datatypes import SamplingParams
async def stream_chat():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Write a short story about a time-traveling scientist.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        stream=True,  # stream tokens back as they are generated
        sampling_params=SamplingParams(max_tokens=1024),
    )

    # Each streamed chunk is printed as it arrives
    iterator = client.chat_completion(request)
    async for log in EventLogger().log(iterator):
        log.print()

asyncio.run(stream_chat())
The Llama Stack API supports tool calling, allowing the model to interact with external functions.
Unlike the OpenAI API, the Llama Stack API supports only "auto" tool choice: when tools are defined, the model decides whether and when to call them.
Here’s an example:
import asyncio
from llama_toolchain.inference.client import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition
from llama_toolchain.inference.event_logger import EventLogger
from llama_models.llama3_1.api.datatypes import SamplingParams
# Describe the external function that the model is allowed to call
weather_tool = ToolDefinition(
    tool_name="get_current_weather",
    description="Get the current weather for a location",
    parameters={
        "location": ToolParamDefinition(
            param_type="string",
            description="The city and state, e.g. San Francisco, CA",
            required=True,
        ),
        "unit": ToolParamDefinition(
            param_type="string",
            description="The temperature unit (celsius or fahrenheit)",
            required=True,
        ),
    },
)

async def tool_calling_example():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="What's the weather like in New York City?")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        available_tools=[weather_tool],  # tools the model may choose to call
        sampling_params=SamplingParams(max_tokens=1024),
    )

    # If the model decides to call the tool, the logged events include the tool call
    iterator = client.chat_completion(request)
    async for log in EventLogger().log(iterator):
        log.print()

asyncio.run(tool_calling_example())
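The example above only surfaces the model's tool call; executing the tool and returning its result to the model is up to your application. The sketch below shows one way to close that loop. It assumes a ToolResponseMessage type with call_id, tool_name, and content fields in llama_models.llama3_1.api.datatypes, and a locally implemented get_current_weather function; treat the exact type and field names as assumptions for this experimental release rather than a verbatim API reference.

# Sketch only: returning a tool result to the model. ToolResponseMessage and its
# fields (call_id, tool_name, content) are assumptions for this release and may
# need adjusting; depending on the release you may also need to echo the model's
# tool-call message before the tool response.
from llama_models.llama3_1.api.datatypes import ToolResponseMessage

def get_current_weather(location: str, unit: str) -> str:
    # Stand-in for a real weather lookup, kept local so the sketch is self-contained
    return f"The weather in {location} is 22 degrees {unit}."

async def answer_with_tool_result(call_id: str):
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    # Run the tool with the arguments the model asked for
    tool_output = get_current_weather(location="New York City, NY", unit="celsius")

    # Send the tool result back so the model can produce a final, grounded answer
    followup = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[
            UserMessage(content="What's the weather like in New York City?"),
            ToolResponseMessage(call_id=call_id, tool_name="get_current_weather", content=tool_output),
        ],
        available_tools=[weather_tool],
        sampling_params=SamplingParams(max_tokens=1024),
    )
    async for log in EventLogger().log(client.chat_completion(followup)):
        log.print()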