Large Language Models (Latest)

Llama Stack API (Experimental)

Warning

Support for the Llama Stack API in NIMs is experimental!

The Llama Stack API is a comprehensive set of interfaces developed by Meta for ML developers building on top of Llama foundation models. This API aims to standardize interactions with Llama models, simplifying the developer experience and fostering innovation across the Llama ecosystem. The Llama Stack encompasses various components of the model lifecycle, including inference, fine-tuning, evaluations, and synthetic data generation.

With the Llama Stack API, developers can easily integrate Llama models into their applications, leverage tool-calling capabilities, and build sophisticated AI systems. This documentation provides an overview of how to use the Python bindings for the Llama Stack API, focusing on chat completions and tool use.

For the full API documentation and source code, please visit the Llama Stack GitHub repository.

To get started with the Llama Stack API, you’ll need to install the necessary packages. You can do this using pip:

pip install llama-toolchain llama-models llama-agentic-system

These packages provide the core functionality for working with the Llama Stack API.
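To confirm the installation before writing any code, you can optionally try importing the client class used in the examples below. This is a minimal sanity check based on the import paths shown later in this page; if the imports fail, re-check the pip installation.

# Optional sanity check: these are the same import paths used in the
# examples that follow on this page.
from llama_toolchain.inference.client import InferenceClient
from llama_models.llama3_1.api.datatypes import SamplingParams

print("Llama Stack client libraries imported successfully")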

Here’s a simple example of how to use the Llama Stack API for a chat completion:

import asyncio

from llama_toolchain.inference.client import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_toolchain.inference.event_logger import EventLogger
from llama_models.llama3_1.api.datatypes import SamplingParams


async def main():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Explain the concept of recursion in programming.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        stream=False,
        sampling_params=SamplingParams(
            max_tokens=1024
        )
    )

    iterator = client.chat_completion(request)
    async for log in EventLogger().log(iterator):
        log.print()


asyncio.run(main())

To receive streaming responses, set stream=True in the ChatCompletionRequest:

import asyncio

from llama_toolchain.inference.client import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
from llama_toolchain.inference.event_logger import EventLogger
from llama_models.llama3_1.api.datatypes import SamplingParams


async def stream_chat():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="Write a short story about a time-traveling scientist.")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        stream=True,
        sampling_params=SamplingParams(
            max_tokens=1024
        )
    )

    iterator = client.chat_completion(request)
    async for log in EventLogger().log(iterator):
        log.print()


asyncio.run(stream_chat())

The Llama Stack API supports tool calling, allowing the model to interact with external functions.

Important

Unlike the OpenAI API, the Llama Stack API supports only "auto" tool choice: when tools are defined, the model itself decides whether and when to call them.

Here’s an example:

import asyncio

from llama_toolchain.inference.client import InferenceClient
from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage, ToolDefinition, ToolParamDefinition
from llama_toolchain.inference.event_logger import EventLogger
from llama_models.llama3_1.api.datatypes import SamplingParams

weather_tool = ToolDefinition(
    tool_name="get_current_weather",
    description="Get the current weather for a location",
    parameters={
        "location": ToolParamDefinition(
            param_type="string",
            description="The city and state, e.g. San Francisco, CA",
            required=True
        ),
        "unit": ToolParamDefinition(
            param_type="string",
            description="The temperature unit (celsius or fahrenheit)",
            required=True
        )
    }
)


async def tool_calling_example():
    client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

    message = UserMessage(content="What's the weather like in New York City?")
    request = ChatCompletionRequest(
        model="meta/llama-3.1-70b-instruct",
        messages=[message],
        available_tools=[weather_tool],
        sampling_params=SamplingParams(
            max_tokens=1024
        )
    )

    iterator = client.chat_completion(request)
    async for log in EventLogger().log(iterator):
        log.print()


asyncio.run(tool_calling_example())
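Note that the model only decides that a tool should be called and with which arguments; executing the function is your application's responsibility. The sketch below shows one way an application might back the get_current_weather tool with a local Python function. The function body and the example arguments are hypothetical placeholders, not part of the Llama Stack API; in a real application you would parse the tool call from the model's response and return the result to the model in a follow-up message.

# Hypothetical implementation of the function behind the get_current_weather
# tool. A real application would query a weather service; this returns a
# canned value so the control flow is easy to follow.
def get_current_weather(location: str, unit: str) -> dict:
    return {
        "location": location,
        "temperature": 22 if unit == "celsius" else 72,
        "unit": unit,
        "conditions": "partly cloudy",
    }


# Example arguments such as the model might produce for the request above
# (placeholder values for illustration only).
tool_call_arguments = {"location": "New York City, NY", "unit": "celsius"}
result = get_current_weather(**tool_call_arguments)
print(result)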
