Date: 2025-01-13

Llama Stack API (Experimental)#

Warning: Support for the Llama Stack API in NIMs is experimental!

The Llama Stack API is a comprehensive set of interfaces developed by Meta for ML developers building on top of Llama foundation models. This API aims to standardize interactions with Llama models, simplifying the developer experience and fostering innovation across the Llama ecosystem. The Llama Stack encompasses various components of the model lifecycle, including inference, fine-tuning, evaluations, and synthetic data generation.

With the Llama Stack API, developers can easily integrate Llama models into their applications, leverage tool-calling capabilities, and build sophisticated AI systems. This documentation provides an overview of how to use the Python bindings for the Llama Stack API, focusing on chat completions and tool use.

For the full API documentation and source code, please visit the Llama Stack GitHub repository.

Installation#

To get started with the Llama Stack API, install the required packages with pip:

    pip install llama-toolchain llama-models llama-agentic-system

These packages provide the core functionality for working with the Llama Stack API.
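After installing, it can help to confirm that the packages are importable before running the examples. The following is a minimal sketch using only the standard library. The module names `llama_toolchain` and `llama_models` come from the imports used in this documentation; `llama_agentic_system` is an assumed import name derived from the pip package name, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Top-level import names to check. The first two appear in the examples in
# this documentation; the third is an assumption based on the pip package name.
REQUIRED_MODULES = ["llama_toolchain", "llama_models", "llama_agentic_system"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Llama Stack client modules are importable.")
```

If any module is reported as missing, rerunning the pip command in the active environment is the usual first step.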

Basic Usage#

Here’s a simple example of how to use the Llama Stack API for a chat completion:

    import asyncio

    from llama_toolchain.inference.client import InferenceClient
    from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
    from llama_toolchain.inference.event_logger import EventLogger
    from llama_models.llama3_1.api.datatypes import SamplingParams


    async def main():
        # Point the client at the NIM's experimental Llama Stack endpoint.
        client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

        message = UserMessage(content="Explain the concept of recursion in programming.")
        request = ChatCompletionRequest(
            model="meta/llama-3.1-70b-instruct",
            messages=[message],
            stream=False,  # request a non-streaming response
            sampling_params=SamplingParams(max_tokens=1024),
        )

        # Send the request and print each logged event.
        iterator = client.chat_completion(request)
        async for log in EventLogger().log(iterator):
            log.print()


    asyncio.run(main())
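Because the endpoint is experimental, a request like the one above may hang if the server is unreachable or slow. One way to guard against that is to wrap the coroutine in a timeout. This is a sketch using only `asyncio` from the standard library; `run_with_timeout` and `fake_request` are hypothetical names, and `fake_request` merely sleeps in place of a real chat completion call.

```python
import asyncio


async def run_with_timeout(coro, seconds):
    """Await `coro`; return None if it does not finish within `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def fake_request(delay):
    # Hypothetical stand-in for a real chat completion call; it only sleeps.
    await asyncio.sleep(delay)
    return "done"


# The stand-in finishes well inside the one-second limit.
print(asyncio.run(run_with_timeout(fake_request(0.01), 1.0)))
```

Under the same assumption, the `main()` coroutine from the example could be run as `asyncio.run(run_with_timeout(main(), 60))`, where the 60-second limit is an arbitrary illustrative value.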

Streaming Responses#

To receive streaming responses, set stream=True in the ChatCompletionRequest:

    import asyncio

    from llama_toolchain.inference.client import InferenceClient
    from llama_toolchain.inference.api import ChatCompletionRequest, UserMessage
    from llama_toolchain.inference.event_logger import EventLogger
    from llama_models.llama3_1.api.datatypes import SamplingParams


    async def stream_chat():
        # Point the client at the NIM's experimental Llama Stack endpoint.
        client = InferenceClient("http://0.0.0.0:8000/experimental/ls")

        message = UserMessage(content="Write a short story about a time-traveling scientist.")
        request = ChatCompletionRequest(
            model="meta/llama-3.1-70b-instruct",
            messages=[message],
            stream=True,  # request a streaming response
            sampling_params=SamplingParams(max_tokens=1024),
        )

        # Print each logged event as the response is streamed back.
        iterator = client.chat_completion(request)
        async for log in EventLogger().log(iterator):
            log.print()


    asyncio.run(stream_chat())
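The streaming example consumes its results with `async for`, handling each item as it arrives instead of waiting for the full response. The sketch below illustrates that consumption pattern with only the standard library. It does not call the Llama Stack API: `fake_token_stream` is a hypothetical stand-in that yields one word at a time, and `collect` is a hypothetical helper.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical stand-in for a streamed response: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(delay)
        yield word


async def collect(stream, on_chunk=None):
    """Consume an async iterator, optionally handling each chunk as it arrives."""
    chunks = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print partial output immediately
        chunks.append(chunk)
    return " ".join(chunks)


# Each word is printed as it is yielded; the joined text is returned at the end.
result = asyncio.run(collect(fake_token_stream("a short story"), on_chunk=print))
```

The callback makes it possible to show partial output to a user right away while still assembling the complete text for later use.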