Quick Start Guide#

In this guide, we walk you through building a simple speech-to-speech voice assistant pipeline with the NVIDIA Pipecat and Pipecat-AI libraries and deploying it for testing.

Installing the NVIDIA Pipecat Library#

Create a Python virtual environment and use the pip command to install the nvidia-pipecat package.

python -m venv venv

source venv/bin/activate

pip install nvidia-pipecat

Note

The nvidia-pipecat package requires Python version 3.12. The source code of the nvidia-pipecat package is available in the ACE Controller GitHub repository.
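
If you are unsure which interpreter your virtual environment uses, here is a minimal sanity check you can run inside the activated environment:

# Minimal sketch: confirm the active interpreter is Python 3.12, as nvidia-pipecat requires.
import sys

assert sys.version_info[:2] == (3, 12), f"Python 3.12 required, found {sys.version.split()[0]}"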

Building a Basic Pipeline#

In this section, we build a simple speech-to-speech voice assistant pipeline using nvidia-pipecat together with the pipecat-ai library and deploy it for testing. The pipeline uses the WebSocket-based ACETransport, Riva ASR and TTS models, and the NVIDIA LLM Service. We recommend first reading the Pipecat documentation or the Pipecat Overview section to understand the core concepts.

Note

The source code for this example is available in the ACE Controller GitHub repository.

Configuring the Pipeline#

  1. Create a Python file containing the pipeline code.

    1. Create a file named bot.py and import all necessary frame processors from the pipecat and nvidia_pipecat packages to set up the pipeline.

    2. Configure the pipeline as follows:

      bot.py#
      # Import necessary libraries
      ...
      
      
      async def create_pipeline_task(pipeline_metadata: PipelineMetadata):
      
          # Using a transport for audio input/output
          transport = ACETransport(
              websocket=pipeline_metadata.websocket,
              params=ACETransportParams(
                  vad_enabled=True,
                  vad_analyzer=SileroVADAnalyzer(),
                  vad_audio_passthrough=True,
              ),
          )
          # ACETransport supports both RTSP and WebSocket protocols for communication and enables pipeline scaling to handle multiple streams simultaneously
      
      
          # Setting up an ASR service
          stt = RivaASRService(api_key=os.getenv("NVIDIA_API_KEY"))
      
          # Setting up a TTS service
          tts = RivaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
      
          # RivaASRService and RivaTTSService support both NVCF-hosted and RivaSkills local deployment endpoints
      
          # Setting up an LLM service
          llm = NvidiaLLMService(api_key=os.getenv("NVIDIA_API_KEY"))
      
          # NvidiaLLMService supports connecting to both NIM-hosted models and locally deployed NIM LLMs.
      
          # Create context for the LLM
          messages = [
              {
                  "role": "system",
                  "content": "You are a helpful assistant. Always answer as helpful, friendly and polite. Respond with one sentence or less than 75 characters. Do not respond with a bulleted or numbered list. Your output will be converted to audio so don't include special characters in your answers.",
              },
          ]
          context = OpenAILLMContext(messages)
          context_aggregator = llm.create_context_aggregator(context)
      
          # Creating a simple processing pipeline
          pipeline = Pipeline([
              transport.input(),  # WebSocket input from client
              stt,  # Speech-to-text
              context_aggregator.user(),
              llm,  # LLM
              tts,  # Text-to-speech
              transport.output(),  # WebSocket output to client
              context_aggregator.assistant(),
          ])
          task = PipelineTask(pipeline)
      
          # Handling client events and generating responses
          @transport.event_handler("on_client_connected")
          async def on_client_connected(transport, client):
              # Kick off the conversation.
              messages.append({"role": "system", "content": "Please introduce yourself to the user."})
              await task.queue_frames([LLMMessagesFrame(messages)])
          return task
      
      
      # Initialize the FastAPI application and configure routing
      app = FastAPI()
      app.include_router(websocket_router)
      runner = ACEPipelineRunner(pipeline_callback=create_pipeline_task)
      # Serve the Web UI client files
      app.mount("/static", StaticFiles(directory=os.path.join(os.path.dirname(__file__), "../static")), name="static")
      
      # Managing application lifecycle
      if __name__ == "__main__":
          uvicorn.run(app, host="0.0.0.0", port=8100)
      

    If you have deployed Pipecat pipelines before, note the following differences in the code above:

    • FastAPI Server: We use a FastAPI HTTP and WebSocket server so the pipeline can scale to multiple concurrent users.

    • ACEPipelineRunner: Instead of the Pipecat PipelineRunner, we use ACEPipelineRunner to manage multiple pipeline instances along with the FastAPI router APIs. ACEPipelineRunner expects a callback function that creates new Pipeline and PipelineTask instances on demand, as sketched below.
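
    As a minimal sketch of that contract (simplified from bot.py above, with the processor list elided):

      # ACEPipelineRunner invokes this coroutine once per new stream, so every
      # client connection gets its own Pipeline and PipelineTask instance.
      async def create_pipeline_task(pipeline_metadata: PipelineMetadata):
          pipeline = Pipeline([...])  # processors elided; see the full bot.py above
          return PipelineTask(pipeline)

      runner = ACEPipelineRunner(pipeline_callback=create_pipeline_task)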

    Refer to the NVIDIA Speech-to-Speech Example workflow for source code and detailed deployment steps.
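
    The RivaASRService and RivaTTSService instances above default to the NVCF-hosted endpoints through your API key. As the inline comments note, the services can also target a local Riva Skills deployment. A hedged sketch of that variant (the server parameter name and address are assumptions here; verify them against the nvidia-pipecat API reference):

      # Hypothetical: connect to a locally deployed Riva Skills server instead of NVCF.
      # The `server` parameter name and the address are assumptions; check the actual
      # RivaASRService/RivaTTSService signatures in nvidia-pipecat.
      stt = RivaASRService(server="localhost:50051")
      tts = RivaTTSService(server="localhost:50051")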

  2. Set up a Web UI. To create a simple web interface for your application, use the provided frames.proto and index.html files. Copy these files into the ../static directory, or point the FastAPI static mount in the above code at your own directory (see the sketch after this list). These files are essential for establishing the WebSocket connection and handling audio data transmission between the client and server.

    • frames.proto defines the Protocol Buffers (protobuf) message structures used for encoding and decoding audio and text data.

    • index.html provides the web interface for capturing audio from the user’s microphone and sending it to the server via WebSocket.
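
    For example, to serve the UI assets from a different location, adjust the static mount in bot.py. A minimal sketch (my_ui is a hypothetical directory next to bot.py):

      # Hypothetical: serve static assets from ./my_ui instead of ../static.
      app.mount("/static", StaticFiles(directory=os.path.join(os.path.dirname(__file__), "my_ui")), name="static")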

Running the Pipeline#

  1. Set up the environment variables. Export all required API keys so that the application can access the services it needs. For example:

    export NVIDIA_API_KEY=nvapi-…
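
    Because the pipeline services read the key with os.getenv("NVIDIA_API_KEY"), you can fail fast if it is missing. A minimal sanity check (sketch):

    # Minimal sketch: verify the key is visible to Python before starting the server.
    import os

    assert os.getenv("NVIDIA_API_KEY"), "NVIDIA_API_KEY is not set in this environment"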
    
  2. Start the application. Run the Python file to start the FastAPI application and initialize the pipeline.

    python bot.py
    

Interacting with the WebUI Client#

  1. Access the WebUI. Open your web browser and navigate to http://localhost:8100/static/index.html.

  2. Configure your browser:

    1. Grant the application permission to access your microphone and speakers to enable audio input and output.

    2. If you encounter issues, ensure that http://localhost:8100 is included in your browser’s allowed origins. In Chrome or Edge, this may require adjusting browser flags (for example, unsafely-treat-insecure-origin-as-secure).