Realtime API Reference#

Overview#

Riva Realtime Server provides a WebSocket-based API for real-time automatic speech recognition (ASR). The API lets you stream audio data and receive transcription results as they are produced.

Reference#

The WebSocket server provides real-time communication capabilities for transcription services. To establish a connection, clients must connect to the WebSocket endpoint with the required query parameter.

Transcription Sessions Endpoint#

The transcription sessions endpoint creates a new transcription session.

Base URL:

http://<address>:9000

Endpoint:

/v1/realtime/transcription_sessions

Method: POST

Response: Returns the initial default transcription session configuration.

{
  "id": "sess_<uuid4>",
  "object": "realtime.transcription_session",
  "modalities": ["text"],
  "input_audio_format": "pcm16",
  "input_audio_transcription":
  {
      "language": "en-US",
      "model": "conformer",
      "prompt": ""
  },
  "input_audio_params":
  {
      "sample_rate_hz": 16000,
      "num_channels": 1
  },
  "recognition_config":
  {
      "max_alternatives": 1,
      "enable_automatic_punctuation": false,
      "enable_word_time_offsets": false,
      "enable_profanity_filter": false,
      "enable_verbatim_transcripts": false
  },
  "speaker_diarization":
  {
      "enable_speaker_diarization": false,
      "max_speaker_count": 8
  },
  "word_boosting":
  {
      "enable_word_boosting": false,
      "word_boosting_list": []
  },
  "endpointing_config":
  {
      "start_history": 0,
      "start_threshold": 0,
      "stop_history": 0,
      "stop_threshold": 0,
      "stop_history_eou": 0,
      "stop_threshold_eou": 0
  },
  "client_secret" : null
}
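
As a sketch, a session could be created from Python with only the standard library (the helper names here are illustrative, not part of the API; the server is assumed to run on localhost:9000):

```python
import json
import urllib.request

def session_endpoint(address: str, port: int = 9000) -> str:
    """Build the transcription-sessions URL for a given server address."""
    return f"http://{address}:{port}/v1/realtime/transcription_sessions"

def create_transcription_session(address: str, port: int = 9000) -> dict:
    """POST to the endpoint and return the default session configuration."""
    req = urllib.request.Request(session_endpoint(address, port), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running server):
# config = create_transcription_session("localhost")
# print(config["id"], config["input_audio_format"])
```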

Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| id | string | No | Session identifier | auto-generated ("sess_") |
| object | string | Yes | Object type identifier | "realtime.transcription_session" |
| modalities | array | Yes | List of supported modalities | ["text"] |
| input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| input_audio_transcription.language | string | No | Transcription language | "en-US" |
| input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
| client_secret | string | No | Client authentication secret | null |

WebSocket Connection Details#

Base URL:

ws://<address>:9000

Endpoint:

/v1/realtime

Required Query Parameter:

intent=transcription

Example Connection URL#

Here’s a complete example for connecting to the WebSocket server:

ws://localhost:9000/v1/realtime?intent=transcription

Connection Requirements#

To establish a connection, the WebSocket client must include the intent query parameter in the URL, specifying a supported value. At present, the only valid intent is “transcription”. The server listens on port 9000 by default and uses the standard WebSocket protocol (ws://). If the intent parameter is missing or invalid, the server will close the connection and return WebSocket code 1008 (Policy Violation).
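
A small helper can build a connection URL that satisfies these requirements (the helper name is illustrative; `websockets` in the comment is a third-party package, shown here only as one possible client library):

```python
from urllib.parse import urlencode

def realtime_ws_url(address: str, port: int = 9000,
                    intent: str = "transcription") -> str:
    """Build the realtime WebSocket URL with the mandatory intent parameter."""
    query = urlencode({"intent": intent})
    return f"ws://{address}:{port}/v1/realtime?{query}"

# With the third-party `websockets` package, a connection could then be
# opened like this (not executed here):
#
#   import asyncio, websockets
#   async def main():
#       async with websockets.connect(realtime_ws_url("localhost")) as ws:
#           ...  # send and receive JSON events
#   asyncio.run(main())
```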

Usage Notes#

Clients should support WebSocket connections and maintain the connection for the entire duration of the transcription session. It is important to implement proper error handling and reconnection logic within the client to ensure a robust and reliable experience.

Health Check Endpoint#

The health check endpoint provides a way to verify the server’s operational status.

Endpoint:

/v1/health

Method: GET

Response:

{
  "status": "ok"
}

Status Codes:

  • 200 OK: Server is healthy and ready to accept connections

  • 503 Service Unavailable: Server is not ready to accept connections

Use Cases:

  • Pre-flight check before establishing WebSocket connections

  • Load balancer health monitoring

  • System status monitoring
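
A pre-flight check can be reduced to interpreting the status code and body of a `/v1/health` response; a minimal sketch (the function name is illustrative):

```python
import json

def is_healthy(status_code: int, body: str) -> bool:
    """Interpret a /v1/health response: 200 with {"status": "ok"} means
    the server is ready to accept connections; anything else means not ready."""
    if status_code != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except ValueError:
        # Malformed body: treat the server as not ready
        return False
```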

Events#

WebSocket events#

The realtime server uses a WebSocket-based event system for communication between clients and the server. Events are JSON messages that follow a specific format and are used to handle various operations like session management, audio processing, and transcription.

Each event has:

  • A unique event_id for tracking

  • A type field indicating the event type

  • Additional fields specific to the event type

Events are categorized into:

  1. Client Events: Events sent from client to server

    • Session management (create, update)

    • Audio buffer operations (append, commit)

  2. Server Events: Events sent from server to client

    • Session responses (created, updated)

    • Transcription results (delta, completed, failed)

    • Error notifications

    • Status updates

The server validates all incoming events and sends appropriate error messages for:

  • Invalid event formats

  • Unsupported features

  • Message size limits

  • Server errors
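
Since every event shares the same envelope (an `event_id` plus a `type` field), a client-side helper for building well-formed events might look like this (a sketch; the helper name is not part of the API):

```python
import uuid

def make_event(event_type: str, **fields) -> dict:
    """Build a client event envelope with an auto-generated event_id."""
    event = {"event_id": f"event_{uuid.uuid4()}", "type": event_type}
    event.update(fields)  # event-type-specific fields, e.g. audio=...
    return event
```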

Client Events#

Events that can be sent from the client to the server.

List of Client Events#

| Event Type | Description |
|---|---|
| transcription_session.update | Updates session configuration |
| input_audio_buffer.append | Sends audio data for processing |
| input_audio_buffer.commit | Commits the current audio buffer |
| input_audio_buffer.clear | Clears the audio bytes in the buffer |
| input_audio_buffer.done | Tells the server that the client is done sending audio data |

transcription_session.update#

Send this event to update a transcription session.

{
    "event_id" : "event_<uuid4>",
    "session":
    {
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription":
        {
            "language": "en-US",
            "model": "conformer",
            "prompt": ""
        },
        "input_audio_params":
        {
            "sample_rate_hz": 16000,
            "num_channels": 1
        },
        "recognition_config":
        {
            "max_alternatives": 1,
            "enable_automatic_punctuation": false,
            "enable_word_time_offsets": false,
            "enable_profanity_filter": false,
            "enable_verbatim_transcripts": false
        },
        "speaker_diarization":
        {
            "enable_speaker_diarization": false,
            "max_speaker_count": 8
        },
        "word_boosting":
        {
            "enable_word_boosting": false,
            "word_boosting_list": []
        },
        "endpointing_config":
        {
            "start_history": 0,
            "start_threshold": 0,
            "stop_history": 0,
            "stop_threshold": 0,
            "stop_history_eou": 0,
            "stop_threshold_eou": 0
        }
    },
    "type": "transcription_session.update"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "transcription_session.update" |
| session.modalities | array | Yes | List of supported modalities | ["text"] |
| session.input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| session.input_audio_transcription.language | string | No | Transcription language | "en-US" |
| session.input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| session.input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| session.input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| session.input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| session.recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| session.recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| session.recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| session.recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| session.recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| session.speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| session.speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| session.word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| session.word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| session.endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| session.endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| session.endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| session.endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| session.endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| session.endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |

input_audio_buffer.append#

Sends audio data to the server for processing.

{
  "event_id": "event_0000",
  "type": "input_audio_buffer.append",
  "audio": "<Base64EncodedAudioData>"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.append" |
| audio | string | Yes | Base64-encoded audio data. Maximum size: 15MB | - |
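
Raw PCM16 audio has to be split into chunks and Base64-encoded before it can be carried in append events. A sketch (the generator name and default chunk size are illustrative; 32000 bytes is one second of 16 kHz mono PCM16):

```python
import base64

def audio_append_events(pcm_bytes: bytes, chunk_size: int = 32000):
    """Yield input_audio_buffer.append events for raw PCM16 audio,
    one event per chunk, with the audio payload Base64-encoded."""
    for offset in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[offset:offset + chunk_size]
        yield {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
```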

input_audio_buffer.commit#

Commits the current audio buffer for processing.

{
  "event_id": "event_0000",
  "type": "input_audio_buffer.commit"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.commit" |

input_audio_buffer.done#

Tells the server that the client is done sending audio data and wants to stop the inference processing. This event triggers the server to process any remaining audio chunks in the buffer and then stop the inference task.

Note

Sending this event is mandatory when processing audio files.

{
    "event_id": "event_0000",
    "type": "input_audio_buffer.done"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.done" |

input_audio_buffer.clear#

Clears the current audio buffer.

{
  "event_id": "event_0000",
  "type": "input_audio_buffer.clear"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.clear" |
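
Putting the client events together, a one-shot file transcription boils down to a fixed event order: append each audio chunk, then signal end of audio with done (commit can additionally be sent per segment). A sketch of building that sequence (the function name is illustrative):

```python
import base64

def client_event_sequence(audio_chunks):
    """Build the ordered client events for transcribing a finite audio file:
    one append per chunk, closed by input_audio_buffer.done."""
    events = [
        {"type": "input_audio_buffer.append",
         "audio": base64.b64encode(chunk).decode("ascii")}
        for chunk in audio_chunks
    ]
    events.append({"type": "input_audio_buffer.done"})
    return events
```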

Server Events#

These are events emitted from the server to the client.

List of Server Events#

| Event Type | Description |
|---|---|
| conversation.created | Returned when a conversation is created |
| transcription_session.updated | Sent when session configuration is updated |
| input_audio_buffer.committed | Returned when an input audio buffer is committed |
| input_audio_buffer.cleared | Returned when the input audio buffer is cleared |
| conversation.item.input_audio_transcription.delta | Sent when partial transcription results are available |
| conversation.item.input_audio_transcription.completed | Sent when a complete transcription result is available |
| conversation.item.input_audio_transcription.failed | Sent when transcription fails |
| error | Sent when an error occurs |

conversation.created#

Returned when a conversation session is created.

{
    "event_id": "event_<uuid4>",
    "type": "conversation.created",
    "conversation": {
        "id": "conv_<uuid4>",
        "object": "realtime.conversation"
    }
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | auto-generated ("event_uuid4") |
| type | string | Yes | Event type | "conversation.created" |
| conversation.id | string | Yes | The unique ID of the conversation | auto-generated ("conv_uuid4") |
| conversation.object | string | Yes | Must be 'realtime.conversation' | "realtime.conversation" |

transcription_session.updated#

Returned when a transcription session is updated.

{
    "event_id" : "event_<uuid4>",
    "session":
    {
        "modalities": ["text"],
        "input_audio_format": "pcm16",
        "input_audio_transcription":
        {
            "language": "en-US",
            "model": "conformer",
            "prompt": ""
        },
        "input_audio_params":
        {
            "sample_rate_hz": 16000,
            "num_channels": 1
        },
        "recognition_config":
        {
            "max_alternatives": 1,
            "enable_automatic_punctuation": false,
            "enable_word_time_offsets": false,
            "enable_profanity_filter": false,
            "enable_verbatim_transcripts": false
        },
        "speaker_diarization":
        {
            "enable_speaker_diarization": false,
            "max_speaker_count": 8
        },
        "word_boosting":
        {
            "enable_word_boosting": false,
            "word_boosting_list": []
        },
        "endpointing_config":
        {
            "start_history": 0,
            "start_threshold": 0,
            "stop_history": 0,
            "stop_threshold": 0,
            "stop_history_eou": 0,
            "stop_threshold_eou": 0
        }
    },
    "type": "transcription_session.updated"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Event identifier | auto-generated ("event_") |
| type | string | Yes | Event type | "transcription_session.updated" |
| session.modalities | array | Yes | List of supported modalities | ["text"] |
| session.input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| session.input_audio_transcription.language | string | No | Transcription language | "en-US" |
| session.input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| session.input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| session.input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| session.input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| session.recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| session.recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| session.recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| session.recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| session.recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| session.speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| session.speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| session.word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| session.word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| session.endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| session.endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| session.endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| session.endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| session.endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| session.endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |

input_audio_buffer.committed#

Returned when an input audio buffer is committed. At the same time, the buffer is sent for inference.

{
    "event_id" : "event_0000",
    "type": "input_audio_buffer.committed",
    "previous_item_id": "msg_0000",
    "item_id": "msg_0001"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.committed" |
| previous_item_id | string | No | ID of the preceding item | msg_0000 |
| item_id | string | No | ID of the current item | msg_0001 |

input_audio_buffer.cleared#

Returned when the input audio buffer is cleared.

{
    "event_id": "event_0000",
    "type": "input_audio_buffer.cleared"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.cleared" |

conversation.item.input_audio_transcription.delta#

Returns partial transcription text as streaming responses are received from the gRPC server.

{
    "event_id": "event_0000",
    "type": "conversation.item.input_audio_transcription.delta",
    "item_id": "item_001",
    "content_index": 0,
    "delta": "Hello"
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "conversation.item.input_audio_transcription.delta" |
| item_id | string | No | Optional item identifier | item_0000 |
| content_index | integer | No | The index of the content part | 0 |
| delta | string | Yes | Transcription result in streaming mode | "" |
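
Clients typically concatenate successive delta payloads into a running transcript, keyed by item. A minimal sketch (the function name is illustrative):

```python
def accumulate_deltas(events):
    """Fold conversation.item.input_audio_transcription.delta events into a
    running transcript per item_id; other event types are ignored."""
    transcripts = {}
    for event in events:
        if event.get("type") == "conversation.item.input_audio_transcription.delta":
            item = event.get("item_id", "item_0000")
            transcripts[item] = transcripts.get(item, "") + event.get("delta", "")
    return transcripts
```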

conversation.item.input_audio_transcription.completed#

Returns the complete transcription text when a final response is received from the gRPC server.

{
    "event_id": "event_0000",
    "type": "conversation.item.input_audio_transcription.completed",
    "item_id": "msg_0000",
    "content_index": 0,
    "transcript": "Hello, how are you?",
    "words_info": {
        "words": [
            {
                "word": "Hello",
                "start_time": 0.0,
                "end_time": 1.0,
                "confidence": 0.95,
                "speaker_tag": 0
            }
        ]
    },
    "vad_states": {
        "vad_states": [
            {
                "timestamp": 0.0,
                "prob": 0.5
            }
        ]
    },
    "is_last_result" : false
}
Parameters#

| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "conversation.item.input_audio_transcription.completed" |
| item_id | string | Yes | The ID of the item | msg_0000 |
| content_index | integer | Yes | The index of the content part | 0 |
| transcript | string | Yes | The complete transcribed text | - |
| words_info | object | No | Word-level information container | - |
| words_info.words | array | No | Array of word objects with timing and confidence | [] |
| words_info.words[].word | string | Yes | The transcribed word | - |
| words_info.words[].start_time | float | Yes | Start time of the word in seconds | - |
| words_info.words[].end_time | float | Yes | End time of the word in seconds | - |
| words_info.words[].confidence | float | Yes | Confidence score for the word (0.0-1.0) | - |
| words_info.words[].speaker_tag | integer | Yes | Speaker identifier for diarization | 0 |
| vad_states | object | No | Voice Activity Detection states container | - |
| vad_states.vad_states | array | No | Array of VAD state objects | [] |
| vad_states.vad_states[].timestamp | float | Yes | Timestamp in seconds | - |
| vad_states.vad_states[].prob | float | Yes | VAD probability (0.0-1.0) | - |
| is_last_result | boolean | No | Indicates if this is the final transcription result for the audio stream | false |

conversation.item.input_audio_transcription.failed#

Returned when a transcription request fails.

{
    "event_id": "event_0000",
    "type": "conversation.item.input_audio_transcription.failed",
    "item_id": "msg_0000",
    "content_index": 0,
    "error": {
        "type": "transcription_error",
        "code": "audio_unintelligible",
        "message": "The audio could not be transcribed.",
        "param": null
    }
}
Parameters#

| Parameter | Type | Required | Description |
|---|---|---|---|
| event_id | string | No | Optional event identifier |
| type | string | Yes | Must be 'conversation.item.input_audio_transcription.failed' |
| item_id | string | Yes | The ID of the user message item |
| content_index | integer | Yes | The index of the content part |
| error.type | string | Yes | The type of error |
| error.code | string | Yes | Error code |
| error.message | string | Yes | A human-readable error message |
| error.param | string | No | Parameter related to the error, if any |

error#

Returned when an error occurs.

{
    "event_id": "<auto_generated>",
    "type": "error",
    "error": {
        "type": "invalid_request_error",
        "code": "invalid_event",
        "message": "The 'type' field is missing.",
        "param": null
    }
}
Parameters#

| Parameter | Type | Required | Description |
|---|---|---|---|
| event_id | string | No | Optional event identifier |
| type | string | Yes | Must be 'error' |
| error.type | string | Yes | The type of error |
| error.code | string | Yes | Error code |
| error.message | string | Yes | A human-readable error message |
| error.param | string | No | Parameter related to the error, if any |

Configuration#

Server Parameters#

| Parameter | Default Value | Description |
|---|---|---|
| expiration_timeout_secs | 3600 | Session expiration timeout in seconds (1 hour) |
| inactivity_timeout_secs | 60 | Inactivity timeout in seconds |
| max_connections | 100 | Maximum number of concurrent connections |
| max_message_size | 10485760 | Maximum message size in bytes (10MB) |
| input_min_chunk_seconds | 0.08 | Minimum audio chunk size for processing |
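
For pcm16 audio, `input_min_chunk_seconds` translates directly into a minimum byte count (2 bytes per sample per channel). A small helper, as a sketch (the function name is illustrative):

```python
def min_chunk_bytes(min_chunk_seconds: float = 0.08,
                    sample_rate_hz: int = 16000,
                    num_channels: int = 1) -> int:
    """Bytes in the smallest processable chunk for pcm16 audio:
    seconds * samples/s * channels * 2 bytes per sample."""
    return round(min_chunk_seconds * sample_rate_hz * num_channels * 2)

# 0.08 s of 16 kHz mono pcm16 is 2560 bytes
```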

Error Handling#

The realtime server implements comprehensive error handling for various scenarios:

WebSocket Error Codes#

| Code | Description | Action |
|---|---|---|
| 1000 | Normal closure | Connection closed normally |
| 1008 | Policy violation | Invalid intent or unsupported operation |
| 1011 | Internal error | Server encountered an error |
| 1013 | Try again later | Server temporarily unavailable |
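
These codes suggest a simple client-side retry policy: only 1011 and 1013 indicate transient conditions worth retrying, while 1000 and 1008 should not trigger a reconnect. A sketch (the mapping and policy here are illustrative, not part of the API):

```python
# Close codes the server may use, per the table above.
WS_CLOSE_CODES = {
    1000: "Normal closure",
    1008: "Policy violation (invalid intent or unsupported operation)",
    1011: "Internal error",
    1013: "Try again later (server temporarily unavailable)",
}

def should_retry(close_code: int) -> bool:
    """Retry only on transient server-side conditions."""
    return close_code in (1011, 1013)
```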

Common Error Scenarios#

  1. Invalid Intent: Connection closed with code 1008 if intent is missing or unsupported

  2. Message Size Limits: Errors are returned for messages exceeding the 10MB limit

  3. Audio Data Issues: Validation errors for malformed or unsupported audio formats

  4. Session Timeout: Connections closed after inactivity timeout (60 seconds default)

  5. Server Overload: Connection refused when max connections (100) is reached

Error Response Format#

All errors follow the standard error event format:

{
    "event_id": "event_0000",
    "type": "error",
    "error": {
        "type": "error_type",
        "code": "error_code",
        "message": "Human-readable error message",
        "param": "Additional parameter if applicable"
    }
}

Client Development Resources#

For building realtime WebSocket clients in Python, refer to the NVIDIA Riva Python Clients repository.

Quick Start#

git clone https://github.com/nvidia-riva/python-clients.git
cd python-clients
pip install -r requirements.txt
python scripts/asr/realtime_asr_client.py --help

See the repository for complete examples and API documentation.