Realtime API Reference#
Overview#
Riva Realtime Server provides a WebSocket-based API for real-time speech processing (automatic speech recognition, ASR). The API lets you stream audio data to the server and receive transcription results in real time.
Reference#
The WebSocket server provides real-time communication capabilities for transcription services. To establish a connection, clients must connect to the WebSocket endpoint with the required query parameter.
Transcription Sessions Endpoint#
The transcription sessions endpoint creates a new transcription session.
Base URL:
http://<address>:9000
Endpoint:
/v1/realtime/transcription_sessions
Method: POST
Response: Returns the initial default transcription session configuration.
{
"id": "sess_<uuid4>",
"object": "realtime.transcription_session",
"modalities": ["text"],
"input_audio_format": "pcm16",
"input_audio_transcription":
{
"language": "en-US",
"model": "conformer",
"prompt": ""
},
"input_audio_params":
{
"sample_rate_hz": 16000,
"num_channels": 1
},
"recognition_config":
{
"max_alternatives": 1,
"enable_automatic_punctuation": false,
"enable_word_time_offsets": false,
"enable_profanity_filter": false,
"enable_verbatim_transcripts": false
},
"speaker_diarization":
{
"enable_speaker_diarization": false,
"max_speaker_count": 8
},
"word_boosting":
{
"enable_word_boosting": false,
"word_boosting_list": []
},
"endpointing_config":
{
"start_history": 0,
"start_threshold": 0,
"stop_history": 0,
"stop_threshold": 0,
"stop_history_eou": 0,
"stop_threshold_eou": 0
},
"client_secret" : null
}
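For example, a session can be created with a plain HTTP POST. The following is a minimal Python sketch, assuming the server is reachable at localhost:9000 and that the endpoint accepts an empty request body:

```python
import requests

# Create a transcription session; the response body is the default
# configuration shown above.
resp = requests.post("http://localhost:9000/v1/realtime/transcription_sessions")
resp.raise_for_status()
session = resp.json()
print(session["id"], session["input_audio_transcription"]["language"])
```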
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| id | string | No | Session identifier | auto-generated ("sess_<uuid4>") |
| object | string | Yes | Object type identifier | "realtime.transcription_session" |
| modalities | array | Yes | List of supported modalities | ["text"] |
| input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| input_audio_transcription.language | string | No | Transcription language | "en-US" |
| input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
| client_secret | string | No | Client authentication secret | null |
WebSocket Connection Details#
Base URL:
ws://<address>:9000
Endpoint:
/v1/realtime
Required Query Parameter:
intent=transcription
Example Connection URL#
Here’s a complete example for connecting to the WebSocket server:
ws://localhost:9000/v1/realtime?intent=transcription
Connection Requirements#
To establish a connection, the WebSocket client must include the intent query parameter in the URL with a supported value. At present, the only valid intent is "transcription". The server listens on port 9000 by default and uses the standard WebSocket protocol (ws://). If the intent parameter is missing or invalid, the server closes the connection with WebSocket code 1008 (Policy Violation).
Usage Notes#
Clients must maintain the WebSocket connection for the entire duration of the transcription session, and should implement proper error handling and reconnection logic to ensure a robust and reliable experience.
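As a rough illustration of these notes, the sketch below connects with the websockets Python package and retries with exponential backoff. The retry policy and message handling are illustrative choices, not behavior prescribed by the server:

```python
import asyncio
import json

import websockets  # pip install websockets

URL = "ws://localhost:9000/v1/realtime?intent=transcription"

async def run_session() -> None:
    for attempt in range(5):
        try:
            async with websockets.connect(URL) as ws:
                async for message in ws:
                    event = json.loads(message)
                    print("received:", event.get("type"))
            return  # connection closed normally (code 1000)
        except (websockets.ConnectionClosedError, OSError) as exc:
            # Illustrative policy: back off and retry on abnormal closure.
            print(f"connection lost ({exc}); retrying")
            await asyncio.sleep(2 ** attempt)

asyncio.run(run_session())
```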
Health Check Endpoint#
The health check endpoint provides a way to verify the server’s operational status.
Endpoint:
/v1/health
Method: GET
Response:
{
"status": "ok"
}
Status Codes:
- 200 OK: Server is healthy and ready to accept connections
- 503 Service Unavailable: Server is not ready to accept connections

Use Cases:
- Pre-flight check before establishing WebSocket connections
- Load balancer health monitoring
- System status monitoring
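For instance, a client might gate its WebSocket connection on this endpoint. A minimal sketch, assuming the server at localhost:9000:

```python
import requests

def server_ready(base_url: str = "http://localhost:9000") -> bool:
    # Pre-flight check before opening a WebSocket connection.
    try:
        resp = requests.get(f"{base_url}/v1/health", timeout=5)
        return resp.status_code == 200 and resp.json().get("status") == "ok"
    except requests.RequestException:
        return False
```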
Events#
WebSocket events#
The realtime server uses a WebSocket-based event system for communication between clients and the server. Events are JSON messages that follow a specific format and are used to handle various operations like session management, audio processing, and transcription.
Each event has:
- A unique event_id for tracking
- A type field indicating the event type
- Additional fields specific to the event type
Events are categorized into:
- Client Events: Events sent from client to server
  - Session management (create, update)
  - Audio buffer operations (append, commit)
- Server Events: Events sent from server to client
  - Session responses (created, updated)
  - Transcription results (delta, completed, failed)
  - Error notifications
  - Status updates

The server validates all incoming events and sends appropriate error messages for:
- Invalid event formats
- Unsupported features
- Message size limits
- Server errors
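A client typically reads messages in a loop and dispatches on the type field. The handler below is a sketch; the event names match those documented in the following sections:

```python
import json

def handle_event(message: str) -> None:
    # Dispatch a server event on its "type" field.
    event = json.loads(message)
    etype = event.get("type")
    if etype == "conversation.item.input_audio_transcription.delta":
        print("partial:", event["delta"])
    elif etype == "conversation.item.input_audio_transcription.completed":
        print("final:", event["transcript"])
    elif etype == "error":
        print("error:", event["error"]["message"])
    else:
        print("unhandled event type:", etype)
```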

Client Events#
Events that can be sent from the client to the server.
List of Client Events#
| Event Type | Description |
|---|---|
| transcription_session.update | Updates the session configuration |
| input_audio_buffer.append | Sends audio data for processing |
| input_audio_buffer.commit | Commits the current audio buffer |
| input_audio_buffer.clear | Clears the audio bytes in the buffer |
| input_audio_buffer.done | Tells the server that the client is done sending audio data |
transcription_session.update#
Send this event to update a transcription session.
{
"event_id" : "event_<uuid4>",
"session":
{
"modalities": ["text"],
"input_audio_format": "pcm16",
"input_audio_transcription":
{
"language": "en-US",
"model": "conformer",
"prompt": ""
},
"input_audio_params":
{
"sample_rate_hz": 16000,
"num_channels": 1
},
"recognition_config":
{
"max_alternatives": 1,
"enable_automatic_punctuation": false,
"enable_word_time_offsets": false,
"enable_profanity_filter": false,
"enable_verbatim_transcripts": false
},
"speaker_diarization":
{
"enable_speaker_diarization": false,
"max_speaker_count": 8
},
"word_boosting":
{
"enable_word_boosting": false,
"word_boosting_list": []
},
"endpointing_config":
{
"start_history": 0,
"start_threshold": 0,
"stop_history": 0,
"stop_threshold": 0,
"stop_history_eou": 0,
"stop_threshold_eou": 0
}
},
"type": "transcription_session.update"
}
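For example, a client could enable punctuation over an already-open connection. This sketch assumes the server accepts a partial session object, with omitted fields keeping their current values; if that is not the case, send the full configuration shown above:

```python
import json
import uuid

async def enable_punctuation(ws) -> None:
    # `ws` is an open WebSocket connection to /v1/realtime.
    # Assumption: omitted session fields keep their current values.
    event = {
        "event_id": f"event_{uuid.uuid4()}",
        "type": "transcription_session.update",
        "session": {
            "modalities": ["text"],
            "input_audio_format": "pcm16",
            "recognition_config": {"enable_automatic_punctuation": True},
        },
    }
    await ws.send(json.dumps(event))
```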
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "transcription_session.update" |
| session.modalities | array | Yes | List of supported modalities | ["text"] |
| session.input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| session.input_audio_transcription.language | string | No | Transcription language | "en-US" |
| session.input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| session.input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| session.input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| session.input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| session.recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| session.recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| session.recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| session.recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| session.recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| session.speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| session.speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| session.word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| session.word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| session.endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| session.endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| session.endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| session.endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| session.endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| session.endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
input_audio_buffer.append#
Sends audio data to the server for processing.
{
"event_id": "event_0000",
"type": "input_audio_buffer.append",
"audio": "<Base64EncodedAudioData>"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.append" |
| audio | string | Yes | Base64-encoded audio data. Maximum size: 15MB | - |
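For example, a client can read PCM16 audio from a WAV file and send it in small chunks. The chunk duration below is an arbitrary choice, kept above the server's input_min_chunk_seconds default of 0.08:

```python
import base64
import json
import wave

async def stream_wav(ws, path: str, chunk_seconds: float = 0.1) -> None:
    # Read 16-bit PCM frames and send them as base64-encoded
    # input_audio_buffer.append events over an open connection `ws`.
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_seconds)
        while True:
            pcm = wav.readframes(frames_per_chunk)
            if not pcm:
                break
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(pcm).decode("ascii"),
            }))
```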
input_audio_buffer.commit#
Commits the current audio buffer for processing.
{
"event_id": "event_0000",
"type": "input_audio_buffer.commit"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.commit" |
input_audio_buffer.done#
Tells the server that the client is done sending audio data and wants to stop the inference processing. This event triggers the server to process any remaining audio chunks in the buffer and then stop the inference task.
Note
This event is mandatory when processing audio files.
{
"event_id": "event_0000",
"type": "input_audio_buffer.done"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.done" |
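A file-processing client would therefore end its stream roughly as follows. Whether a final commit is needed before done depends on how the client buffers audio, so the commit here is a cautious assumption:

```python
import json

async def finish_stream(ws) -> None:
    # Flush any uncommitted audio, then signal end of input so the
    # server processes remaining chunks and stops the inference task.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    await ws.send(json.dumps({"type": "input_audio_buffer.done"}))
```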
input_audio_buffer.clear#
Clears the current audio buffer.
{
"event_id": "event_0000",
"type": "input_audio_buffer.clear"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.clear" |
Server Events#
These are events emitted from the server to the client.
List of Server Events#
| Event Type | Description |
|---|---|
| conversation.created | Returned when a conversation is created |
| transcription_session.updated | Sent when the session configuration is updated |
| input_audio_buffer.committed | Returned when an input audio buffer is committed |
| input_audio_buffer.cleared | Returned when the input audio buffer is cleared |
| conversation.item.input_audio_transcription.delta | Sent when partial (streaming) transcription results are available |
| conversation.item.input_audio_transcription.completed | Sent when a completed transcription result is available |
| conversation.item.input_audio_transcription.failed | Sent when transcription fails |
| error | Sent when an error occurs |
conversation.created#
Returned when a conversation session is created.
{
"event_id": "event_<uuid4>",
"type": "conversation.created",
"conversation": {
"id": "conv_<uuid4>",
"object": "realtime.conversation"
}
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "conversation.created" |
| conversation.id | string | Yes | The unique ID of the conversation | auto-generated ("conv_<uuid4>") |
| conversation.object | string | Yes | Must be "realtime.conversation" | "realtime.conversation" |
transcription_session.updated#
Returned when a transcription session is updated.
{
"event_id" : "event_<uuid4>",
"session":
{
"modalities": ["text"],
"input_audio_format": "pcm16",
"input_audio_transcription":
{
"language": "en-US",
"model": "conformer",
"prompt": ""
},
"input_audio_params":
{
"sample_rate_hz": 16000,
"num_channels": 1
},
"recognition_config":
{
"max_alternatives": 1,
"enable_automatic_punctuation": false,
"enable_word_time_offsets": false,
"enable_profanity_filter": false,
"enable_verbatim_transcripts": false
},
"speaker_diarization":
{
"enable_speaker_diarization": false,
"max_speaker_count": 8
},
"word_boosting":
{
"enable_word_boosting": false,
"word_boosting_list": []
},
"endpointing_config":
{
"start_history": 0,
"start_threshold": 0,
"stop_history": 0,
"stop_threshold": 0,
"stop_history_eou": 0,
"stop_threshold_eou": 0
}
},
"type": "transcription_session.updated"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "transcription_session.updated" |
| session.modalities | array | Yes | List of supported modalities | ["text"] |
| session.input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| session.input_audio_transcription.language | string | No | Transcription language | "en-US" |
| session.input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| session.input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| session.input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| session.input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| session.recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| session.recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| session.recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| session.recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| session.recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| session.speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| session.speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| session.word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| session.word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| session.endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| session.endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| session.endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| session.endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| session.endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| session.endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
input_audio_buffer.committed#
Returned when an input audio buffer is committed. At the same time, the buffer is sent for inference.
{
"event_id" : "event_0000",
"type": "input_audio_buffer.committed",
"previous_item_id": "msg_0000",
"item_id": "msg_0001"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.committed" |
| previous_item_id | string | No | ID of the preceding item | msg_0000 |
| item_id | string | No | ID of the current item | msg_0001 |
input_audio_buffer.cleared#
Returned when the input audio buffer is cleared.
{
"event_id": "event_0000",
"type": "input_audio_buffer.cleared"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.cleared" |
conversation.item.input_audio_transcription.delta#
Returns partial (streaming) transcription text when a response is received from the gRPC server.
{
"event_id": "event_0000",
"type": "conversation.item.input_audio_transcription.delta",
"item_id": "item_001",
"content_index": 0,
"delta": "Hello"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "conversation.item.input_audio_transcription.delta" |
| item_id | string | No | Optional item identifier | item_0000 |
| content_index | integer | No | The index of the content part | 0 |
| delta | string | Yes | Transcription result in streaming mode | "" |
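Because delta events carry incremental text, a client usually accumulates them into a running transcript, for example:

```python
partial: list[str] = []

def on_delta(event: dict) -> None:
    # Each delta is a fragment; join the fragments to form the
    # current interim transcript.
    partial.append(event["delta"])
    print("interim:", "".join(partial))
```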
conversation.item.input_audio_transcription.completed#
Returns a completed transcription result when a response is received from the gRPC server.
{
"event_id": "event_0000",
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "msg_0000",
"content_index": 0,
"transcript": "Hello, how are you?",
"words_info": {
"words": [
{
"word": "Hello",
"start_time": 0.0,
"end_time": 1.0,
"confidence": 0.95,
"speaker_tag": 0
}
]
},
"vad_states": {
"vad_states": [
{
"timestamp": 0.0,
"prob": 0.5
}
]
},
"is_last_result" : false
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "conversation.item.input_audio_transcription.completed" |
| item_id | string | Yes | The ID of the item | msg_0000 |
| content_index | integer | Yes | The index of the content part | 0 |
| transcript | string | Yes | The complete transcribed text | - |
| words_info | object | No | Word-level information container | - |
| words_info.words | array | No | Array of word objects with timing and confidence | [] |
| words_info.words[].word | string | Yes | The transcribed word | - |
| words_info.words[].start_time | float | Yes | Start time of the word in seconds | - |
| words_info.words[].end_time | float | Yes | End time of the word in seconds | - |
| words_info.words[].confidence | float | Yes | Confidence score for the word (0.0-1.0) | - |
| words_info.words[].speaker_tag | integer | Yes | Speaker identifier for diarization | 0 |
| vad_states | object | No | Voice Activity Detection states container | - |
| vad_states.vad_states | array | No | Array of VAD state objects | [] |
| vad_states.vad_states[].timestamp | float | Yes | Timestamp in seconds | - |
| vad_states.vad_states[].prob | float | Yes | VAD probability (0.0-1.0) | - |
| is_last_result | boolean | No | Indicates if this is the final transcription result for the audio stream | false |
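A handler for this event might print the transcript along with word timings when those fields are present (they require the corresponding options in recognition_config), for example:

```python
def on_completed(event: dict) -> None:
    # Final transcript plus optional word-level details.
    print(event["transcript"])
    for w in event.get("words_info", {}).get("words", []):
        print(f'{w["word"]}: {w["start_time"]:.2f}-{w["end_time"]:.2f}s '
              f'(confidence {w["confidence"]:.2f}, speaker {w["speaker_tag"]})')
    if event.get("is_last_result"):
        print("-- end of audio stream --")
```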
conversation.item.input_audio_transcription.failed#
Returned when a transcription request fails.
{
"event_id": "event_0000",
"type": "conversation.item.input_audio_transcription.failed",
"item_id": "msg_0000",
"content_index": 0,
"error": {
"type": "transcription_error",
"code": "audio_unintelligible",
"message": "The audio could not be transcribed.",
"param": null
}
}
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| event_id | string | No | Optional event identifier |
| type | string | Yes | Must be "conversation.item.input_audio_transcription.failed" |
| item_id | string | Yes | The ID of the user message item |
| content_index | integer | Yes | The index of the content part |
| error.type | string | Yes | The type of error |
| error.code | string | Yes | Error code |
| error.message | string | Yes | A human-readable error message |
| error.param | string | No | Parameter related to the error, if any |
error#
Returned when an error occurs.
{
"event_id": "<auto_generated>",
"type": "error",
"error": {
"type": "invalid_request_error",
"code": "invalid_event",
"message": "The 'type' field is missing.",
"param": null
}
}
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| event_id | string | No | Optional event identifier |
| type | string | Yes | Must be "error" |
| error.type | string | Yes | The type of error |
| error.code | string | Yes | Error code |
| error.message | string | Yes | A human-readable error message |
| error.param | string | No | Parameter related to the error, if any |
Configuration#
Server Parameters#
| Parameter | Default Value | Description |
|---|---|---|
| expiration_timeout_secs | 3600 | Session expiration timeout in seconds (1 hour) |
| inactivity_timeout_secs | 60 | Inactivity timeout in seconds |
| max_connections | 100 | Maximum number of concurrent connections |
| max_message_size | 10485760 | Maximum message size in bytes (10MB) |
| input_min_chunk_seconds | 0.08 | Minimum audio chunk duration in seconds |
Error Handling#
The realtime server implements comprehensive error handling for various scenarios:
WebSocket Error Codes#
| Code | Description | Action |
|---|---|---|
| 1000 | Normal closure | Connection closed normally |
| 1008 | Policy violation | Invalid intent or unsupported operation |
| 1011 | Internal error | Server encountered an error |
| 1013 | Try again later | Server temporarily unavailable |
Common Error Scenarios#
- Invalid Intent: Connection closed with code 1008 if the intent is missing or unsupported
- Message Size Limits: Errors returned for messages exceeding the 10MB limit
- Audio Data Issues: Validation errors for malformed or unsupported audio formats
- Session Timeout: Connections closed after the inactivity timeout (60 seconds by default)
- Server Overload: Connections refused when the maximum connection count (100) is reached
Error Response Format#
All errors follow the standard error event format:
{
"event_id": "event_0000",
"type": "error",
"error": {
"type": "error_type",
"code": "error_code",
"message": "Human-readable error message",
"param": "Additional parameter if applicable"
}
}
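On the client side, a small handler that logs all of these fields makes failures easier to trace, for example:

```python
def on_error(event: dict) -> None:
    err = event["error"]
    # Log enough context to correlate the failure with the request.
    print(f'[{event.get("event_id")}] {err["type"]}/{err["code"]}: {err["message"]}')
    if err.get("param"):
        print("related parameter:", err["param"])
```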
Client Development Resources#
For building realtime WebSocket clients in Python, refer to the NVIDIA Riva Python Clients repository.
Quick Start#
git clone https://github.com/nvidia-riva/python-clients.git
cd python-clients
pip install -r requirements.txt
python scripts/asr/realtime_asr_client.py --help
See the repository for complete examples and API documentation.