Realtime API Reference#
Overview#
Riva Realtime Server provides a WebSocket-based API for real-time speech processing (automatic speech recognition, ASR). The API lets you stream audio data to the server and receive transcription results in real time.
Reference#
The WebSocket server provides real-time communication capabilities for transcription services. To establish a connection, clients must connect to the WebSocket endpoint with the required query parameter.
Transcription Sessions Endpoint#
The transcription sessions endpoint creates a new transcription session.
Base URL:
http://<address>:9000
Endpoint:
/v1/realtime/transcription_sessions
Method: POST
Response: Returns the initial default transcription session configuration.
{
"id": "sess_<uuid4>",
"object": "realtime.transcription_session",
"modalities": ["text"],
"input_audio_format": "pcm16",
"input_audio_transcription":
{
"language": "en-US",
"model": "conformer",
"prompt": ""
},
"input_audio_params":
{
"sample_rate_hz": 16000,
"num_channels": 1
},
"recognition_config":
{
"max_alternatives": 1,
"enable_automatic_punctuation": false,
"enable_word_time_offsets": false,
"enable_profanity_filter": false,
"enable_verbatim_transcripts": false
},
"speaker_diarization":
{
"enable_speaker_diarization": false,
"max_speaker_count": 8
},
"word_boosting":
{
"enable_word_boosting": false,
"word_boosting_list": []
},
"endpointing_config":
{
"start_history": 0,
"start_threshold": 0,
"stop_history": 0,
"stop_threshold": 0,
"stop_history_eou": 0,
"stop_threshold_eou": 0
},
"client_secret" : null
}
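For example, a session can be created with a plain HTTP POST. The following is a minimal Python sketch, assuming the server is reachable at localhost:9000 and that the endpoint accepts an empty request body:

```python
import requests

# Create a transcription session; the response body is the default
# configuration shown above.
resp = requests.post("http://localhost:9000/v1/realtime/transcription_sessions")
resp.raise_for_status()
session = resp.json()
print(session["id"], session["input_audio_transcription"]["language"])
```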
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| id | string | No | Session identifier | auto-generated ("sess_<uuid4>") |
| object | string | Yes | Object type identifier | "realtime.transcription_session" |
| modalities | array | Yes | List of supported modalities | ["text"] |
| input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| input_audio_transcription.language | string | No | Transcription language | "en-US" |
| input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
| client_secret | string | No | Client authentication secret | null |
WebSocket Connection Details#
Base URL:
ws://<address>:9000
Endpoint:
/v1/realtime
Required Query Parameter:
intent=transcription
Example Connection URL#
Here’s a complete example for connecting to the WebSocket server:
ws://localhost:9000/v1/realtime?intent=transcription
Connection Requirements#
To establish a connection, the WebSocket client must include the intent query parameter in the URL with a supported value. At present, the only valid intent is "transcription". The server listens on port 9000 by default and uses the standard WebSocket protocol (ws://). If the intent parameter is missing or invalid, the server closes the connection with WebSocket code 1008 (Policy Violation).
Usage Notes#
Clients must maintain the WebSocket connection for the entire duration of the transcription session, and should implement proper error handling and reconnection logic to ensure a robust and reliable experience.
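As a rough illustration of these notes, the sketch below connects with the websockets Python package and retries with exponential backoff. The retry policy and message handling are illustrative choices, not behavior prescribed by the server:

```python
import asyncio
import json

import websockets  # pip install websockets

URL = "ws://localhost:9000/v1/realtime?intent=transcription"

async def run_session() -> None:
    for attempt in range(5):
        try:
            async with websockets.connect(URL) as ws:
                async for message in ws:
                    event = json.loads(message)
                    print("received:", event.get("type"))
            return  # connection closed normally (code 1000)
        except (websockets.ConnectionClosedError, OSError) as exc:
            # Illustrative policy: back off and retry on abnormal closure.
            print(f"connection lost ({exc}); retrying")
            await asyncio.sleep(2 ** attempt)

asyncio.run(run_session())
```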
Health Check Endpoint#
The health check endpoint provides a way to verify the server’s operational status.
Endpoint:
/v1/health
Method: GET
Response:
{
"status": "ok"
}
Status Codes:
- 200 OK: Server is healthy and ready to accept connections
- 503 Service Unavailable: Server is not ready to accept connections

Use Cases:
- Pre-flight check before establishing WebSocket connections
- Load balancer health monitoring
- System status monitoring
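For instance, a client might gate its WebSocket connection on this endpoint. A minimal sketch, assuming the server at localhost:9000:

```python
import requests

def server_ready(base_url: str = "http://localhost:9000") -> bool:
    # Pre-flight check before opening a WebSocket connection.
    try:
        resp = requests.get(f"{base_url}/v1/health", timeout=5)
        return resp.status_code == 200 and resp.json().get("status") == "ok"
    except requests.RequestException:
        return False
```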
Events#
WebSocket events#
The realtime server uses a WebSocket-based event system for communication between clients and the server. Events are JSON messages that follow a specific format and are used to handle various operations like session management, audio processing, and transcription.
Each event has:
- A unique event_id for tracking
- A type field indicating the event type
- Additional fields specific to the event type
Events are categorized into:
- Client Events: Events sent from client to server
  - Session management (create, update)
  - Audio buffer operations (append, commit)
- Server Events: Events sent from server to client
  - Session responses (created, updated)
  - Transcription results (delta, completed, failed)
  - Error notifications
  - Status updates

The server validates all incoming events and sends appropriate error messages for:
- Invalid event formats
- Unsupported features
- Message size limits
- Server errors
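A client typically reads messages in a loop and dispatches on the type field. The handler below is a sketch; the event names match those documented in the following sections:

```python
import json

def handle_event(message: str) -> None:
    # Dispatch a server event on its "type" field.
    event = json.loads(message)
    etype = event.get("type")
    if etype == "conversation.item.input_audio_transcription.delta":
        print("partial:", event["delta"])
    elif etype == "conversation.item.input_audio_transcription.completed":
        print("final:", event["transcript"])
    elif etype == "error":
        print("error:", event["error"]["message"])
    else:
        print("unhandled event type:", etype)
```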

Client Events#
Events that can be sent from the client to the server.
List of Client Events#
| Event Type | Description |
|---|---|
| transcription_session.update | Updates the session configuration |
| input_audio_buffer.append | Sends audio data for processing |
| input_audio_buffer.commit | Commits the current audio buffer |
| input_audio_buffer.clear | Clears the audio bytes in the buffer |
| input_audio_buffer.done | Tells the server that the client is done sending audio data |
transcription_session.update#
Send this event to update a transcription session.
{
"event_id" : "event_<uuid4>",
"session":
{
"modalities": ["text"],
"input_audio_format": "pcm16",
"input_audio_transcription":
{
"language": "en-US",
"model": "conformer",
"prompt": ""
},
"input_audio_params":
{
"sample_rate_hz": 16000,
"num_channels": 1
},
"recognition_config":
{
"max_alternatives": 1,
"enable_automatic_punctuation": false,
"enable_word_time_offsets": false,
"enable_profanity_filter": false,
"enable_verbatim_transcripts": false
},
"speaker_diarization":
{
"enable_speaker_diarization": false,
"max_speaker_count": 8
},
"word_boosting":
{
"enable_word_boosting": false,
"word_boosting_list": []
},
"endpointing_config":
{
"start_history": 0,
"start_threshold": 0,
"stop_history": 0,
"stop_threshold": 0,
"stop_history_eou": 0,
"stop_threshold_eou": 0
}
},
"type": "transcription_session.update"
}
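For example, a client could enable punctuation over an already-open connection. This sketch assumes the server accepts a partial session object, with omitted fields keeping their current values; if that is not the case, send the full configuration shown above:

```python
import json
import uuid

async def enable_punctuation(ws) -> None:
    # `ws` is an open WebSocket connection to /v1/realtime.
    # Assumption: omitted session fields keep their current values.
    event = {
        "event_id": f"event_{uuid.uuid4()}",
        "type": "transcription_session.update",
        "session": {
            "modalities": ["text"],
            "input_audio_format": "pcm16",
            "recognition_config": {"enable_automatic_punctuation": True},
        },
    }
    await ws.send(json.dumps(event))
```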
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "transcription_session.update" |
| session.modalities | array | Yes | List of supported modalities | ["text"] |
| session.input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| session.input_audio_transcription.language | string | No | Transcription language | "en-US" |
| session.input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| session.input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| session.input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| session.input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| session.recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| session.recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| session.recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| session.recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| session.recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| session.speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| session.speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| session.word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| session.word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| session.endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| session.endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| session.endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| session.endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| session.endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| session.endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
input_audio_buffer.append#
Sends audio data to the server for processing.
{
"event_id": "event_0000",
"type": "input_audio_buffer.append",
"audio": "<Base64EncodedAudioData>"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.append" |
| audio | string | Yes | Base64-encoded audio data. Maximum size: 15MB | - |
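For example, a client can read PCM16 audio from a WAV file and send it in small chunks. The chunk duration below is an arbitrary choice, kept above the server's input_min_chunk_seconds default of 0.08:

```python
import base64
import json
import wave

async def stream_wav(ws, path: str, chunk_seconds: float = 0.1) -> None:
    # Read 16-bit PCM frames and send them as base64-encoded
    # input_audio_buffer.append events over an open connection `ws`.
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_seconds)
        while True:
            pcm = wav.readframes(frames_per_chunk)
            if not pcm:
                break
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(pcm).decode("ascii"),
            }))
```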
input_audio_buffer.commit#
Commits the current audio buffer for processing.
{
"event_id": "event_0000",
"type": "input_audio_buffer.commit"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.commit" |
input_audio_buffer.done#
Tells the server that the client is done sending audio data and wants to stop the inference processing. This event triggers the server to process any remaining audio chunks in the buffer and then stop the inference task.
Note
This event is mandatory when processing audio files.
{
"event_id": "event_0000",
"type": "input_audio_buffer.done"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.done" |
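A file-processing client would therefore end its stream roughly as follows. Whether a final commit is needed before done depends on how the client buffers audio, so the commit here is a cautious assumption:

```python
import json

async def finish_stream(ws) -> None:
    # Flush any uncommitted audio, then signal end of input so the
    # server processes remaining chunks and stops the inference task.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    await ws.send(json.dumps({"type": "input_audio_buffer.done"}))
```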
input_audio_buffer.clear#
Clears the current audio buffer.
{
"event_id": "event_0000",
"type": "input_audio_buffer.clear"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.clear" |
Server Events#
These are events emitted from the server to the client.
List of Server Events#
| Event Type | Description |
|---|---|
| conversation.created | Returned when a conversation is created |
| transcription_session.updated | Sent when the session configuration is updated |
| input_audio_buffer.committed | Returned when an input audio buffer is committed |
| input_audio_buffer.cleared | Returned when the input audio buffer is cleared |
| conversation.item.input_audio_transcription.delta | Sent when partial (streaming) transcription results are available |
| conversation.item.input_audio_transcription.completed | Sent when a completed transcription result is available |
| conversation.item.input_audio_transcription.failed | Sent when transcription fails |
| error | Sent when an error occurs |
conversation.created#
Returned when a conversation session is created.
{
"event_id": "event_<uuid4>",
"type": "conversation.created",
"conversation": {
"id": "conv_<uuid4>",
"object": "realtime.conversation"
}
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "conversation.created" |
| conversation.id | string | Yes | The unique ID of the conversation | auto-generated ("conv_<uuid4>") |
| conversation.object | string | Yes | Must be "realtime.conversation" | "realtime.conversation" |
transcription_session.updated#
Returned when a transcription session is updated.
{
"event_id" : "event_<uuid4>",
"session":
{
"modalities": ["text"],
"input_audio_format": "pcm16",
"input_audio_transcription":
{
"language": "en-US",
"model": "conformer",
"prompt": ""
},
"input_audio_params":
{
"sample_rate_hz": 16000,
"num_channels": 1
},
"recognition_config":
{
"max_alternatives": 1,
"enable_automatic_punctuation": false,
"enable_word_time_offsets": false,
"enable_profanity_filter": false,
"enable_verbatim_transcripts": false
},
"speaker_diarization":
{
"enable_speaker_diarization": false,
"max_speaker_count": 8
},
"word_boosting":
{
"enable_word_boosting": false,
"word_boosting_list": []
},
"endpointing_config":
{
"start_history": 0,
"start_threshold": 0,
"stop_history": 0,
"stop_threshold": 0,
"stop_history_eou": 0,
"stop_threshold_eou": 0
}
},
"type": "transcription_session.updated"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Event identifier | auto-generated ("event_<uuid4>") |
| type | string | Yes | Event type | "transcription_session.updated" |
| session.modalities | array | Yes | List of supported modalities | ["text"] |
| session.input_audio_format | string | Yes | Audio format. Currently only "pcm16" is supported | "pcm16" |
| session.input_audio_transcription.language | string | No | Transcription language | "en-US" |
| session.input_audio_transcription.model | string | No | ASR model to use | "conformer" |
| session.input_audio_transcription.prompt | string | No | Optional prompt for transcription | "" |
| session.input_audio_params.sample_rate_hz | integer | No | Audio sample rate in Hz | 16000 |
| session.input_audio_params.num_channels | integer | No | Number of audio channels | 1 |
| session.recognition_config.max_alternatives | integer | No | Maximum number of recognition alternatives | 1 |
| session.recognition_config.enable_automatic_punctuation | boolean | No | Enable automatic punctuation | false |
| session.recognition_config.enable_word_time_offsets | boolean | No | Enable word-level timing information | false |
| session.recognition_config.enable_profanity_filter | boolean | No | Enable profanity filtering | false |
| session.recognition_config.enable_verbatim_transcripts | boolean | No | Enable verbatim transcription | false |
| session.speaker_diarization.enable_speaker_diarization | boolean | No | Enable speaker diarization | false |
| session.speaker_diarization.max_speaker_count | integer | No | Maximum number of speakers to detect | 8 |
| session.word_boosting.enable_word_boosting | boolean | No | Enable word boosting | false |
| session.word_boosting.word_boosting_list | array | No | List of words to boost | [] |
| session.endpointing_config.start_history | integer | No | Start history for endpointing | 0 |
| session.endpointing_config.start_threshold | integer | No | Start threshold for endpointing | 0 |
| session.endpointing_config.stop_history | integer | No | Stop history for endpointing | 0 |
| session.endpointing_config.stop_threshold | integer | No | Stop threshold for endpointing | 0 |
| session.endpointing_config.stop_history_eou | integer | No | Stop history for end-of-utterance | 0 |
| session.endpointing_config.stop_threshold_eou | integer | No | Stop threshold for end-of-utterance | 0 |
input_audio_buffer.committed#
Returned when an input audio buffer is committed. At the same time, the buffer is sent for inference.
{
"event_id" : "event_0000",
"type": "input_audio_buffer.committed",
"previous_item_id": "msg_0000",
"item_id": "msg_0001"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.committed" |
| previous_item_id | string | No | ID of the preceding item | msg_0000 |
| item_id | string | No | ID of the current item | msg_0001 |
input_audio_buffer.cleared#
Returned when the input audio buffer is cleared.
{
"event_id": "event_0000",
"type": "input_audio_buffer.cleared"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "input_audio_buffer.cleared" |
conversation.item.input_audio_transcription.delta#
Returns partial (streaming) transcription text when a response is received from the gRPC server.
{
"event_id": "event_0000",
"type": "conversation.item.input_audio_transcription.delta",
"item_id": "item_001",
"content_index": 0,
"delta": "Hello"
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "conversation.item.input_audio_transcription.delta" |
| item_id | string | No | Optional item identifier | item_0000 |
| content_index | integer | No | The index of the content part | 0 |
| delta | string | Yes | Transcription result in streaming mode | "" |
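Because delta events carry incremental text, a client usually accumulates them into a running transcript, for example:

```python
partial: list[str] = []

def on_delta(event: dict) -> None:
    # Each delta is a fragment; join the fragments to form the
    # current interim transcript.
    partial.append(event["delta"])
    print("interim:", "".join(partial))
```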
conversation.item.input_audio_transcription.completed#
Returns a completed transcription result when a response is received from the gRPC server.
{
"event_id": "event_0000",
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "msg_0000",
"content_index": 0,
"transcript": "Hello, how are you?",
"words_info": {
"words": [
{
"word": "Hello",
"start_time": 0.0,
"end_time": 1.0,
"confidence": 0.95,
"speaker_tag": 0
}
]
},
"vad_states": {
"vad_states": [
{
"timestamp": 0.0,
"prob": 0.5
}
]
},
"is_last_result" : false
}
Parameters#
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| event_id | string | No | Optional event identifier | event_0000 |
| type | string | Yes | Event type | "conversation.item.input_audio_transcription.completed" |
| item_id | string | Yes | The ID of the item | msg_0000 |
| content_index | integer | Yes | The index of the content part | 0 |
| transcript | string | Yes | The complete transcribed text | - |
| words_info | object | No | Word-level information container | - |
| words_info.words | array | No | Array of word objects with timing and confidence | [] |
| words_info.words[].word | string | Yes | The transcribed word | - |
| words_info.words[].start_time | float | Yes | Start time of the word in seconds | - |
| words_info.words[].end_time | float | Yes | End time of the word in seconds | - |
| words_info.words[].confidence | float | Yes | Confidence score for the word (0.0-1.0) | - |
| words_info.words[].speaker_tag | integer | Yes | Speaker identifier for diarization | 0 |
| vad_states | object | No | Voice Activity Detection states container | - |
| vad_states.vad_states | array | No | Array of VAD state objects | [] |
| vad_states.vad_states[].timestamp | float | Yes | Timestamp in seconds | - |
| vad_states.vad_states[].prob | float | Yes | VAD probability (0.0-1.0) | - |
| is_last_result | boolean | No | Indicates if this is the final transcription result for the audio stream | false |
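A handler for this event might print the transcript along with word timings when those fields are present (they require the corresponding options in recognition_config), for example:

```python
def on_completed(event: dict) -> None:
    # Final transcript plus optional word-level details.
    print(event["transcript"])
    for w in event.get("words_info", {}).get("words", []):
        print(f'{w["word"]}: {w["start_time"]:.2f}-{w["end_time"]:.2f}s '
              f'(confidence {w["confidence"]:.2f}, speaker {w["speaker_tag"]})')
    if event.get("is_last_result"):
        print("-- end of audio stream --")
```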
conversation.item.input_audio_transcription.failed#
Returned when a transcription request fails.
{
"event_id": "event_0000",
"type": "conversation.item.input_audio_transcription.failed",
"item_id": "msg_0000",
"content_index": 0,
"error": {
"type": "transcription_error",
"code": "audio_unintelligible",
"message": "The audio could not be transcribed.",
"param": null
}
}
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| event_id | string | No | Optional event identifier |
| type | string | Yes | Must be "conversation.item.input_audio_transcription.failed" |
| item_id | string | Yes | The ID of the user message item |
| content_index | integer | Yes | The index of the content part |
| error.type | string | Yes | The type of error |
| error.code | string | Yes | Error code |
| error.message | string | Yes | A human-readable error message |
| error.param | string | No | Parameter related to the error, if any |
error#
Returned when an error occurs.
{
"event_id": "<auto_generated>",
"type": "error",
"error": {
"type": "invalid_request_error",
"code": "invalid_event",
"message": "The 'type' field is missing.",
"param": null
}
}
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| event_id | string | No | Optional event identifier |
| type | string | Yes | Must be "error" |
| error.type | string | Yes | The type of error |
| error.code | string | Yes | Error code |
| error.message | string | Yes | A human-readable error message |
| error.param | string | No | Parameter related to the error, if any |
Configuration#
Server Parameters#
| Parameter | Default Value | Description |
|---|---|---|
| expiration_timeout_secs | 3600 | Session expiration timeout in seconds (1 hour) |
| inactivity_timeout_secs | 60 | Inactivity timeout in seconds |
| max_connections | 100 | Maximum number of concurrent connections |
| max_message_size | 10485760 | Maximum message size in bytes (10MB) |
| input_min_chunk_seconds | 0.08 | Minimum audio chunk duration in seconds |
Error Handling#
The realtime server implements comprehensive error handling for various scenarios:
WebSocket Error Codes#
| Code | Description | Action |
|---|---|---|
| 1000 | Normal closure | Connection closed normally |
| 1008 | Policy violation | Invalid intent or unsupported operation |
| 1011 | Internal error | Server encountered an error |
| 1013 | Try again later | Server temporarily unavailable |
Common Error Scenarios#
- Invalid Intent: Connection closed with code 1008 if the intent is missing or unsupported
- Message Size Limits: Errors returned for messages exceeding the 10MB limit
- Audio Data Issues: Validation errors for malformed or unsupported audio formats
- Session Timeout: Connections closed after the inactivity timeout (60 seconds by default)
- Server Overload: Connections refused when the maximum connection count (100) is reached
Error Response Format#
All errors follow the standard error event format:
{
"event_id": "event_0000",
"type": "error",
"error": {
"type": "error_type",
"code": "error_code",
"message": "Human-readable error message",
"param": "Additional parameter if applicable"
}
}
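On the client side, a small handler that logs all of these fields makes failures easier to trace, for example:

```python
def on_error(event: dict) -> None:
    err = event["error"]
    # Log enough context to correlate the failure with the request.
    print(f'[{event.get("event_id")}] {err["type"]}/{err["code"]}: {err["message"]}')
    if err.get("param"):
        print("related parameter:", err["param"])
```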
Client Development Resources#
For building realtime WebSocket clients in Python, refer to the NVIDIA Riva Python Clients repository.
Quick Start#
git clone https://github.com/nvidia-riva/python-clients.git
cd python-clients
pip install -r requirements.txt
python scripts/asr/realtime_asr_client.py --help
See the repository for complete examples and API documentation.