Overview
The Speech-to-Text (STT) API converts audio files into text transcriptions. VoxNexus supports both REST API and WebSocket API for STT operations, with features like timestamps and speaker diarization.
REST API
The REST API endpoint /v1/stt processes complete audio files and returns full transcription results.
Basic Usage
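The following is a minimal sketch of a basic request, not a definitive client. It assumes that the REST API is served from https://api.voxnexus.ai (the host used by the WebSocket URL), that the endpoint accepts a POST with bearer-token authentication, and that the model, language, and sample-rate query parameters are named model, language, and sample_rate. None of these details are confirmed on this page, so check them against the API reference before use.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the REST API shares the host of the WebSocket endpoint.
BASE_URL = "https://api.voxnexus.ai"


def build_stt_url(model: str, sample_rate: int, language: str = "en-US",
                  base_url: str = BASE_URL) -> str:
    """Build the /v1/stt URL. All three query parameter names are assumptions."""
    query = urllib.parse.urlencode({
        "model": model,              # hypothetical parameter name
        "language": language,        # hypothetical parameter name
        "sample_rate": sample_rate,  # hypothetical parameter name
    })
    return f"{base_url}/v1/stt?{query}"


def transcribe_wav(path: str, api_key: str, model: str, sample_rate: int) -> dict:
    """Send a WAV file as the request body and return the decoded JSON response."""
    with open(path, "rb") as audio_file:
        audio = audio_file.read()
    request = urllib.request.Request(
        build_stt_url(model, sample_rate),
        data=audio,
        method="POST",  # assumption: the endpoint accepts POST
        headers={
            "Content-Type": "audio/wav",
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the URL construction in its own function makes it possible to check the query string without sending any audio.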
Query Parameters
- Model identifier. Specifies which model to use for STT. Use the /v1/models endpoint to browse available models.
- Language or locale code. Supports both ISO 639-1 language codes (e.g., en, zh) and BCP 47 locale codes (e.g., en-US, zh-CN). When a language code is provided, the system automatically resolves it to the most common locale (e.g., en → en-US). Default: en-US. Optional but recommended for better recognition accuracy.
- Sample rate in Hz. Required parameter. Common values: 16000 for telephony, 44100 for high-quality audio.
- Whether to return word-level timestamps. Default: false.
- Whether to enable speaker diarization (identify different speakers). Default: false.
- Whether to enable LLM post-processing on the STT transcript. Default: false. When enabled, the recognized transcript is passed to the LLM with the specified llm_prompt, and the transformed result is returned in the text field. If the LLM transform fails, text falls back to the raw transcript.
- LLM transform instruction. Required when enable_llm_transform=true. Describes what the LLM should do with the transcript, e.g. "correct punctuation", "translate to English", "summarize key points".
- LLM model ID to use for the transform. Optional; falls back to the server-configured default when not specified. Routing is prefix-based: claude-* models route to Anthropic; all other model IDs route to OpenAI. Examples: claude-haiku-4-5-20251001 (Anthropic), gpt-4.1 (OpenAI).
- Maximum output tokens for the LLM transform. Optional. Range: 1 to 4096.
Request Body
The request body should contain the audio file in one of the supported formats:
- audio/wav
- audio/mpeg (MP3)
- audio/pcm
- application/octet-stream
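One way to choose the Content-Type header for an upload is to map the file extension to one of the supported formats. The sketch below does that; the extension mapping (in particular .pcm for raw PCM) is an assumption made for illustration, and anything unrecognized falls back to application/octet-stream.

```python
from pathlib import Path

# Assumed mapping from file extension to the supported content types.
CONTENT_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".pcm": "audio/pcm",
}


def content_type_for(path: str) -> str:
    """Return the content type for an audio file, defaulting to a generic type."""
    suffix = Path(path).suffix.lower()
    return CONTENT_TYPES.get(suffix, "application/octet-stream")
```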
Response
- Unique identifier for this request.
- Detected or specified language code (e.g., en, en-US, zh). May not be present if language detection is not enabled.
- Raw STT recognition output (original ASR text before any LLM processing). Always present; identical to text when LLM transform is not enabled.
- Final output text. When LLM transform is enabled and succeeds, this contains the LLM-transformed result; otherwise it equals transcript.
- Audio duration in milliseconds.
- Word-level information array. Only present if enable_timestamps is true. Each item contains:
  - word: The recognized word
  - offset: Start time in milliseconds
  - duration: Duration in milliseconds
  - confidence: Confidence score (0.0-1.0)
- Speaker information array. Only present if enable_speaker_diarization is true. Each item contains:
  - speaker_id: Unique speaker identifier
  - text: Text spoken by this speaker
  - offset: Start time in milliseconds
  - duration: Duration in milliseconds
- Timestamp when the transcription was created (ISO 8601 format).
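The response fields can be consumed along the following lines. This is a sketch that relies only on the item keys listed above (word, offset, duration, confidence, speaker_id, text) plus the top-level text and transcript fields. The helpers take the word and speaker arrays directly because the names of the top-level keys that hold them are not given on this page. The sample values in the usage note are made up.

```python
from typing import Dict, List


def llm_changed_text(response: Dict) -> bool:
    """True when the final text differs from the raw transcript."""
    return response["text"] != response["transcript"]


def format_word(item: Dict) -> str:
    """Render one word-level item with start and end times in seconds."""
    start = item["offset"] / 1000
    end = (item["offset"] + item["duration"]) / 1000
    return f"[{start:.2f}s-{end:.2f}s] {item['word']} ({item['confidence']:.2f})"


def format_speaker_turns(items: List[Dict]) -> List[str]:
    """Render speaker items as 'speaker @ start: text' lines, in time order."""
    ordered = sorted(items, key=lambda item: item["offset"])
    return [
        f"{item['speaker_id']} @ {item['offset'] / 1000:.1f}s: {item['text']}"
        for item in ordered
    ]
```

For example, format_word({"word": "hello", "offset": 1200, "duration": 300, "confidence": 0.98}) returns "[1.20s-1.50s] hello (0.98)".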
Response Headers
- X-Request-ID: Request identifier
- X-Language: Detected language code
- X-Duration-Ms: Audio duration in milliseconds
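These headers can be read without parsing the response body, which is handy for logging. The helper below is a small sketch that works with any mapping-like headers object exposing get(); it assumes X-Duration-Ms carries an integer value, which this page does not state explicitly.

```python
from typing import Dict, Optional


def read_stt_headers(headers) -> Dict[str, Optional[object]]:
    """Extract the documented STT response headers into a plain dict."""
    duration = headers.get("X-Duration-Ms")
    return {
        "request_id": headers.get("X-Request-ID"),
        "language": headers.get("X-Language"),
        # Assumption: the duration header holds an integer number of milliseconds.
        "duration_ms": int(duration) if duration is not None else None,
    }
```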
WebSocket API
The WebSocket API provides real-time speech recognition, ideal for live transcription scenarios.
Connection
Connect to wss://api.voxnexus.ai/v1/stt/realtime with authentication.
Message Flow
- Initialize: Send an init message with recognition parameters
- Send Audio: Continuously send audio messages with Base64-encoded audio chunks
- Receive Results: Receive transcript messages (is_final: false for interim, is_final: true for complete sentences)
- Handle Errors: Monitor for error messages
Initialization Message
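A sketch of building and validating an init message from the parameters described in this section. The names enable_llm_transform, llm_prompt, and llm_mode appear on this page, and enable_timestamps is the name used by the REST API. The remaining field names (type, model, language, format, sample_rate) are assumptions chosen for illustration, and "example-model" in the usage note is a placeholder rather than a real model ID.

```python
import json
from typing import Optional

LLM_MODES = ("per_segment", "post_flush")


def build_init_message(model: str, language: Optional[str] = None,
                       enable_timestamps: bool = False,
                       enable_llm_transform: bool = False,
                       llm_prompt: Optional[str] = None,
                       llm_mode: str = "per_segment") -> str:
    """Return the init message as a JSON string, enforcing the documented rules."""
    if enable_llm_transform and not llm_prompt:
        raise ValueError("llm_prompt is required when enable_llm_transform is true")
    if llm_mode not in LLM_MODES:
        raise ValueError(f"llm_mode must be one of {LLM_MODES}")
    message = {
        "type": "init",        # assumed field name for the message type
        "model": model,        # assumed field name
        "format": "pcm",       # assumed field name; pcm is the only supported format
        "sample_rate": 16000,  # assumed field name; 16000 is the only supported rate
        "enable_timestamps": enable_timestamps,
        "enable_llm_transform": enable_llm_transform,
    }
    if language is not None:
        message["language"] = language  # assumed field name
    if enable_llm_transform:
        message["llm_prompt"] = llm_prompt
        message["llm_mode"] = llm_mode
    return json.dumps(message)
```

Calling build_init_message("example-model", language="en") produces a message with LLM post-processing disabled, while omitting llm_prompt with enable_llm_transform=True raises a ValueError before anything is sent.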
- Message type. Must be init.
- Model identifier. Specifies which model to use for STT. Use the /v1/models endpoint to browse available models.
- Language or locale code. Supports both ISO 639-1 language codes (e.g., en, zh) and BCP 47 locale codes (e.g., en-US, zh-CN). When a language code is provided, the system automatically resolves it to the most common locale (e.g., en → en-US). Optional but recommended for better accuracy.
- Audio format. Only pcm is supported.
- Sample rate in Hz. Only 16000 is supported.
- Whether to return word-level timestamps. Default: false.
- Whether to enable LLM post-processing on each final transcript. Default: false. When enabled, the server sends llm messages alongside transcript messages.
- LLM transform instruction. Required when enable_llm_transform=true. E.g. "correct punctuation", "translate to English", "rewrite humorously".
- LLM model ID. Optional; falls back to the server default when not specified. Routing is prefix-based: claude-* models route to Anthropic; all other model IDs route to OpenAI. Examples: claude-haiku-4-5-20251001 (Anthropic), gpt-4.1 (OpenAI).
- Maximum output tokens for the LLM transform. Optional. Range: 1 to 4096.
- LLM processing mode. Default: per_segment.
  - per_segment: LLM runs on each final sentence as it arrives (low latency, real-time).
  - post_flush: LLM runs once on the full accumulated text after flush (full context).
- When llm_mode=per_segment, also run a full-text LLM pass after flush. Default: false. Produces both per-sentence llm messages (with segment_id) and a final full-text llm message (without segment_id) after flush completes.
- Separate prompt for the post-flush full-text pass. Falls back to llm_prompt when not specified.
Audio Message
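A sketch of turning raw PCM into audio messages with Base64-encoded chunks. The field names type and data are assumptions, since the exact message shape is not shown here. The chunk size assumes 16-bit mono samples at the supported 16000 Hz rate, which works out to 3200 bytes per 100 ms.

```python
import base64
import json
from typing import Iterator

# 16000 samples/s * 2 bytes/sample * 0.1 s, assuming 16-bit mono PCM.
CHUNK_BYTES = 3200


def audio_messages(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[str]:
    """Yield one JSON audio message per chunk of raw PCM data."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "audio",                                  # assumed field name
            "data": base64.b64encode(chunk).decode("ascii"),  # hypothetical field name
        })
```

Small fixed-size chunks keep latency low; the last chunk is simply shorter when the audio length is not a multiple of the chunk size.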
Command Message
The flush command tells the server that no more audio will be sent. The server responds with a flush_done message after completing recognition. Any audio sent afterwards starts a new recognition session.
Server Messages
Ready Message
In transcript messages, the is_final field distinguishes partial results (false) from complete sentences (true). The confidence score and word-level information are only valid when is_final is true. segment_id is only present when is_final=true and LLM transform is enabled; it links subsequent llm messages to this segment.
When LLM transform is enabled, each final transcript triggers a sequence of llm messages that stream the LLM output incrementally. Concatenate all delta values until is_final=true to get the full transformed text. The segment_id links this output to its corresponding transcript message. For a full-text post-flush pass, segment_id is absent.
Complete Example
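The sketch below walks through one session end to end: init, audio, flush, then reading server messages until flush_done. It is transport-agnostic: send and recv are callables you would wire to a WebSocket client of your choice. The message type values (init, audio, transcript, llm, error, flush_done) and the is_final, segment_id, and delta fields come from this page; the data field, the flush command shape, and the transcript text field are assumptions. For simplicity it sends all audio before reading results, whereas a live client would send and receive concurrently, and it stops at flush_done, so a post-flush full-text llm message would need additional handling.

```python
import base64
import json
from typing import Callable, Dict, Iterable, List


def run_session(send: Callable[[str], None], recv: Callable[[], str],
                chunks: Iterable[bytes], init_params: Dict) -> Dict:
    """Drive one recognition session and collect final transcripts and LLM output."""
    send(json.dumps({"type": "init", **init_params}))
    for chunk in chunks:
        payload = base64.b64encode(chunk).decode("ascii")
        send(json.dumps({"type": "audio", "data": payload}))   # "data" is hypothetical
    send(json.dumps({"type": "command", "command": "flush"}))  # hypothetical shape

    finals: List[str] = []
    llm_parts: Dict[object, List[str]] = {}
    while True:
        message = json.loads(recv())
        kind = message.get("type")
        if kind == "transcript" and message.get("is_final"):
            finals.append(message["text"])  # "text" is an assumed field name
        elif kind == "llm":
            # Group streamed deltas by segment_id (None for a full-text pass).
            key = message.get("segment_id")
            llm_parts.setdefault(key, []).append(message.get("delta", ""))
        elif kind == "error":
            raise RuntimeError(f"server error: {message}")
        elif kind == "flush_done":
            break
        # Interim transcripts and other message types are ignored here.
    llm_text = {key: "".join(parts) for key, parts in llm_parts.items()}
    return {"transcripts": finals, "llm": llm_text}
```

Because the transport is injected, the same function can be exercised with canned messages before connecting it to a real socket.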
Best Practices
Audio Format Selection
- PCM: Best for real-time WebSocket streaming, requires exact sample rate specification
- WAV: Good for REST API, includes format headers
- MP3: Compressed format, good for file uploads, requires decoding
Sample Rate Guidelines
- 8kHz: Telephony quality, sufficient for phone recordings
- 16kHz: Standard quality, good balance of quality and file size
- 22.05kHz: Radio quality
- 44.1kHz/48kHz: High-quality audio, use for professional recordings
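Because the REST API requires the sample rate as a parameter, it can help to read it from the file instead of hard-coding it. For WAV input, Python's standard wave module exposes the rate stored in the header; the sketch below is one way to do that and does not apply to headerless PCM.

```python
import wave


def wav_sample_rate(path_or_file) -> int:
    """Return the sample rate in Hz recorded in a WAV file's header."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getframerate()
```

The function accepts either a file path or a binary file object, so it can be used on uploads held in memory.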
Language Specification
Always specify the language when known; doing so improves recognition accuracy.
Timestamps
Set enable_timestamps to true to receive word-level timing information.
Speaker Diarization
Set enable_speaker_diarization to true for multi-speaker audio.
Error Handling
Implement robust error handling: watch for error messages on WebSocket connections and retry REST requests that are rate limited.
Common Use Cases
Meeting Transcription
Transcribe meeting recordings with speaker identification.
Live Captioning
Use the WebSocket API for real-time captioning.
Voice Commands
Process voice commands with real-time recognition.
Audio Content Indexing
Index audio content for search.
Rate Limits and Quotas
- Implement retry logic with exponential backoff for 429 responses
- Consider WebSocket API for continuous streaming scenarios
- Batch process large audio files during off-peak hours
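Retry with exponential backoff can be sketched as follows. The example wraps any callable that raises urllib.error.HTTPError and retries only on status 429. The base delay, cap, retry count, and jitter are illustrative choices, not values recommended by VoxNexus.

```python
import random
import time
import urllib.error
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delays that double on each attempt and never exceed the cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(call: Callable[[], T], max_retries: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying with exponential backoff when it fails with HTTP 429."""
    for delay in backoff_delays(max_retries):
        try:
            return call()
        except urllib.error.HTTPError as error:
            if error.code != 429:
                raise  # other errors are not retried
            # A little random jitter avoids synchronized retries across clients.
            sleep(delay + random.uniform(0, delay / 10))
    return call()  # final attempt; any error propagates to the caller
```

With the defaults, backoff_delays(4) returns [1.0, 2.0, 4.0, 8.0]. The sleep parameter exists so that the waiting can be replaced in tests.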