Overview
The Text-to-Speech (TTS) API converts text into natural-sounding speech audio. VoxNexus supports both a REST API and a WebSocket API for TTS operations.

REST API
The REST API endpoint `/v1/tts` supports synchronous and streaming audio generation.
Basic Usage
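A minimal synchronous request might look like the following sketch. The JSON field names (`text`, `voice_id`, `language`, `sample_rate`, ...) are assumptions inferred from the parameter descriptions in this document, and the `Bearer` auth scheme is likewise assumed:

```python
import json
import urllib.request

API_URL = "https://api.voxnexus.ai/v1/tts"  # assumed HTTPS host, mirroring the WebSocket host

def build_tts_payload(text, voice_id, **options):
    """Assemble a request body; field names are assumptions, not confirmed names."""
    payload = {"text": text, "voice_id": voice_id}
    payload.update(options)  # e.g. language="en-US", sample_rate=24000
    return payload

def synthesize(api_key, text, voice_id, **options):
    """POST the payload and return the raw audio bytes (WAV by default)."""
    body = json.dumps(build_tts_payload(text, voice_id, **options)).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    audio = synthesize("YOUR_API_KEY", "Hello, world!", "voice_abc123", language="en-US")
    with open("hello.wav", "wb") as f:
        f.write(audio)
```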
Request Parameters
- Text: The text content to convert to speech.
- Voice ID: Unique identifier of the voice to use. Use the `/v1/voices` endpoint to browse available voices.
- Language: Language or locale code. Supports both ISO 639-1 language codes (e.g., `en`, `zh`) and BCP 47 locale codes (e.g., `en-US`, `zh-CN`). When a language code is provided, the system automatically resolves it to the most common locale (e.g., `en` → `en-US`). Optional, but recommended for better accuracy.
- Audio format: Supported values: `wav`, `pcm`. Default: `wav`.
- Sample rate: Sample rate in Hz. Supported values: `16000`, `24000`, `48000`. Default: `16000`.
- Bit rate: Bit rate in kbps. Not yet supported; reserved for future compressed-format support. Default: `128`.
- Speech rate: Speech rate multiplier. Range: `0.5` to `2.0`. Default: `1.0`.
- Pitch: Pitch offset in semitones. Range: `-12` to `12`. Default: `0`.
- Volume: Volume multiplier. Range: `0.0` to `1.0`. Default: `1.0`.
- Voice configuration: Voice-specific configuration object. Properties depend on the selected voice. Check voice details using the `/v1/voices/{voice_id}` endpoint.

Response
The API returns audio data in the requested format. Response headers include metadata:

- `X-Request-ID`: Unique request identifier
- `X-Voice-ID`: Voice ID used for synthesis
- `X-Language`: Language code
- `X-Audio-Format`: Audio format
- `X-Sample-Rate`: Sample rate
- `X-Duration-Ms`: Audio duration in milliseconds
- `X-Created-At`: Creation timestamp
- `Transfer-Encoding`: Transfer encoding (defaults to chunked streaming)
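These headers can be read off the response object before consuming the body. A small helper, taking any mapping with `.get()` (such as `resp.headers` on a urllib response); header names are taken from the list above:

```python
def read_tts_metadata(headers):
    """Extract the documented metadata headers from a response-header mapping."""
    return {
        "request_id": headers.get("X-Request-ID"),
        "voice_id": headers.get("X-Voice-ID"),
        "audio_format": headers.get("X-Audio-Format"),
        "duration_ms": int(headers.get("X-Duration-Ms", 0)),
    }
```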
Streaming Response
By default, the API uses chunked transfer encoding for streaming audio data. This allows you to start playing audio while it's still being generated, reducing latency.

WebSocket API
The WebSocket API provides real-time bidirectional communication for TTS operations, ideal for interactive applications.

Connection
Connect to `wss://api.voxnexus.ai/v1/tts/realtime` with authentication:
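A connection sketch using the third-party `websockets` package; the bearer-token header is an assumption, so check your account documentation for the actual auth mechanism:

```python
WS_URL = "wss://api.voxnexus.ai/v1/tts/realtime"

async def connect(api_key):
    # Third-party dependency: pip install websockets.
    # The header kwarg is `extra_headers` in the classic client
    # (`additional_headers` in websockets >= 13's new asyncio client).
    import websockets
    return await websockets.connect(
        WS_URL,
        extra_headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
```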
Message Flow
- Initialize: Send an `init` message to configure voice parameters
- Send Text: Send `text` messages with content to synthesize
- Receive Audio: Receive `audio` messages with Base64-encoded audio data
- Handle Errors: Monitor for `error` messages
Initialization Message
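The original payload example is not reproduced here. A plausible `init` message, written as a Python dict before JSON serialization, with field names assumed from the REST parameters:

```python
# All field names below are assumptions inferred from the REST parameter list.
init_message = {
    "type": "init",              # message type, per the flow above
    "voice_id": "voice_abc123",
    "language": "en-US",
    "audio_format": "pcm",       # PCM is recommended for realtime streaming
    "sample_rate": 16000,
}
```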
Text Message
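A `text` message carries the content to synthesize. This shape is an assumption consistent with the message flow above:

```python
# Assumed shape: a type tag plus the text to synthesize.
text_message = {
    "type": "text",
    "text": "Hello from the realtime TTS API.",
}
```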
Audio Response
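Per the message flow above, `audio` messages carry Base64-encoded audio data. The exact fields, including the end-of-stream flag shown here, are assumptions; decoding uses the stdlib `base64` module:

```python
import base64

# A received frame after json.loads(); the payload here is a stand-in.
audio_message = {
    "type": "audio",
    "audio": base64.b64encode(b"\x00\x01\x02\x03").decode("ascii"),
    "is_final": False,  # assumed end-of-stream flag
}

# Recover the raw audio bytes for playback or buffering.
pcm_bytes = base64.b64decode(audio_message["audio"])
```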
Complete Example
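An end-to-end sketch with the third-party `websockets` package, under the same assumed message shapes and auth header as above; it sends `init` and `text`, collects `audio` frames until an assumed final-chunk flag, and raises on `error`:

```python
import asyncio
import base64
import json

WS_URL = "wss://api.voxnexus.ai/v1/tts/realtime"

def make_message(msg_type, **fields):
    """Serialize a protocol message; the {"type": ...} envelope is an assumption."""
    return json.dumps({"type": msg_type, **fields})

async def speak(api_key, text, voice_id="voice_abc123"):
    import websockets  # third-party: pip install websockets
    async with websockets.connect(
        WS_URL, extra_headers={"Authorization": f"Bearer {api_key}"}
    ) as ws:
        await ws.send(make_message("init", voice_id=voice_id, language="en-US",
                                   audio_format="pcm", sample_rate=16000))
        await ws.send(make_message("text", text=text))
        pcm = bytearray()
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "audio":
                pcm.extend(base64.b64decode(msg["audio"]))
                if msg.get("is_final"):  # assumed end-of-stream flag
                    break
            elif msg["type"] == "error":
                raise RuntimeError(msg.get("message", "TTS error"))
        return bytes(pcm)

if __name__ == "__main__":
    audio = asyncio.run(speak("YOUR_API_KEY", "Hello, world!"))
    with open("out.pcm", "wb") as f:
        f.write(audio)
```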
Best Practices
Voice Selection
- Use the `/v1/voices` endpoint to browse available voices
- Filter voices by language, gender, age, or style
- Test voices using sample audio URLs before production use
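Browsing voices might look like the sketch below. The query-parameter names (`language`, `gender`, ...) are assumptions based on the filter dimensions listed above, as is the auth scheme:

```python
import json
import urllib.parse
import urllib.request

VOICES_URL = "https://api.voxnexus.ai/v1/voices"

def build_voices_url(**filters):
    """Attach assumed filter query parameters, e.g. language='en', gender='female'."""
    query = urllib.parse.urlencode(filters)
    return f"{VOICES_URL}?{query}" if query else VOICES_URL

def list_voices(api_key, **filters):
    """Fetch and decode the voice catalog, optionally filtered."""
    req = urllib.request.Request(
        build_voices_url(**filters),
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```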
Performance Optimization
- Use streaming for long texts to reduce perceived latency
- Choose appropriate sample rates (16kHz is sufficient for most use cases)
- Use PCM format for real-time WebSocket streaming, WAV for REST API
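The streaming advice above can be sketched with a small consumer that reads the chunked REST response in fixed-size pieces and hands each to playback as it arrives; `play_chunk` is a hypothetical callback standing in for your audio player:

```python
def stream_audio(resp, play_chunk, chunk_size=4096):
    """Consume a chunked TTS response incrementally instead of buffering it all.

    resp: any file-like object, e.g. the response from urllib.request.urlopen().
    play_chunk: callback invoked with each chunk of audio bytes.
    """
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:  # EOF: the server finished generating audio
            break
        play_chunk(chunk)
```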
Error Handling
Always implement proper error handling.

Rate Limits and Quotas
- Implement exponential backoff for `429` responses
- Consider using the WebSocket API for high-frequency use cases
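An exponential-backoff sketch for `429` responses using stdlib urllib; the retry count and base delay are illustrative choices, not values prescribed by the API:

```python
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base_delay=1.0):
    """Delay before retry number `attempt` (0-based): base * 2**attempt seconds."""
    return base_delay * (2 ** attempt)

def post_with_backoff(request, max_retries=5):
    """Send a prepared urllib Request, retrying on HTTP 429 with growing delays."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(request) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries - 1:
                raise  # not rate-limited, or out of retries
            time.sleep(backoff_delay(attempt))  # 1s, 2s, 4s, ...
```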