Overview
The Text-to-Speech (TTS) API converts text into natural-sounding speech audio. VoxNexus supports both a REST API and a WebSocket API for TTS operations.
REST API
The REST API endpoint /v1/tts supports synchronous and streaming audio generation.
Basic Usage
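As a rough illustration, a request to /v1/tts could be assembled as in the sketch below. Only the endpoint path comes from this page. The base URL (inferred from the WebSocket host), the POST method, the Bearer authorization scheme, and the JSON field names text and voice_id are all assumptions, so check them against your account documentation before relying on them.

```python
import json
import urllib.request

# Assumption: the REST API shares the host used by the WebSocket endpoint.
BASE_URL = "https://api.voxnexus.ai"


def build_tts_request(api_key: str, text: str, voice_id: str, **options) -> urllib.request.Request:
    """Build (but do not send) a synthesis request for /v1/tts.

    The field names "text" and "voice_id", the POST method, and the
    Bearer scheme are hypothetical; this page does not confirm them.
    """
    payload = {"text": text, "voice_id": voice_id, **options}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, text: str, voice_id: str, **options) -> bytes:
    """Send the request and return the raw audio bytes from the response body."""
    request = build_tts_request(api_key, text, voice_id, **options)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.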
Request Parameters
- The text content to convert to speech. Supports plain text, and SSML when ssml is set to true.
- Unique identifier of the voice to use. Use the /v1/voices endpoint to browse available voices.
- Language tag made of an ISO 639-1 language code plus a region code (e.g., zh-CN, en-US). Optional, but recommended for better accuracy.
- Audio format. Supported values: mp3, wav, ogg, pcm, webm. Default: mp3.
- Sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000. Default: 16000.
- Bit rate in kbps. Only valid for compressed formats (mp3, ogg). Default: 128.
- Speech rate multiplier. Range: 0.5 to 2.0. Default: 1.0.
- Pitch offset in semitones. Range: -12 to 12. Default: 0.
- Volume multiplier. Range: 0.0 to 1.0. Default: 1.0.
- Whether to interpret text as SSML. Default: false.
- Voice-specific configuration object. Properties depend on the selected voice. Check voice details using the /v1/voices/{voice_id} endpoint.
Response
The API returns audio data in the requested format. Response headers include metadata:
- X-Request-ID: Unique request identifier
- X-Voice-ID: Voice ID used for synthesis
- X-Language: Language code
- X-Audio-Format: Audio format
- X-Sample-Rate: Sample rate
- X-Duration-Ms: Audio duration in milliseconds
- X-Created-At: Creation timestamp
- X-RateLimit-Remaining: Remaining requests
- X-Quota-Used: Credits consumed
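The metadata headers listed above can be collected into a small typed record. The sketch below uses the header names from this page, but it assumes that X-Sample-Rate and X-Duration-Ms carry plain integers. The names TtsMetadata and parse_tts_headers are illustrative and not part of any VoxNexus SDK.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class TtsMetadata:
    request_id: Optional[str]
    voice_id: Optional[str]
    audio_format: Optional[str]
    sample_rate_hz: Optional[int]
    duration_ms: Optional[int]


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert a header value to int, returning None if absent or malformed."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def parse_tts_headers(headers: Mapping[str, str]) -> TtsMetadata:
    """Extract synthesis metadata from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return TtsMetadata(
        request_id=lowered.get("x-request-id"),
        voice_id=lowered.get("x-voice-id"),
        audio_format=lowered.get("x-audio-format"),
        sample_rate_hz=_to_int(lowered.get("x-sample-rate")),
        duration_ms=_to_int(lowered.get("x-duration-ms")),
    )
```

Logging request_id alongside any failure makes it easier to refer to a specific request when asking for support.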
Streaming Response
By default, the API uses chunked transfer encoding for streaming audio data. This allows you to start playing audio while it’s still being generated, reducing latency.
WebSocket API
The WebSocket API provides real-time bidirectional communication for TTS operations, ideal for interactive applications.
Connection
Connect to wss://api.voxnexus.ai/v1/tts/realtime with an authentication header.
Message Flow
- Initialize: Send an init message to configure voice parameters
- Send Text: Send text messages with content to synthesize
- Receive Audio: Receive audio messages with Base64-encoded audio data
- Handle Errors: Monitor for error messages
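The four steps above can be sketched independently of any particular WebSocket client. The message kinds (init, text, audio, error) and the Base64 encoding come from this page. The JSON envelope itself is an assumption: the keys type, voice_id, format, text, audio, and message are hypothetical placeholders, so adjust them to the actual message schema.

```python
import base64
import json
from typing import Optional


class TtsStreamError(Exception):
    """Raised when the server reports an error message."""


def make_init_message(voice_id: str, audio_format: str = "mp3") -> str:
    # Hypothetical shape: the real init payload may use different keys.
    return json.dumps({"type": "init", "voice_id": voice_id, "format": audio_format})


def make_text_message(text: str) -> str:
    # Hypothetical shape: the real text payload may use different keys.
    return json.dumps({"type": "text", "text": text})


def handle_message(raw: str) -> Optional[bytes]:
    """Return decoded audio bytes for audio messages, raise on error messages.

    Any other message kind is ignored and yields None.
    """
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "audio":
        # Audio arrives Base64-encoded; decode it back to raw bytes.
        return base64.b64decode(message["audio"])
    if kind == "error":
        raise TtsStreamError(message.get("message", "unknown error"))
    return None
```

A client would send make_init_message once after connecting, then one make_text_message per utterance, and pass every incoming frame to handle_message.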
Initialization Message
Text Message
Audio Response
Complete Example
Best Practices
Voice Selection
- Use the /v1/voices endpoint to browse available voices
- Filter voices by language, gender, age, or style
- Test voices using sample audio URLs before production use
Performance Optimization
- Use streaming for long texts to reduce perceived latency
- Choose appropriate sample rates (16kHz is sufficient for most use cases)
- Use compressed formats (mp3) for network efficiency
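To benefit from streaming, read the response incrementally instead of buffering the whole body. The sketch below works with any binary file-like object. Python's http.client response objects fit this shape and decode chunked transfer encoding transparently when read this way. The helper names and the 4096-byte chunk size are illustrative choices.

```python
from typing import BinaryIO, Iterator


def iter_audio_chunks(response: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio data piece by piece until the stream is exhausted."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        yield chunk


def save_stream(response: BinaryIO, destination: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy the stream to destination as it arrives and return the byte count."""
    total = 0
    for chunk in iter_audio_chunks(response, chunk_size):
        destination.write(chunk)
        total += len(chunk)
    return total
```

For playback, each chunk could be handed to an audio player or decoder in place of destination.write, so that output starts before synthesis has finished.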
Error Handling
Always implement proper error handling.
SSML Support
When using SSML, set ssml: true and format your text as SSML markup.
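A minimal sketch of producing SSML text safely is shown here. It uses the speak and break elements from the W3C SSML specification. Which SSML elements VoxNexus actually supports is not stated on this page, and the payload key text is an assumption (only ssml is named here).

```python
from typing import Sequence
from xml.sax.saxutils import escape


def build_ssml(sentences: Sequence[str], pause_ms: int = 300) -> str:
    """Wrap sentences in a <speak> element, separated by fixed-length pauses."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, <, and > so user text cannot break the XML structure.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"


ssml_text = build_ssml(["Tom & Jerry", "Hello"], pause_ms=500)
# ssml_text == '<speak>Tom &amp; Jerry<break time="500ms"/>Hello</speak>'

# Hypothetical request body: "text" is an assumed field name.
payload = {"text": ssml_text, "ssml": True}
```

Escaping matters because unescaped characters such as & in user-supplied text would make the SSML document invalid XML.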
Rate Limits and Quotas
- Monitor the X-RateLimit-Remaining header to track remaining requests
- Check X-Quota-Used to understand credit consumption
- Implement exponential backoff for 429 responses
- Consider using the WebSocket API for high-frequency use cases
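Exponential backoff for 429 responses can be sketched as follows. The helper is transport-agnostic: send is any callable you supply that performs one request and returns a (status, body) pair. The attempt limit, delay values, and jitter are illustrative defaults, not values recommended by VoxNexus.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Retry send() while it returns HTTP 429, doubling the wait each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Delay grows as base_delay * 2^attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, delay / 2))
    # Still rate limited after the final attempt; return the last response.
    return status, body
```

Passing sleep as a parameter keeps the function easy to test without real waiting, and the returned status lets the caller decide what to do if the limit persists.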