Text-to-Speech Guide

Overview

The Text-to-Speech (TTS) API converts text into natural-sounding speech audio. VoxNexus exposes TTS through both a REST API and a WebSocket API.

REST API

The REST API endpoint /v1/tts supports synchronous and streaming audio generation.

Basic Usage

curl -X POST https://api.voxnexus.ai/v1/tts \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test message",
    "voice_id": "vl-xiaoxiao",
    "format": "wav",
    "sample_rate": 16000
  }' \
  --output output.wav

Request Parameters

text (string, required): The text content to convert to speech.

voice_id (string, required): Unique identifier of the voice to use. Use the /v1/voices endpoint to browse available voices.

language (string): Language or locale code. Supports both ISO 639-1 language codes (e.g., en, zh) and BCP 47 locale codes (e.g., en-US, zh-CN). When a language code is provided, the system will automatically resolve it to the most common locale (e.g., en → en-US). Optional, but recommended for better accuracy.

format (string): Audio format. Supported values: wav, pcm. Default: wav.

sample_rate (integer): Sample rate in Hz. Supported values: 16000, 24000, 48000. Default: 16000.

bit_rate (integer, deprecated): Bit rate in kbps. Not supported yet; reserved for future compressed format support. Default: 128.

speed (number): Speech rate multiplier. Range: 0.5 to 2.0. Default: 1.0.

pitch (integer): Pitch offset in semitones. Range: -12 to 12. Default: 0.

volume (number): Volume multiplier. Range: 0.0 to 1.0. Default: 1.0.

voice_config (object): Voice-specific configuration object. Properties depend on the selected voice. Check voice details using the /v1/voices/{voice_id} endpoint.
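
The documented ranges can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The following is a minimal sketch, assuming the ranges listed above are enforced as written; validateTtsRequest and its error wording are hypothetical and not part of the VoxNexus API.

```javascript
// Client-side check of a /v1/tts request body against the documented ranges.
// The function name and error strings are illustrative, not part of the API.
const SUPPORTED_FORMATS = ['wav', 'pcm'];
const SUPPORTED_SAMPLE_RATES = [16000, 24000, 48000];

function validateTtsRequest(params) {
  const errors = [];
  const inRange = (value, min, max) =>
    typeof value === 'number' && value >= min && value <= max;

  // Required fields
  if (typeof params.text !== 'string' || params.text.length === 0) {
    errors.push('text is required');
  }
  if (typeof params.voice_id !== 'string' || params.voice_id.length === 0) {
    errors.push('voice_id is required');
  }

  // Optional fields are only checked when present
  if (params.format !== undefined && !SUPPORTED_FORMATS.includes(params.format)) {
    errors.push('format must be wav or pcm');
  }
  if (params.sample_rate !== undefined &&
      !SUPPORTED_SAMPLE_RATES.includes(params.sample_rate)) {
    errors.push('sample_rate must be 16000, 24000, or 48000');
  }
  if (params.speed !== undefined && !inRange(params.speed, 0.5, 2.0)) {
    errors.push('speed must be between 0.5 and 2.0');
  }
  if (params.pitch !== undefined &&
      !(Number.isInteger(params.pitch) && inRange(params.pitch, -12, 12))) {
    errors.push('pitch must be an integer between -12 and 12');
  }
  if (params.volume !== undefined && !inRange(params.volume, 0.0, 1.0)) {
    errors.push('volume must be between 0.0 and 1.0');
  }
  return errors;
}
```

An empty array means the body passed every check; otherwise each entry names one offending field.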

Response

The API returns audio data in the requested format. Response headers include metadata:

X-Request-ID: Unique request identifier
X-Voice-ID: Voice ID used for synthesis
X-Language: Language code
X-Audio-Format: Audio format
X-Sample-Rate: Sample rate
X-Duration-Ms: Audio duration in milliseconds
X-Created-At: Creation timestamp
Transfer-Encoding: Transfer encoding (defaults to chunked streaming)

HTTP/1.1 200 OK
X-Request-ID: req_1234567890
X-Voice-ID: vl-xiaoxiao
X-Language: zh-CN
X-Audio-Format: wav
X-Sample-Rate: 16000
X-Duration-Ms: 2500
X-Created-At: 2024-01-01T12:00:00Z
Transfer-Encoding: chunked
Content-Type: audio/wav

[Audio binary data]
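
The metadata headers can be gathered into a single object for logging or playback setup. This is a sketch only: readTtsMetadata is a hypothetical helper, and it accepts any object with a get(name) method, such as the headers property of a fetch response.

```javascript
// Collect the documented metadata headers into a plain object.
// `headers` is anything with a get(name) method, e.g. response.headers from fetch.
function readTtsMetadata(headers) {
  const sampleRate = headers.get('X-Sample-Rate');
  const durationMs = headers.get('X-Duration-Ms');
  return {
    requestId: headers.get('X-Request-ID'),
    voiceId: headers.get('X-Voice-ID'),
    language: headers.get('X-Language'),
    format: headers.get('X-Audio-Format'),
    // Numeric headers arrive as strings; convert them, keeping null when absent.
    sampleRate: sampleRate == null ? null : Number(sampleRate),
    durationMs: durationMs == null ? null : Number(durationMs),
    createdAt: headers.get('X-Created-At'),
  };
}
```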

Streaming Response

By default, the API uses chunked transfer encoding for streaming audio data. This allows you to start playing audio while it’s still being generated, reducing latency.
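
One way to consume the chunked response in JavaScript is to read the response body incrementally instead of waiting for the whole payload. The sketch below assumes a runtime whose fetch exposes a streaming body (modern browsers, Node 18 or later); readAudioStream and concatChunks are hypothetical helper names.

```javascript
// Join the chunks received so far into one contiguous byte array.
function concatChunks(chunks) {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// Read a fetch response body chunk by chunk, calling onChunk as data arrives.
// Resolves with the complete audio bytes once the stream ends.
async function readAudioStream(response, onChunk) {
  const reader = response.body.getReader();
  const chunks = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    if (onChunk) onChunk(value); // e.g. hand the chunk to an audio player
  }
  return concatChunks(chunks);
}
```

Passing a callback as onChunk lets playback begin on the first chunk, while the resolved value is still available for saving the full clip.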

WebSocket API

The WebSocket API provides real-time bidirectional communication for TTS operations, ideal for interactive applications.

Connection

Connect to wss://api.voxnexus.ai/v1/tts/realtime with authentication:

// Connect with token as query parameter (recommended)
const ws = new WebSocket('wss://api.voxnexus.ai/v1/tts/realtime?token=YOUR_API_KEY');
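
If the key might contain characters that are not URL-safe, building the URL programmatically avoids encoding mistakes. A small sketch; buildRealtimeUrl is a hypothetical helper, not part of any VoxNexus SDK.

```javascript
// Build the realtime URL with the token encoded as a query parameter.
function buildRealtimeUrl(token) {
  const url = new URL('wss://api.voxnexus.ai/v1/tts/realtime');
  url.searchParams.set('token', token); // percent-encodes reserved characters
  return url.toString();
}

// Usage:
// const ws = new WebSocket(buildRealtimeUrl('YOUR_API_KEY'));
```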

Message Flow

Initialize: Send an init message to configure voice parameters, then wait for the server's ready message
Send Text: Send text messages with content to synthesize
Receive Audio: Receive audio messages with Base64-encoded audio data
Handle Errors: Monitor for error messages

Initialization Message

{
  "type": "init",
  "voice_id": "vl-xiaoxiao",
  "language": "zh-CN",
  "format": "wav",
  "sample_rate": 16000,
  "speed": 1.0,
  "pitch": 0,
  "volume": 1.0,
  "voice_config": {
    "style": "cheerful",
    "role": "Girl",
    "degree": 0.5
  }
}

Text Message

{
  "type": "text",
  "text": "Hello, this is a test",
  "is_final": false
}
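
When text arrives in pieces, for example from a sentence splitter, each piece can be sent as its own text message. The sketch below assumes that is_final: true marks the last segment of an utterance, which the examples in this guide suggest but do not state outright; toTextMessages is a hypothetical helper.

```javascript
// Turn a list of text segments into "text" messages.
// Assumption: is_final: true marks the last segment of the utterance.
function toTextMessages(segments) {
  return segments.map((text, index) => ({
    type: 'text',
    text,
    is_final: index === segments.length - 1,
  }));
}

// Usage with an open, initialized WebSocket `ws`:
// for (const message of toTextMessages(['Hello,', 'this is a test'])) {
//   ws.send(JSON.stringify(message));
// }
```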

Audio Response

{
  "type": "audio",
  "data": "base64-encoded-audio-data",
  "is_final": false
}
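
The data field has to be decoded from Base64 before it can be played or written to disk. A minimal sketch, assuming a runtime with a global atob (browsers, Node 16 or later); decodeAudioMessage is a hypothetical helper name.

```javascript
// Decode the Base64 "data" field of an audio message into raw bytes.
function decodeAudioMessage(message) {
  const binary = atob(message.data); // one character per decoded byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```

The resulting Uint8Array can be appended to a playback buffer; atob alone returns a binary string, which most audio APIs cannot consume directly.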

Complete Example

// Connect with token as query parameter
const ws = new WebSocket('wss://api.voxnexus.ai/v1/tts/realtime?token=YOUR_API_KEY');

ws.onopen = () => {
  // Initialize
  ws.send(JSON.stringify({
    type: 'init',
    voice_id: 'vl-xiaoxiao',
    format: 'wav',
    sample_rate: 16000
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  switch (message.type) {
    case 'ready':
      console.log('Ready:', message.request_id);
      // Send text to synthesize
      ws.send(JSON.stringify({
        type: 'text',
        text: 'Hello, this is a test',
        is_final: true
      }));
      break;
      
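    case 'audio': {
      // Decode Base64 into raw bytes (atob alone returns a binary string)
      const audioBytes = Uint8Array.from(atob(message.data), (c) => c.charCodeAt(0));
      // Handle audio playback, e.g. queue audioBytes for your player
      break;
    }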
      
    case 'error':
      console.error('Error:', message.error);
      break;
  }
};

Best Practices

Voice Selection

Use the /v1/voices endpoint to browse available voices
Filter voices by language, gender, age, or style
Test voices using sample audio URLs before production use

Performance Optimization

Use streaming for long texts to reduce perceived latency
Choose appropriate sample rates (16kHz is sufficient for most use cases)
Use PCM format for real-time WebSocket streaming, WAV for REST API
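
The cost of a higher sample rate is easy to estimate, since uncompressed audio grows linearly with it. The sketch below assumes 16-bit mono samples, which this guide does not state; treat the sample width and channel count as placeholders until confirmed for your voice.

```javascript
// Estimate the raw PCM payload size of a clip.
// Assumption: 16-bit (2-byte) mono samples; the guide does not specify the sample width.
function estimatePcmBytes(durationMs, sampleRate, bytesPerSample = 2, channels = 1) {
  return Math.round((durationMs / 1000) * sampleRate * bytesPerSample * channels);
}

// Usage:
// estimatePcmBytes(2500, 16000); // 2.5 s at 16 kHz
// estimatePcmBytes(2500, 48000); // same clip at 48 kHz, three times larger
```

Over the WebSocket API the audio is Base64-encoded, which adds roughly a third on top of the raw size.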

Error Handling

Always implement proper error handling:

try {
  const response = await fetch('https://api.voxnexus.ai/v1/tts', {
    method: 'POST',
    headers: {
      'X-Api-Key': 'YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      text: 'Hello',
      voice_id: 'vl-xiaoxiao'
    })
  });
  
  if (!response.ok) {
    const error = await response.json();
    throw new Error(error.error || 'Request failed');
  }
  
  // Handle audio data
} catch (error) {
  console.error('TTS Error:', error);
}

Rate Limits and Quotas

Implement exponential backoff for 429 responses
Consider using WebSocket API for high-frequency use cases
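
Exponential backoff for 429 responses could look like the sketch below. The base delay, cap, retry count, and jitter are illustrative values rather than documented limits, and backoffDelayMs and fetchWithBackoff are hypothetical helpers.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Repeat a request while the server answers 429, up to maxRetries retries.
// Any other status (success or failure) is returned to the caller unchanged.
async function fetchWithBackoff(url, options, maxRetries = 4) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, options);
    if (response.status !== 429 || attempt >= maxRetries) {
      return response;
    }
    // Random jitter (0-100 ms) keeps concurrent clients from retrying in lockstep.
    await sleep(backoffDelayMs(attempt) + Math.random() * 100);
  }
}
```

With the defaults, the waits are about 500 ms, 1 s, 2 s, and 4 s before the final 429 response is handed back to the caller.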

Common Use Cases

Interactive Voice Response (IVR)

Use WebSocket API for real-time synthesis in IVR systems:

// Synthesize prompts in real-time
ws.send(JSON.stringify({
  type: 'text',
  text: 'Please press 1 for sales',
  is_final: false
}));

Content Narration

Use REST API for batch processing of long-form content:

# Process entire articles or books
curl -X POST https://api.voxnexus.ai/v1/tts \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @article.json \
  --output narration.wav

Accessibility Features

Generate audio versions of text content for accessibility:

async function generateAudioAccessibility(text) {
  const response = await fetch('https://api.voxnexus.ai/v1/tts', {
    method: 'POST',
    headers: {
      'X-Api-Key': 'YOUR_API_KEY',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      text: text,
      voice_id: 'vl-xiaoxiao',
      format: 'wav',
      sample_rate: 24000 // Higher quality for better clarity
    })
  });
  
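  // Surface HTTP errors instead of returning an error body as audio
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`);
  }

  return response.blob();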
}
