Skip to main content

Overview

The VoxNexus WebSocket API provides real-time bidirectional communication for voice services. It’s ideal for applications requiring low-latency, interactive voice processing such as:
  • Real-time voice assistants
  • Live transcription services
  • Interactive voice response (IVR) systems
  • Voice-controlled applications
  • Real-time captioning

Connection

Endpoints

  • Text-to-Speech: wss://api.voxnexus.ai/v1/tts/realtime
  • Speech-to-Text: wss://api.voxnexus.ai/v1/stt/realtime

Authentication

Authenticate using a Bearer token in the connection headers. Note: passing custom headers to the WebSocket constructor works in Node.js clients (e.g. the ws package); the browser WebSocket API does not support custom headers, so browser apps should use an alternative such as a token query parameter or a first-message auth exchange if supported:
const ws = new WebSocket('wss://api.voxnexus.ai/v1/tts/realtime', {
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});

Connection Lifecycle

  1. Connect: Establish WebSocket connection with authentication
  2. Initialize: Send initialization message with configuration
  3. Ready: Receive ready confirmation from server
  4. Exchange: Send/receive data messages
  5. Close: Gracefully close connection when done

Text-to-Speech WebSocket

Message Types

Client Messages

Initialization (init)
{
  "type": "init",
  "voice_id": "vl-xiaoxiao",
  "language": "zh-CN",
  "format": "mp3",
  "sample_rate": 16000,
  "speed": 1.0,
  "pitch": 0,
  "volume": 1.0,
  "ssml": false,
  "voice_config": {
    "style": "cheerful"
  }
}
Text (text)
{
  "type": "text",
  "text": "Hello, this is a test message",
  "is_final": false
}

Server Messages

Ready (ready)
{
  "type": "ready",
  "request_id": "req_1234567890",
  "voice_id": "vl-xiaoxiao",
  "language": "zh-CN",
  "format": "mp3",
  "sample_rate": 16000
}
Audio (audio)
{
  "type": "audio",
  "data": "base64-encoded-audio-data",
  "is_final": false
}
Error (error)
{
  "type": "error",
  "error": "Invalid voice_id",
  "code": "VOICE_NOT_FOUND",
  "request_id": "req_1234567890"
}

Complete Example

/**
 * Minimal client for the VoxNexus real-time TTS WebSocket endpoint.
 *
 * Lifecycle: connect() -> initialize(config) -> awaitReady() ->
 * synthesize(text, isFinal) -> close().
 */
class TTSWebSocketClient {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.ws = null;
    this.ready = false;
    // Resolvers queued by awaitReady() before the server's "ready" arrives.
    this.readyWaiters = [];
  }

  /**
   * Opens the WebSocket connection. Resolves once the socket is open;
   * synthesis is not possible until the "ready" message (sent in response
   * to initialize()) has been received — see awaitReady().
   * NOTE(review): constructor headers work with Node's `ws` package;
   * browser WebSockets cannot set custom headers.
   */
  connect() {
    return new Promise((resolve, reject) => {
      this.ws = new WebSocket('wss://api.voxnexus.ai/v1/tts/realtime', {
        headers: {
          'Authorization': `Bearer ${this.apiKey}`
        }
      });

      this.ws.onopen = () => {
        console.log('Connected');
        resolve();
      };

      this.ws.onmessage = (event) => {
        this.handleMessage(JSON.parse(event.data));
      };

      this.ws.onerror = (error) => {
        console.error('WebSocket error:', error);
        reject(error);
      };

      this.ws.onclose = () => {
        console.log('Connection closed');
        this.ready = false;
      };
    });
  }

  /** Sends the "init" configuration message. Must be called after connect(). */
  initialize(config) {
    if (!this.ws) {
      throw new Error('Not connected. Call connect() first.');
    }
    this.ws.send(JSON.stringify({
      type: 'init',
      ...config
    }));
  }

  /**
   * Resolves once the server has acknowledged initialization with a
   * "ready" message. Resolves immediately if already ready.
   */
  awaitReady() {
    if (this.ready) {
      return Promise.resolve();
    }
    return new Promise((resolve) => {
      this.readyWaiters.push(resolve);
    });
  }

  /**
   * Sends a text segment for synthesis.
   * @param {string} text - Text to synthesize.
   * @param {boolean} [isFinal=false] - Marks the last segment of the request.
   * @throws {Error} If the server has not yet sent "ready".
   */
  synthesize(text, isFinal = false) {
    if (!this.ready) {
      throw new Error('Not ready. Wait for ready message.');
    }

    this.ws.send(JSON.stringify({
      type: 'text',
      text: text,
      is_final: isFinal
    }));
  }

  /** Dispatches a parsed server message to the matching handler. */
  handleMessage(message) {
    switch (message.type) {
      case 'ready':
        this.ready = true;
        console.log('Ready:', message.request_id);
        // Wake any callers blocked in awaitReady().
        this.readyWaiters.splice(0).forEach((resolve) => resolve());
        break;

      case 'audio':
        this.onAudio(message.data, message.is_final);
        break;

      case 'error':
        console.error('Error:', message.error);
        this.onError(message);
        break;

      default:
        // Surface unexpected message types instead of dropping them silently.
        console.warn('Unhandled message type:', message.type);
    }
  }

  /** Override: receives base64 audio chunks; isFinal marks the last one. */
  onAudio(base64Data, isFinal) {
    // Decode and handle audio
    const audioData = atob(base64Data);
    // Play audio or save to buffer
  }

  /** Override: receives server errors ({error, code, request_id}). */
  onError(error) {
    // Handle error
  }

  /** Closes the socket if it was opened. */
  close() {
    if (this.ws) {
      this.ws.close();
    }
  }
}

// Usage
const client = new TTSWebSocketClient('YOUR_API_KEY');
await client.connect();
client.initialize({
  voice_id: 'vl-xiaoxiao',
  format: 'mp3',
  sample_rate: 16000
});

// Wait for the server's ready message instead of a fixed timeout —
// a hard-coded 1 s delay races against slow connections and wastes
// time on fast ones.
await new Promise((resolve) => {
  const poll = setInterval(() => {
    if (client.ready) {
      clearInterval(poll);
      resolve();
    }
  }, 50);
});
client.synthesize('Hello, world!', true);

Speech-to-Text WebSocket

Message Types

Client Messages

Initialization (init)
{
  "type": "init",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000,
  "enable_timestamps": true,
  "enable_confidence": true,
  "enable_speaker_diarization": false,
  "keywords": ["keyword1"],
  "custom_vocabulary": ["custom_word"]
}
Audio (audio)
{
  "type": "audio",
  "data": "base64-encoded-audio-chunk"
}

Server Messages

Ready (ready)
{
  "type": "ready",
  "request_id": "req_1234567890",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000
}
Partial Result (partial)
{
  "type": "partial",
  "text": "Hello, this is"
}
Final Result (final)
{
  "type": "final",
  "text": "Hello, this is a test message.",
  "confidence": 0.95,
  "start_time_ms": 0,
  "end_time_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "start_time_ms": 0,
      "end_time_ms": 2500
    }
  ]
}
Error (error)
{
  "type": "error",
  "error": "Invalid audio format",
  "code": "UNSUPPORTED_FORMAT",
  "request_id": "req_1234567890"
}

Complete Example

/**
 * Minimal client for the VoxNexus real-time STT WebSocket endpoint.
 *
 * Lifecycle: connect() -> initialize(config) -> wait for "ready" ->
 * startRecording() -> receive partial/final transcripts -> close().
 */
class STTWebSocketClient {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.ws = null;
    this.ready = false;
    this.audioContext = null;
    this.processor = null;
    // Microphone MediaStream, kept so the mic can be released on stop.
    this.stream = null;
  }

  /**
   * Opens the WebSocket connection; resolves when the socket is open.
   * NOTE(review): constructor headers work with Node's `ws` package;
   * browser WebSockets cannot set custom headers.
   */
  async connect() {
    return new Promise((resolve, reject) => {
      this.ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime', {
        headers: {
          'Authorization': `Bearer ${this.apiKey}`
        }
      });

      this.ws.onopen = () => {
        console.log('Connected');
        resolve();
      };

      this.ws.onmessage = (event) => {
        this.handleMessage(JSON.parse(event.data));
      };

      this.ws.onerror = (error) => {
        console.error('WebSocket error:', error);
        reject(error);
      };

      this.ws.onclose = () => {
        console.log('Connection closed');
        this.ready = false;
      };
    });
  }

  /** Sends the "init" configuration message. Must be called after connect(). */
  initialize(config) {
    if (!this.ws) {
      throw new Error('Not connected. Call connect() first.');
    }
    this.ws.send(JSON.stringify({
      type: 'init',
      ...config
    }));
  }

  /**
   * Captures microphone audio, converts it to 16-bit PCM, and streams
   * base64 chunks to the server. Chunks produced before the server's
   * "ready" message are dropped.
   * NOTE(review): ScriptProcessorNode is deprecated; AudioWorklet is the
   * modern replacement but requires a separate worklet module.
   * @param {number} [sampleRate=16000] - Capture sample rate in Hz.
   */
  async startRecording(sampleRate = 16000) {
    this.stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.audioContext = new AudioContext({ sampleRate });
    const source = this.audioContext.createMediaStreamSource(this.stream);

    this.processor = this.audioContext.createScriptProcessor(4096, 1, 1);
    this.processor.onaudioprocess = (e) => {
      if (!this.ready) return;

      const audioData = e.inputBuffer.getChannelData(0);
      const pcm16 = new Int16Array(audioData.length);

      // Scale float samples in [-1, 1] to the Int16 range, clamping overflow.
      for (let i = 0; i < audioData.length; i++) {
        pcm16[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
      }

      const base64 = btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer)));
      this.ws.send(JSON.stringify({
        type: 'audio',
        data: base64
      }));
    };

    source.connect(this.processor);
    this.processor.connect(this.audioContext.destination);
  }

  /** Tears down audio processing AND releases the microphone. */
  stopRecording() {
    if (this.processor) {
      this.processor.disconnect();
      this.processor = null;
    }
    if (this.audioContext) {
      this.audioContext.close();
      this.audioContext = null;
    }
    // BUG FIX: the original never stopped the MediaStream tracks, so the
    // microphone (and the browser's recording indicator) stayed active
    // after stopRecording().
    if (this.stream) {
      this.stream.getTracks().forEach((track) => track.stop());
      this.stream = null;
    }
  }

  /** Dispatches a parsed server message to the matching handler. */
  handleMessage(message) {
    switch (message.type) {
      case 'ready':
        this.ready = true;
        console.log('Ready:', message.request_id);
        break;

      case 'partial':
        this.onPartial(message.text);
        break;

      case 'final':
        this.onFinal(message);
        break;

      case 'error':
        console.error('Error:', message.error);
        this.onError(message);
        break;

      default:
        // Surface unexpected message types instead of dropping them silently.
        console.warn('Unhandled message type:', message.type);
    }
  }

  /** Override: interim hypothesis for the in-progress utterance. */
  onPartial(text) {
    console.log('Partial:', text);
    // Update UI with interim results
  }

  /** Override: completed utterance with confidence/timing fields. */
  onFinal(result) {
    console.log('Final:', result.text);
    console.log('Confidence:', result.confidence);
    // Handle final transcription
  }

  /** Override: server-reported error ({error, code, request_id}). */
  onError(error) {
    // Handle error
  }

  /** Stops recording (releasing the mic) and closes the socket. */
  close() {
    this.stopRecording();
    if (this.ws) {
      this.ws.close();
    }
  }
}

// Usage
const client = new STTWebSocketClient('YOUR_API_KEY');
await client.connect();
client.initialize({
  format: 'pcm',
  sample_rate: 16000,
  enable_timestamps: true,
  enable_confidence: true
});

// Wait for the server's ready message instead of guessing with a fixed
// delay — a hard-coded 1 s timeout races against slow connections.
await new Promise((resolve) => {
  const poll = setInterval(() => {
    if (client.ready) {
      clearInterval(poll);
      resolve();
    }
  }, 50);
});
await client.startRecording(16000);

Best Practices

Connection Management

Reconnection Logic
/**
 * WebSocket wrapper that reconnects automatically with exponential
 * backoff (1 s, 2 s, 4 s, ...) for up to maxReconnectAttempts tries.
 */
class ReconnectingWebSocket {
  constructor(url, options) {
    this.url = url;
    this.options = options;
    this.ws = null;
    this.reconnectAttempts = 0;
    this.maxReconnectAttempts = 5;
    this.reconnectDelay = 1000; // base delay in milliseconds
  }

  /** Delay (ms) before the next reconnect attempt, doubling per attempt. */
  nextReconnectDelay() {
    return this.reconnectDelay * 2 ** this.reconnectAttempts;
  }

  connect() {
    this.ws = new WebSocket(this.url, this.options);

    this.ws.onclose = () => {
      if (this.reconnectAttempts >= this.maxReconnectAttempts) return;
      // BUG FIX: the original scheduled the retry with
      // `reconnectDelay * reconnectAttempts`, so the first retry fired
      // with a 0 ms delay and the growth was linear; this uses true
      // exponential backoff as recommended in the troubleshooting notes.
      setTimeout(() => {
        this.reconnectAttempts++;
        this.connect();
      }, this.nextReconnectDelay());
    };

    this.ws.onopen = () => {
      this.reconnectAttempts = 0; // a healthy connection resets the budget
    };
  }
}
Heartbeat/Ping
// Application-level heartbeat: ping every 30 seconds so idle
// connections are not dropped by intermediaries.
const HEARTBEAT_INTERVAL_MS = 30000;
setInterval(() => {
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(JSON.stringify({ type: 'ping' }));
}, HEARTBEAT_INTERVAL_MS);

Error Handling

// Log transport-level failures.
ws.onerror = (err) => {
  console.error('WebSocket error:', err);
  // Implement retry logic or notify user
};

// Close code 1000 is a normal closure; anything else is unexpected and a
// candidate for automatic reconnection.
ws.onclose = (event) => {
  if (event.code === 1000) return;
  console.error('Unexpected closure:', event.code, event.reason);
  // Attempt reconnection
};

Audio Processing

Chunk Size Optimization
  • Send audio chunks of 100-200ms for optimal latency
  • Too small: Increased overhead
  • Too large: Increased latency
Buffer Management
/**
 * Accumulates 16-bit PCM samples and ships them to the server in
 * fixed-size chunks (100 ms at 16 kHz) over the shared `ws` socket.
 * NOTE(review): this name shadows the Web Audio API's built-in
 * AudioBuffer; consider renaming (kept here for doc compatibility).
 */
class AudioBuffer {
  constructor() {
    this.buffer = [];
    this.chunkSize = 1600; // samples per chunk: 100ms at 16kHz
  }

  /**
   * Appends samples and sends every complete chunk.
   * Uses an explicit loop instead of push(...spread) so very large input
   * arrays cannot overflow the call stack.
   */
  addAudio(audioData) {
    for (const sample of audioData) {
      this.buffer.push(sample);
    }

    while (this.buffer.length >= this.chunkSize) {
      const chunk = this.buffer.splice(0, this.chunkSize);
      this.sendChunk(chunk);
    }
  }

  /** Base64-encodes a chunk of Int16 samples and sends it. */
  sendChunk(chunk) {
    // BUG FIX: the original called String.fromCharCode on raw sample
    // values; btoa() throws for code units > 255, so any 16-bit sample
    // outside 0-255 broke encoding. Pack samples into their
    // little-endian bytes first, then base64-encode the byte string.
    const bytes = new Uint8Array(Int16Array.from(chunk).buffer);
    let binary = '';
    for (let i = 0; i < bytes.length; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    ws.send(JSON.stringify({
      type: 'audio',
      data: btoa(binary)
    }));
  }

  /** Sends whatever remains as a final, possibly short, chunk. */
  flush() {
    if (this.buffer.length > 0) {
      this.sendChunk(this.buffer);
      this.buffer = [];
    }
  }
}

Performance Optimization

Batch Text Messages
// For TTS, batch multiple text segments: pace them 100 ms apart and
// flag only the last segment as final.
const texts = ['Hello', 'world', 'this', 'is', 'a', 'test'];
const lastIndex = texts.length - 1;
texts.forEach((segment, i) => {
  setTimeout(() => client.synthesize(segment, i === lastIndex), i * 100);
});
Throttle Audio Sending
/**
 * Paces outbound audio messages: chunks are queued and dispatched one
 * per `interval` milliseconds so the socket is never flooded with
 * back-to-back sends.
 */
class ThrottledAudioSender {
  constructor(ws, interval = 100) {
    this.ws = ws;
    this.interval = interval;
    this.queue = [];
    this.timer = null;
  }

  /** Queues a chunk and starts the drain loop if it is not running. */
  send(audioData) {
    this.queue.push(audioData);

    if (this.timer !== null) {
      return; // drain loop already active
    }

    this.timer = setInterval(() => {
      if (this.queue.length === 0) {
        // Nothing left to send; stop ticking until the next send().
        clearInterval(this.timer);
        this.timer = null;
        return;
      }
      const chunk = this.queue.shift();
      this.ws.send(JSON.stringify({
        type: 'audio',
        data: chunk
      }));
    }, this.interval);
  }
}

Common Patterns

Bidirectional Voice Conversation

// Combine TTS and STT for voice conversation
/**
 * Full-duplex voice loop: transcribes the microphone with STT and
 * answers each final transcript through the TTS channel.
 */
class VoiceConversation {
  constructor(apiKey) {
    this.ttsClient = new TTSWebSocketClient(apiKey);
    this.sttClient = new STTWebSocketClient(apiKey);
  }

  /** Connects both channels, wires STT results to TTS replies, starts the mic. */
  async start() {
    await Promise.all([
      this.ttsClient.connect(),
      this.sttClient.connect()
    ]);

    // Speak each recognized utterance back through the TTS channel.
    this.sttClient.onFinal = (result) => {
      const reply = this.processUserInput(result.text);
      this.ttsClient.synthesize(reply, true);
    };

    this.ttsClient.initialize({ voice_id: 'vl-xiaoxiao' });
    this.sttClient.initialize({ format: 'pcm', sample_rate: 16000 });

    await this.sttClient.startRecording();
  }

  /** Placeholder dialogue policy: echo the user's words back. */
  processUserInput(text) {
    return `You said: ${text}`;
  }
}

Real-time Captioning

/**
 * Streams microphone audio to STT and mirrors the transcript into a DOM
 * element: interim hypotheses carry the "interim" CSS class, final
 * results replace them and are persisted via saveCaption().
 */
class LiveCaptioning {
  constructor(apiKey, captionElement) {
    this.sttClient = new STTWebSocketClient(apiKey);
    this.captionElement = captionElement;
  }

  /** Connects, configures recognition, and begins live captioning. */
  async start() {
    await this.sttClient.connect();
    this.sttClient.initialize({
      format: 'pcm',
      sample_rate: 16000,
      enable_timestamps: true
    });

    this.sttClient.onPartial = (text) => this.showInterim(text);
    this.sttClient.onFinal = (result) => this.showFinal(result);

    await this.sttClient.startRecording();
  }

  /** Renders a provisional (still-changing) caption. */
  showInterim(text) {
    this.captionElement.textContent = text;
    this.captionElement.classList.add('interim');
  }

  /** Renders a finished caption and saves it with timing information. */
  showFinal(result) {
    this.captionElement.textContent = result.text;
    this.captionElement.classList.remove('interim');
    // Save final caption with timestamp
    this.saveCaption(result);
  }

  /** Logs a finished caption with its start/end offsets. */
  saveCaption(result) {
    // Save caption with timing information
    console.log(`[${result.start_time_ms}-${result.end_time_ms}ms] ${result.text}`);
  }
}

Troubleshooting

Connection Issues

Problem: Connection fails immediately
  • Solution: Verify API key is correct and has proper permissions
  • Solution: Check network connectivity and firewall settings
Problem: Connection drops frequently
  • Solution: Implement reconnection logic with exponential backoff
  • Solution: Check for network instability or proxy issues

Audio Issues

Problem: No audio received (TTS)
  • Solution: Verify initialization message was sent and ready message received
  • Solution: Check that text messages are being sent correctly
Problem: Recognition not working (STT)
  • Solution: Verify audio format and sample rate match initialization
  • Solution: Check that audio chunks are being sent continuously
  • Solution: Ensure audio quality is sufficient (no excessive noise)

Performance Issues

Problem: High latency
  • Solution: Reduce audio chunk size for faster processing
  • Solution: Use appropriate sample rates (16kHz is usually sufficient)
  • Solution: Optimize network connection (use closer server if available)
Problem: High memory usage
  • Solution: Process and discard audio chunks after sending
  • Solution: Limit buffer sizes for audio data
  • Solution: Close unused connections promptly