Speech-to-Text Guide

Overview

The Speech-to-Text (STT) API transcribes audio into text. VoxNexus exposes it through a REST API for complete audio files and a WebSocket API for real-time streaming, with optional features such as word-level timestamps and speaker diarization.

REST API

The REST API endpoint /v1/stt processes complete audio files and returns full transcription results.

Basic Usage

curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Query Parameters

language

string

Language code (e.g., zh-CN, en-US). Optional but recommended for better recognition accuracy. If not provided, the service will auto-detect the language.

sample_rate

integer

required

Sample rate of the audio in Hz. Supported values: 8000, 16000, 22050, 44100, 48000. Common values: 8000 for telephony, 16000 for general speech, 44100 for high-quality audio.

enable_timestamps

boolean

Whether to return word-level timestamps. Default: false.

enable_speaker_diarization

boolean

Whether to enable speaker diarization (identify different speakers). Default: false.
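
The query parameters above can also be assembled programmatically. The following is a minimal sketch, not part of the VoxNexus API: the helper name buildSttUrl and its option names are hypothetical, and the list of sample rates is copied from the sample_rate description.

```javascript
// Hypothetical helper (not provided by VoxNexus): builds a /v1/stt URL
// from the query parameters documented above.
const SUPPORTED_SAMPLE_RATES = [8000, 16000, 22050, 44100, 48000];

function buildSttUrl({
  sampleRate,
  language,
  enableTimestamps = false,
  enableSpeakerDiarization = false,
}) {
  // sample_rate is required, so reject missing or unsupported values early
  if (!SUPPORTED_SAMPLE_RATES.includes(sampleRate)) {
    throw new RangeError(`Unsupported sample_rate: ${sampleRate}`);
  }
  const url = new URL('https://api.voxnexus.ai/v1/stt');
  url.searchParams.set('sample_rate', String(sampleRate));
  // Optional parameters are only added when set, so defaults stay server-side
  if (language) url.searchParams.set('language', language);
  if (enableTimestamps) url.searchParams.set('enable_timestamps', 'true');
  if (enableSpeakerDiarization) {
    url.searchParams.set('enable_speaker_diarization', 'true');
  }
  return url.toString();
}
```

Validating sample_rate on the client avoids a round trip for a request that omits the one required parameter.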

Request Body

The request body should contain the audio file in one of the supported formats:

audio/wav
audio/mpeg (MP3)
audio/pcm
application/octet-stream
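
When uploading files of mixed types, the Content-Type header has to match the body. The sketch below picks one of the content types listed above from a file name. The helper name contentTypeForFile and the extension mapping are assumptions for illustration; the guide does not define how extensions map to content types.

```javascript
// Hypothetical mapping from file extension to the content types listed above.
const CONTENT_TYPES = new Map([
  ['wav', 'audio/wav'],
  ['mp3', 'audio/mpeg'],
  ['pcm', 'audio/pcm'],
]);

function contentTypeForFile(filename) {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
  // Fall back to the generic binary type for unknown or missing extensions
  return CONTENT_TYPES.get(ext) ?? 'application/octet-stream';
}
```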

Response

{
  "request_id": "req_1234567890",
  "language": "en",
  "text": "Hello, this is a test message.",
  "duration_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "start_time_ms": 0,
      "end_time_ms": 2500
    }
  ],
  "created_at": "2024-01-01T12:00:00Z"
}

request_id

string

Unique identifier for this request.

language

string

Detected or specified language code (e.g., en, en-US, zh). May not be present if language detection is not enabled.

text

string

Complete transcribed text.

duration_ms

integer

Audio duration in milliseconds.

words

array

Word-level information array. Only present if enable_timestamps is true. Each item contains:

word: The recognized word
start_time_ms: Start time in milliseconds
end_time_ms: End time in milliseconds

speakers

array

Speaker information array. Only present if enable_speaker_diarization is true. Each item contains:

speaker_id: Unique speaker identifier
text: Text spoken by this speaker
start_time_ms: Start time in milliseconds
end_time_ms: End time in milliseconds

created_at

string

Timestamp when the transcription was created (ISO 8601 format).
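
Because words and speakers are only present when the matching option is enabled, client code should treat both as optional. A minimal sketch of that, using hypothetical helper names (formatMs, summarizeResult) and only the response fields described above:

```javascript
// Formats a millisecond offset as m:ss.mmm
function formatMs(ms) {
  const minutes = Math.floor(ms / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${minutes}:${String(seconds).padStart(2, '0')}.${String(millis).padStart(3, '0')}`;
}

// Hypothetical helper: turns a response object into printable lines
function summarizeResult(result) {
  const lines = [
    `[${formatMs(0)} - ${formatMs(result.duration_ms)}] ${result.text}`,
  ];
  // speakers is only present when enable_speaker_diarization=true
  for (const s of result.speakers ?? []) {
    lines.push(
      `${s.speaker_id} [${formatMs(s.start_time_ms)} - ${formatMs(s.end_time_ms)}]: ${s.text}`
    );
  }
  // words is only present when enable_timestamps=true
  lines.push(`words with timestamps: ${(result.words ?? []).length}`);
  return lines;
}
```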

Response Headers

X-Request-ID: Request identifier
X-Language: Detected language code
X-Duration-Ms: Audio duration in milliseconds
X-RateLimit-Remaining: Remaining requests
X-Quota-Used: Credits consumed
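
These headers can be read from a fetch response through its headers object. The sketch below is illustrative: readSttHeaders is a hypothetical name, and it assumes X-Duration-Ms and X-RateLimit-Remaining carry integer values. X-Quota-Used is returned as a raw string because its format is not specified here.

```javascript
// Hypothetical helper: collects the documented response headers into an object.
// `headers` is a Fetch API Headers instance, e.g. response.headers.
function readSttHeaders(headers) {
  const toInt = (name) => {
    const raw = headers.get(name);
    if (raw === null) return null;
    const n = Number.parseInt(raw, 10);
    return Number.isNaN(n) ? null : n;
  };
  return {
    requestId: headers.get('X-Request-ID'),
    language: headers.get('X-Language'),
    durationMs: toInt('X-Duration-Ms'),                 // assumed integer
    rateLimitRemaining: toInt('X-RateLimit-Remaining'), // assumed integer
    quotaUsed: headers.get('X-Quota-Used'),             // format unspecified
  };
}
```

Typical usage is readSttHeaders(response.headers) right after the fetch call resolves. Headers that are missing come back as null.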

WebSocket API

The WebSocket API provides real-time speech recognition, ideal for live transcription scenarios.

Connection

Connect to wss://api.voxnexus.ai/v1/stt/realtime with authentication:

// Connect with token as query parameter (recommended)
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime?token=YOUR_API_KEY');

Message Flow

1. Initialize: Send an init message with recognition parameters
2. Send Audio: Continuously send audio messages with Base64-encoded audio chunks
3. Receive Results: Receive partial (interim) and final recognition results
4. Handle Errors: Monitor for error messages

Initialization Message

{
  "type": "init",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000,
  "enable_timestamps": true,
  "enable_language_detection": false
}

Audio Message

{
  "type": "audio",
  "data": "base64-encoded-audio-chunk"
}
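
The data field holds the audio chunk as Base64 text. A minimal sketch of producing such a message from floating-point samples follows. It assumes the service expects 16-bit little-endian mono PCM when format is pcm, which this guide does not state explicitly; floatToPcm16Base64 and audioMessage are hypothetical helper names.

```javascript
// Converts Float32 samples in [-1, 1] to 16-bit little-endian PCM, then Base64.
// Assumption: the service expects 16-bit little-endian PCM for format "pcm".
function floatToPcm16Base64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to valid range
    const scaled = s < 0 ? s * 32768 : s * 32767;
    view.setInt16(i * 2, Math.round(scaled), true);  // true = little-endian
  }
  // Build the binary string in slices to stay under argument-count limits
  let binary = '';
  const STEP = 0x8000;
  for (let i = 0; i < bytes.length; i += STEP) {
    binary += String.fromCharCode(...bytes.subarray(i, i + STEP));
  }
  return btoa(binary);
}

// Wraps an encoded chunk in the audio message shape shown above
function audioMessage(samples) {
  return JSON.stringify({ type: 'audio', data: floatToPcm16Base64(samples) });
}
```

Writing samples through a DataView with an explicit little-endian flag keeps the byte order independent of the platform, unlike reading the raw buffer of an Int16Array.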

Partial Result Message

{
  "type": "partial",
  "text": "Hello, this is"
}

Final Result Message

{
  "type": "final",
  "text": "Hello, this is a test message.",
  "language": "en",
  "start_time_ms": 0,
  "end_time_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500
    }
  ]
}
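
Partial and final messages can be combined into a running transcript. The sketch below assumes that each partial message replaces the previous partial for the current utterance and that a final message supersedes it; the guide does not spell out these semantics, and TranscriptAccumulator is a hypothetical name.

```javascript
// Hypothetical accumulator: keeps confirmed (final) text plus the latest
// interim (partial) text for display.
class TranscriptAccumulator {
  constructor() {
    this.finals = [];  // text of every final message, in arrival order
    this.partial = ''; // most recent interim text, cleared on final
  }

  // Feed a parsed server message; returns the text to display
  handle(message) {
    if (message.type === 'partial') {
      this.partial = message.text;
    } else if (message.type === 'final') {
      this.finals.push(message.text);
      this.partial = '';
    }
    return this.text();
  }

  text() {
    return [...this.finals, this.partial].filter(Boolean).join(' ');
  }
}
```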

Complete Example

// Connect with token as query parameter
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime?token=YOUR_API_KEY');

let audioContext;

ws.onopen = () => {
  // Initialize recognition
  ws.send(JSON.stringify({
    type: 'init',
    format: 'pcm',
    sample_rate: 16000,
    enable_timestamps: true,
    enable_language_detection: false
  }));
  
  // Start audio capture
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      audioContext = new AudioContext({ sampleRate: 16000 });
      const source = audioContext.createMediaStreamSource(stream);
      const processor = audioContext.createScriptProcessor(4096, 1, 1);
      
      processor.onaudioprocess = (e) => {
        const audioData = e.inputBuffer.getChannelData(0);
        const pcm16 = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          pcm16[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }
        
        const base64 = btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer)));
        ws.send(JSON.stringify({
          type: 'audio',
          data: base64
        }));
      };
      
      source.connect(processor);
      processor.connect(audioContext.destination);
    });
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  switch (message.type) {
    case 'ready':
      console.log('Ready:', message.request_id);
      break;
      
    case 'partial':
      // Update UI with interim results
      console.log('Partial:', message.text);
      break;
      
    case 'final':
      // Handle final transcription
      console.log('Final:', message.text);
      if (message.language) {
        console.log('Detected language:', message.language);
      }
      break;
      
    case 'error':
      console.error('Error:', message.error);
      break;
  }
};

Best Practices

Audio Format Selection

PCM: Best for real-time WebSocket streaming. Raw samples carry no header, so the sample rate you declare must match the audio exactly
WAV: Good for the REST API, includes format headers
MP3: Compressed format, good for file uploads where size matters, must be decoded before recognition

Sample Rate Guidelines

8kHz: Telephony quality, sufficient for phone recordings
16kHz: Standard quality, good balance of quality and file size
22.05kHz: Radio quality
44.1kHz/48kHz: High-quality audio, use for professional recordings
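
The trade-off between sample rate and data volume is simple arithmetic for uncompressed audio. The sketch below assumes 16-bit mono PCM (2 bytes per sample, 1 channel); the function names are illustrative.

```javascript
// Bytes per second of uncompressed PCM. Defaults assume 16-bit mono.
function pcmBytesPerSecond(sampleRate, channels = 1, bytesPerSample = 2) {
  return sampleRate * channels * bytesPerSample;
}

// Duration in milliseconds covered by one chunk of samples
function chunkDurationMs(samplesPerChunk, sampleRate) {
  return (samplesPerChunk * 1000) / sampleRate;
}

console.log(pcmBytesPerSecond(8000));      // 16000 bytes/s
console.log(pcmBytesPerSecond(16000));     // 32000 bytes/s
console.log(pcmBytesPerSecond(44100));     // 88200 bytes/s
console.log(chunkDurationMs(4096, 16000)); // 256 ms per 4096-sample chunk
```

Under these assumptions, 44.1kHz audio carries about 2.8 times the data of 16kHz audio for the same duration.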

Language Specification

Always specify the language when known:

# Better accuracy with language specified
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Timestamps

Enable timestamps for word-level timing information:

const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'audio/wav'
  },
  body: audioFile
});

const result = await response.json();
// Use word-level timestamps for subtitles or annotations
result.words.forEach(word => {
  console.log(`${word.word} (${word.start_time_ms}-${word.end_time_ms}ms)`);
});
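
For subtitles, individual words usually need to be grouped into short cues. The following is one possible grouping sketch over the words array. The function name and the thresholds (7 words per cue, a new cue after a pause longer than 800 ms) are arbitrary choices for illustration, not service recommendations.

```javascript
// Hypothetical helper: groups word-level timestamps into subtitle cues.
// A new cue starts when the current one is full or after a long pause.
function groupWordsIntoCues(words, maxWords = 7, maxGapMs = 800) {
  const cues = [];
  let current = null;
  for (const w of words) {
    const gap = current ? w.start_time_ms - current.end_time_ms : 0;
    if (!current || current.words.length >= maxWords || gap > maxGapMs) {
      current = { start_time_ms: w.start_time_ms, end_time_ms: w.end_time_ms, words: [] };
      cues.push(current);
    }
    current.words.push(w.word);
    current.end_time_ms = w.end_time_ms; // cue ends with its last word
  }
  return cues.map((c) => ({
    start_time_ms: c.start_time_ms,
    end_time_ms: c.end_time_ms,
    text: c.words.join(' '),
  }));
}
```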

Speaker Diarization

Use speaker diarization for multi-speaker scenarios:

curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_speaker_diarization=true" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @meeting.wav

Language Detection

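Enable automatic language detection for multilingual audio by setting enable_language_detection to true, as in this WebSocket init message, which omits language: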

{
  "type": "init",
  "format": "pcm",
  "sample_rate": 16000,
  "enable_language_detection": true
}

Error Handling

Implement robust error handling:

async function transcribeAudio(audioFile) {
  try {
    const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    });
    
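    if (!response.ok) {
      // The error body may not be JSON (e.g., a proxy error page), so parse defensively
      const error = await response.json().catch(() => ({}));
      throw new Error(error.error || `HTTP ${response.status}`);
    }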
    
    const result = await response.json();
    return result.text;
  } catch (error) {
    console.error('STT Error:', error);
    throw error;
  }
}

Common Use Cases

Meeting Transcription

Transcribe meeting recordings with speaker identification:

async function transcribeMeeting(audioFile) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?sample_rate=44100&enable_speaker_diarization=true&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );
  
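  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  const result = await response.json();
  
  // Format as meeting transcript (speaker_id already reads like "speaker_1")
  result.speakers.forEach(speaker => {
    console.log(`${speaker.speaker_id}: ${speaker.text}`);
  });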
  
  return result;
}

Live Captioning

Use WebSocket API for real-time captioning:

// Stream audio and display captions in real-time
ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  if (message.type === 'partial') {
    // Update caption display with interim results
    updateCaptionDisplay(message.text, true);
  } else if (message.type === 'final') {
    // Finalize caption
    updateCaptionDisplay(message.text, false);
  }
};

Voice Commands

Process voice commands with real-time recognition:

ws.send(JSON.stringify({
  type: 'init',
  format: 'pcm',
  sample_rate: 16000,
  enable_language_detection: false
}));

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  if (message.type === 'final') {
    const text = message.text.toLowerCase();
    // Check 'deactivate' first: it contains 'activate' as a substring
    if (text.includes('deactivate')) {
      handleDeactivateCommand();
    } else if (text.includes('activate')) {
      handleActivateCommand();
    }
  }
};

Audio Content Indexing

Index audio content for search:

async function indexAudioContent(audioFile, metadata) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );
  
  const result = await response.json();
  
  // Store transcription with timestamps for search
  await storeIndexedContent({
    ...metadata,
    transcription: result.text,
    words: result.words,
    duration: result.duration_ms
  });
  
  return result;
}

Rate Limits and Quotas

Monitor response headers for rate limit information
Implement retry logic with exponential backoff for 429 responses
Consider WebSocket API for continuous streaming scenarios
Batch process large audio files during off-peak hours
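
The retry advice above can be sketched as follows. This is a minimal illustration, not an official client: the function names, the 500 ms base delay, the 8 s cap, the retry count, and the 20% jitter are all assumptions. The fetchImpl and sleep options exist only so the logic can be exercised without network access.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: retries only on HTTP 429, up to maxRetries times
async function fetchWithRetry(url, options, {
  maxRetries = 4,
  fetchImpl = fetch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetchImpl(url, options);
    if (response.status !== 429 || attempt >= maxRetries) {
      return response; // success, a non-429 error, or retries exhausted
    }
    const delay = backoffDelayMs(attempt);
    // Up to 20% random jitter so parallel clients do not retry in lockstep
    await sleep(delay + Math.random() * delay * 0.2);
  }
}
```

The request body must be reusable across attempts, for example a Blob or ArrayBuffer rather than a one-shot stream.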
