
Overview

The Speech-to-Text (STT) API converts audio files into text transcriptions. VoxNexus supports both REST API and WebSocket API for STT operations, with features like timestamps, confidence scores, and speaker diarization.

REST API

The REST API endpoint /v1/stt processes complete audio files and returns full transcription results.

Basic Usage

curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Query Parameters

language
string
Language code (e.g., zh-CN, en-US). Optional but recommended for better recognition accuracy. If not provided, the service will auto-detect the language.
sample_rate
integer
required
Sample rate in Hz. Supported values: 8000, 16000, 22050, 44100, 48000. Common choices: 16000 for telephony, 44100 for high-quality audio.
enable_timestamps
boolean
Whether to return word-level timestamps. Default: false.
enable_confidence
boolean
Whether to return confidence scores for recognition results. Default: false.
enable_speaker_diarization
boolean
Whether to enable speaker diarization (identify different speakers). Default: false.

Request Body

The request body should contain the audio file in one of the supported formats:
  • audio/wav
  • audio/mpeg (MP3)
  • audio/pcm
  • application/octet-stream

Response

{
  "request_id": "req_1234567890",
  "language": "zh-CN",
  "text": "Hello, this is a test message.",
  "confidence": 0.95,
  "duration_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "start_time_ms": 0,
      "end_time_ms": 2500
    }
  ],
  "created_at": "2024-01-01T12:00:00Z"
}
request_id
string
Unique identifier for this request.
language
string
Detected or specified language code.
text
string
Complete transcribed text.
confidence
number
Overall confidence score (0.0-1.0). Only present if enable_confidence is true.
duration_ms
integer
Audio duration in milliseconds.
words
array
Word-level information array. Only present if enable_timestamps is true. Each item contains:
  • word: The recognized word
  • start_time_ms: Start time in milliseconds
  • end_time_ms: End time in milliseconds
  • confidence: Confidence score (if enabled)
speakers
array
Speaker information array. Only present if enable_speaker_diarization is true. Each item contains:
  • speaker_id: Unique speaker identifier
  • text: Text spoken by this speaker
  • start_time_ms: Start time in milliseconds
  • end_time_ms: End time in milliseconds
created_at
string
Timestamp when the transcription was created (ISO 8601 format).

Response Headers

  • X-Request-ID: Request identifier
  • X-Language: Detected language code
  • X-Duration-Ms: Audio duration in milliseconds
  • X-Confidence: Overall confidence score (if enabled)
  • X-RateLimit-Remaining: Remaining requests
  • X-Quota-Used: Credits consumed
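These headers can be read off a fetch Response in a few lines. A minimal sketch; `readSttHeaders` is an illustrative helper, not part of any VoxNexus SDK.

```javascript
// Sketch: pull the documented response headers out of a fetch Response's
// Headers object. Numeric headers are parsed into numbers.
function readSttHeaders(headers) {
  return {
    requestId: headers.get('X-Request-ID'),
    language: headers.get('X-Language'),
    durationMs: Number(headers.get('X-Duration-Ms')),
    rateLimitRemaining: Number(headers.get('X-RateLimit-Remaining')),
    quotaUsed: Number(headers.get('X-Quota-Used'))
  };
}
```

Track `rateLimitRemaining` after each call to throttle clients before they hit a 429.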

WebSocket API

The WebSocket API provides real-time speech recognition, ideal for live transcription scenarios.

Connection

Connect to wss://api.voxnexus.ai/v1/stt/realtime with authentication:
// Note: the browser WebSocket constructor cannot set custom headers; the
// options form below works with Node.js WebSocket libraries such as `ws`.
// In a browser, pass the token by whatever mechanism the service supports.
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime', {
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});

Message Flow

  1. Initialize: Send an init message with recognition parameters
  2. Send Audio: Continuously send audio messages with Base64-encoded audio chunks
  3. Receive Results: Receive partial (interim) and final recognition results
  4. Handle Errors: Monitor for error messages

Initialization Message

{
  "type": "init",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000,
  "enable_timestamps": true,
  "enable_confidence": true,
  "enable_speaker_diarization": false,
  "keywords": ["keyword1", "keyword2"],
  "custom_vocabulary": ["custom_word"]
}

Audio Message

{
  "type": "audio",
  "data": "base64-encoded-audio-chunk"
}
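In Node.js, a raw PCM chunk (a Buffer or Uint8Array) can be wrapped into this message shape as follows. `makeAudioMessage` is an illustrative helper, not part of any SDK.

```javascript
// Sketch: wrap a raw PCM chunk into the audio message format shown above.
function makeAudioMessage(chunk) {
  return JSON.stringify({
    type: 'audio',
    data: Buffer.from(chunk).toString('base64')
  });
}
```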

Partial Result Message

{
  "type": "partial",
  "text": "Hello, this is"
}

Final Result Message

{
  "type": "final",
  "text": "Hello, this is a test message.",
  "confidence": 0.95,
  "start_time_ms": 0,
  "end_time_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500,
      "confidence": 0.98
    }
  ]
}

Complete Example

// Note: this example mixes environments for brevity. The headers option
// works with Node.js WebSocket libraries such as `ws`, while getUserMedia
// and AudioContext below are browser APIs; in a browser, authenticate by
// whatever mechanism the service supports instead of custom headers.
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime', {
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});

let audioContext;
let mediaRecorder;

ws.onopen = () => {
  // Initialize recognition
  ws.send(JSON.stringify({
    type: 'init',
    format: 'pcm',
    sample_rate: 16000,
    enable_timestamps: true,
    enable_confidence: true
  }));
  
  // Start audio capture
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      audioContext = new AudioContext({ sampleRate: 16000 });
      const source = audioContext.createMediaStreamSource(stream);
      // Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
      // but is kept here for brevity
      const processor = audioContext.createScriptProcessor(4096, 1, 1);
      
      processor.onaudioprocess = (e) => {
        const audioData = e.inputBuffer.getChannelData(0);
        // Convert 32-bit float samples (-1..1) to 16-bit signed PCM
        const pcm16 = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          pcm16[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }
        
        // Base64-encode the chunk; a loop avoids the call-stack limit that
        // spreading a large array into String.fromCharCode can hit
        const bytes = new Uint8Array(pcm16.buffer);
        let binary = '';
        for (let i = 0; i < bytes.length; i++) {
          binary += String.fromCharCode(bytes[i]);
        }
        ws.send(JSON.stringify({
          type: 'audio',
          data: btoa(binary)
        }));
      };
      
      source.connect(processor);
      processor.connect(audioContext.destination);
    });
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  switch (message.type) {
    case 'ready':
      console.log('Ready:', message.request_id);
      break;
      
    case 'partial':
      // Update UI with interim results
      console.log('Partial:', message.text);
      break;
      
    case 'final':
      // Handle final transcription
      console.log('Final:', message.text);
      console.log('Confidence:', message.confidence);
      break;
      
    case 'error':
      console.error('Error:', message.error);
      break;
  }
};

Best Practices

Audio Format Selection

  • PCM: Best for real-time WebSocket streaming, requires exact sample rate specification
  • WAV: Good for REST API, includes format headers
  • MP3: Compressed format, good for file uploads, requires decoding

Sample Rate Guidelines

  • 8kHz: Telephony quality, sufficient for phone recordings
  • 16kHz: Standard quality, good balance of quality and file size
  • 22.05kHz: Radio quality
  • 44.1kHz/48kHz: High-quality audio, use for professional recordings

Language Specification

Always specify the language when known:
# Better accuracy with language specified
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Timestamps and Confidence

Enable timestamps and confidence when you need word-level timing and quality metrics:
const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true&enable_confidence=true', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'audio/wav'
  },
  body: audioFile
});

const result = await response.json();
// Use word-level timestamps for subtitles or annotations
result.words.forEach(word => {
  console.log(`${word.word} (${word.start_time_ms}-${word.end_time_ms}ms)`);
});
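Word timestamps map naturally onto subtitle formats. A minimal sketch of an SRT-style timecode formatter you might use when turning `result.words` into subtitle cues (the helper is hypothetical, not part of the API):

```javascript
// Sketch: format a millisecond offset as an SRT timecode (HH:MM:SS,mmm).
function msToSrtTime(ms) {
  const pad = (n, w) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}
```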

Speaker Diarization

Use speaker diarization for multi-speaker scenarios:
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_speaker_diarization=true" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @meeting.wav

Keywords and Custom Vocabulary

Improve recognition accuracy for domain-specific terms:
{
  "type": "init",
  "format": "pcm",
  "sample_rate": 16000,
  "keywords": ["VoxNexus", "API", "WebSocket"],
  "custom_vocabulary": ["technical_term_1", "technical_term_2"]
}

Error Handling

Implement robust error handling:
async function transcribeAudio(audioFile) {
  try {
    const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    });
    
    if (!response.ok) {
      const error = await response.json();
      throw new Error(error.error || `HTTP ${response.status}`);
    }
    
    const result = await response.json();
    return result.text;
  } catch (error) {
    console.error('STT Error:', error);
    throw error;
  }
}

Common Use Cases

Meeting Transcription

Transcribe meeting recordings with speaker identification:
async function transcribeMeeting(audioFile) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?sample_rate=44100&enable_speaker_diarization=true&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );
  
  const result = await response.json();
  
  // Format as meeting transcript
  result.speakers.forEach(speaker => {
    console.log(`Speaker ${speaker.speaker_id}: ${speaker.text}`);
  });
  
  return result;
}

Live Captioning

Use WebSocket API for real-time captioning:
// Stream audio and display captions in real-time
ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  if (message.type === 'partial') {
    // Update caption display with interim results
    updateCaptionDisplay(message.text, true);
  } else if (message.type === 'final') {
    // Finalize caption
    updateCaptionDisplay(message.text, false);
  }
};

Voice Commands

Process voice commands with keyword detection:
ws.send(JSON.stringify({
  type: 'init',
  format: 'pcm',
  sample_rate: 16000,
  keywords: ['activate', 'deactivate', 'start', 'stop']
}));

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  if (message.type === 'final') {
    const text = message.text.toLowerCase();
    if (text.includes('activate')) {
      handleActivateCommand();
    } else if (text.includes('deactivate')) {
      handleDeactivateCommand();
    }
  }
};

Audio Content Indexing

Index audio content for search:
async function indexAudioContent(audioFile, metadata) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );
  
  const result = await response.json();
  
  // Store transcription with timestamps for search
  await storeIndexedContent({
    ...metadata,
    transcription: result.text,
    words: result.words,
    duration: result.duration_ms
  });
  
  return result;
}

Rate Limits and Quotas

  • Monitor response headers for rate limit information
  • Implement retry logic with exponential backoff for 429 responses
  • Consider WebSocket API for continuous streaming scenarios
  • Batch process large audio files during off-peak hours
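The backoff suggestion above can be sketched as a small wrapper. `doRequest` stands in for any async function that performs the request and returns a Response-like object; the function name and defaults are illustrative.

```javascript
// Sketch: retry a request with exponential backoff when it returns HTTP 429.
async function withBackoff(doRequest, maxRetries = 3, baseDelayMs = 500) {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 || attempt >= maxRetries) return res;
    // Wait baseDelayMs, 2x, 4x, ... between attempts
    await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
}
```

For example: `withBackoff(() => fetch(url, options))` retries up to three times before surfacing the rate-limited response to the caller.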