Overview

The Speech-to-Text (STT) API converts audio into text. VoxNexus offers a REST API for transcribing complete audio files and a WebSocket API for real-time streaming, with optional features such as word-level timestamps, speaker diarization, and LLM post-processing of the transcript.

REST API

The REST API endpoint /v1/stt processes complete audio files and returns full transcription results.

Basic Usage

curl -X POST "https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=16000&language=zh-CN" \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Query Parameters

model_id
string
required
Model identifier. Specifies which model to use for STT. Use the /v1/models endpoint to browse available models.
language
string
Language or locale code. Supports both ISO 639-1 language codes (e.g., en, zh) and BCP 47 locale codes (e.g., en-US, zh-CN). When a language code is provided, the system automatically resolves it to the most common locale (e.g., en → en-US). Default: en-US. Optional but recommended for better recognition accuracy.
sample_rate
integer
required
Sample rate in Hz. Common values: 8000 for telephony, 16000 for general speech, 44100 for high-quality audio.
enable_timestamps
boolean
Whether to return word-level timestamps. Default: false.
enable_speaker_diarization
boolean
Whether to enable speaker diarization (identify different speakers). Default: false.
enable_llm_transform
boolean
Whether to enable LLM post-processing on the STT transcript. Default: false. When enabled, the recognized transcript is passed to the LLM with the specified llm_prompt, and the transformed result is returned in the text field. If LLM transform fails, text falls back to the raw transcript.
llm_prompt
string
LLM transform instruction. Required when enable_llm_transform=true. Describes what the LLM should do with the transcript, e.g. "correct punctuation", "translate to English", "summarize key points".
llm_model_id
string
LLM model ID to use for transform. Optional — falls back to the server-configured default when not specified. Routing is prefix-based: claude-* models route to Anthropic; all other model IDs route to OpenAI. Examples: claude-haiku-4-5-20251001 (Anthropic), gpt-4.1 (OpenAI).
llm_max_tokens
integer
Maximum output tokens for LLM transform. Optional. Range: 1 - 4096.
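As a minimal sketch of how these query parameters fit together, the helper below assembles a request URL and checks the constraints stated above (required model_id and sample_rate, llm_prompt required when enable_llm_transform=true, llm_max_tokens within 1 - 4096). The name buildSttUrl is hypothetical and not part of any VoxNexus SDK; the code assumes only the standard URL global available in browsers and Node.js.

```javascript
// Hypothetical helper (not part of the VoxNexus API): builds a /v1/stt URL
// from the documented query parameters and validates the documented rules.
const STT_ENDPOINT = 'https://api.voxnexus.ai/v1/stt';

function buildSttUrl(params) {
  const { model_id, sample_rate, enable_llm_transform, llm_prompt, llm_max_tokens } = params;

  if (!model_id) throw new Error('model_id is required');
  if (!Number.isInteger(sample_rate)) {
    throw new Error('sample_rate is required and must be an integer');
  }
  if (enable_llm_transform && !llm_prompt) {
    throw new Error('llm_prompt is required when enable_llm_transform=true');
  }
  if (llm_max_tokens !== undefined &&
      (!Number.isInteger(llm_max_tokens) || llm_max_tokens < 1 || llm_max_tokens > 4096)) {
    throw new Error('llm_max_tokens must be an integer between 1 and 4096');
  }

  const url = new URL(STT_ENDPOINT);
  for (const [key, value] of Object.entries(params)) {
    // Skip unset options so the server-side defaults apply
    if (value !== undefined && value !== null) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

console.log(buildSttUrl({ model_id: 'vn-stt-ultra', sample_rate: 16000, language: 'zh-CN' }));
```

Building the query string through URL.searchParams also takes care of encoding free-text values such as llm_prompt.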

Request Body

The request body should contain the audio file in one of the supported formats:
  • audio/wav
  • audio/mpeg (MP3)
  • audio/pcm
  • application/octet-stream

Response

{
  "request_id": "req_1234567890",
  "language": "en",
  "transcript": "Hello this is a test message",
  "text": "Hello, this is a test message.",
  "duration_ms": 2500,
  "words": [
    {
      "word": "hello",
      "offset": 0,
      "duration": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "offset": 0,
      "duration": 2500
    }
  ],
  "created_at": "2024-01-01T12:00:00Z"
}
request_id
string
Unique identifier for this request.
language
string
Detected or specified language code (e.g., en, en-US, zh). May not be present if language detection is not enabled.
transcript
string
Raw STT recognition output (original ASR text before any LLM processing). Always present; identical to text when LLM transform is not enabled.
text
string
Final output text. When LLM transform is enabled and succeeds, this contains the LLM-transformed result; otherwise it equals transcript.
duration_ms
integer
Audio duration in milliseconds.
words
array
Word-level information array. Only present if enable_timestamps is true. Each item contains:
  • word: The recognized word
  • offset: Start time in milliseconds
  • duration: Duration in milliseconds
  • confidence: Confidence score (0.0-1.0)
speakers
array
Speaker information array. Only present if enable_speaker_diarization is true. Each item contains:
  • speaker_id: Unique speaker identifier
  • text: Text spoken by this speaker
  • offset: Start time in milliseconds
  • duration: Duration in milliseconds
created_at
string
Timestamp when the transcription was created (ISO 8601 format).
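The sketch below shows one way to consume these fields, assuming a response shaped like the example above. The function name summarizeSttResult and the output field names are hypothetical; words and speakers are treated as optional because they are only present when the corresponding flags are enabled.

```javascript
// Hypothetical helper: normalizes a /v1/stt response body for downstream use.
function summarizeSttResult(result) {
  // words is only present when enable_timestamps=true
  const words = (result.words ?? []).map((w) => ({
    word: w.word,
    startMs: w.offset,
    endMs: w.offset + w.duration,
    confidence: w.confidence,
  }));

  // speakers is only present when enable_speaker_diarization=true
  const speakerTurns = (result.speakers ?? []).map((s) => ({
    speakerId: s.speaker_id,
    text: s.text,
    startMs: s.offset,
    endMs: s.offset + s.duration,
  }));

  return {
    text: result.text,
    // True when text differs from the raw ASR transcript, which suggests
    // that an LLM transform was applied and changed the output.
    textDiffersFromTranscript: result.text !== result.transcript,
    words,
    speakerTurns,
  };
}

const example = summarizeSttResult({
  transcript: 'Hello this is a test message',
  text: 'Hello, this is a test message.',
  words: [{ word: 'hello', offset: 0, duration: 500, confidence: 0.98 }],
});
console.log(example.textDiffersFromTranscript, example.words[0].endMs);
```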

Response Headers

  • X-Request-ID: Request identifier
  • X-Language: Detected language code
  • X-Duration-Ms: Audio duration in milliseconds
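These headers can be read without parsing the body, which is handy for logging. A minimal sketch, assuming a standard Fetch API Headers object (the helper name readSttHeaders is hypothetical):

```javascript
// Hypothetical helper: extracts the documented STT response headers.
function readSttHeaders(headers) {
  const duration = headers.get('X-Duration-Ms');
  return {
    requestId: headers.get('X-Request-ID'),
    language: headers.get('X-Language'),
    // Header values are strings; convert the duration to a number when present
    durationMs: duration === null ? null : Number.parseInt(duration, 10),
  };
}

// Usage with a fetch response would be: readSttHeaders(response.headers)
const sample = new Headers({ 'X-Request-ID': 'req_1234567890', 'X-Duration-Ms': '2500' });
console.log(readSttHeaders(sample));
```

In browsers, custom headers on cross-origin responses are only readable if the server lists them in Access-Control-Expose-Headers; whether this API does so is not stated here, so treat null values as possible.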

WebSocket API

The WebSocket API provides real-time speech recognition, ideal for live transcription scenarios.

Connection

Connect to wss://api.voxnexus.ai/v1/stt/realtime with authentication:
// Connect with token as query parameter (recommended)
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime?token=YOUR_API_KEY');

Message Flow

  1. Initialize: Send an init message with recognition parameters
  2. Send Audio: Continuously send audio messages with Base64-encoded audio chunks
  3. Receive Results: Receive transcript messages (is_final: false for interim, is_final: true for complete sentences)
  4. Handle Errors: Monitor for error messages

Initialization Message

{
  "type": "init",
  "model_id": "vn-stt-ultra",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000,
  "enable_timestamps": true
}
type
string
required
Message type. Must be init.
model_id
string
required
Model identifier. Specifies which model to use for STT. Use the /v1/models endpoint to browse available models.
language
string
Language or locale code. Supports both ISO 639-1 language codes (e.g., en, zh) and BCP 47 locale codes (e.g., en-US, zh-CN). When a language code is provided, the system automatically resolves it to the most common locale (e.g., en → en-US). Optional but recommended for better accuracy.
format
string
required
Audio format. Only pcm is supported.
sample_rate
integer
required
Sample rate in Hz. Only 16000 is supported.
enable_timestamps
boolean
Whether to return word-level timestamps. Default: false.
enable_llm_transform
boolean
Whether to enable LLM post-processing on each final transcript. Default: false. When enabled, the server sends llm messages alongside transcript messages.
llm_prompt
string
LLM transform instruction. Required when enable_llm_transform=true. E.g. "correct punctuation", "translate to English", "rewrite humorously".
llm_model_id
string
LLM model ID. Optional — falls back to server default when not specified. Routing is prefix-based: claude-* models route to Anthropic; all other model IDs route to OpenAI. Examples: claude-haiku-4-5-20251001 (Anthropic), gpt-4.1 (OpenAI).
llm_max_tokens
integer
Maximum output tokens for LLM transform. Optional. Range: 1 - 4096.
llm_mode
string
LLM processing mode. Default: per_segment.
  • per_segment: LLM runs on each final sentence as it arrives (low latency, real-time).
  • post_flush: LLM runs once on the full accumulated text after flush (full context).
llm_post_flush
boolean
When llm_mode=per_segment, also run a full-text LLM pass after flush. Default: false. Produces both per-sentence llm messages (with segment_id) and a final full-text llm message (without segment_id) after flush completes.
llm_post_flush_prompt
string
Separate prompt for the post-flush full-text pass. Falls back to llm_prompt when not specified.
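To illustrate how the LLM options combine, the sketch below builds an init message that runs a per-sentence transform and an additional full-text pass after flush. The helper buildInitMessage is hypothetical; it only enforces the constraints listed above (pcm format, 16000 Hz, llm_prompt required with enable_llm_transform, and the two llm_mode values).

```javascript
// Hypothetical helper: validates documented constraints and serializes an init message.
function buildInitMessage(options) {
  // Defaults first, caller options next, and type last so it cannot be overridden
  const init = { format: 'pcm', sample_rate: 16000, ...options, type: 'init' };

  if (!init.model_id) throw new Error('model_id is required');
  if (init.format !== 'pcm') throw new Error('only pcm is supported');
  if (init.sample_rate !== 16000) throw new Error('only 16000 Hz is supported');
  if (init.enable_llm_transform && !init.llm_prompt) {
    throw new Error('llm_prompt is required when enable_llm_transform=true');
  }
  if (init.llm_mode !== undefined && !['per_segment', 'post_flush'].includes(init.llm_mode)) {
    throw new Error('llm_mode must be per_segment or post_flush');
  }
  return JSON.stringify(init);
}

// Per-sentence punctuation fixes plus one summary over the full text after flush
const initMessage = buildInitMessage({
  model_id: 'vn-stt-ultra',
  language: 'en-US',
  enable_llm_transform: true,
  llm_prompt: 'correct punctuation',
  llm_mode: 'per_segment',
  llm_post_flush: true,
  llm_post_flush_prompt: 'summarize key points',
});
console.log(initMessage);
```

The resulting string would be passed to ws.send() once the socket is open.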

Audio Message

{
  "type": "audio",
  "data": "base64-encoded-audio-chunk"
}
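A minimal sketch of producing the data field from Web Audio samples follows. It assumes the server expects 16-bit signed little-endian mono PCM, which matches the Int16Array conversion in the complete example but is not stated explicitly in the parameter list. The function names are hypothetical.

```javascript
// Hypothetical helper: converts Float32 samples in [-1, 1] to Base64-encoded 16-bit PCM.
// Assumption: 16-bit signed little-endian mono.
function floatToPcm16Base64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to the valid range
    // Scale asymmetrically so that -1 maps to -32768 and +1 maps to 32767
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }

  // Build the binary string in slices to avoid passing too many arguments at once
  let binary = '';
  const STEP = 0x2000;
  for (let i = 0; i < bytes.length; i += STEP) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + STEP));
  }
  return btoa(binary);
}

function buildAudioMessage(samples) {
  return JSON.stringify({ type: 'audio', data: floatToPcm16Base64(samples) });
}

console.log(buildAudioMessage(new Float32Array([0, 1, -1])));
```

Writing through a DataView with an explicit little-endian flag keeps the byte order independent of the host platform.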

Command Message

{
  "type": "command",
  "command": "flush"
}
The flush command tells the server that no more audio will be sent for the current recognition session. The server finishes recognizing any buffered audio and then responds with a flush_done message. Audio sent after that starts a new recognition session.

Server Messages

Ready Message
{
  "type": "ready",
  "request_id": "req_1234567890",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000
}
Transcript Message
{
  "type": "transcript",
  "request_id": "req_1234567890",
  "segment_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello, this is a test message.",
  "is_final": true,
  "language": "en",
  "confidence": 0.95,
  "offset": 0,
  "duration": 2500,
  "words": [
    {
      "word": "hello",
      "offset": 0,
      "duration": 500,
      "confidence": 0.98
    }
  ]
}
The is_final field distinguishes between partial results (false) and complete sentences (true). The confidence score and word-level information are only valid when is_final is true. segment_id is only present when is_final=true and LLM transform is enabled — it links subsequent llm messages to this segment.
LLM Transform Message
{
  "type": "llm",
  "request_id": "req_1234567890",
  "segment_id": "550e8400-e29b-41d4-a716-446655440000",
  "delta": "Hello,",
  "is_final": false
}
Each final transcript triggers a sequence of llm messages streaming the LLM output incrementally. Concatenate all delta values until is_final=true to get the full transformed text. The segment_id links this output to its corresponding transcript message. For a full-text post-flush pass, segment_id is absent.
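The concatenation rule above can be sketched as a small accumulator keyed by segment_id. The names createLlmAssembler and FULL_TEXT_KEY are hypothetical. Whether the closing is_final=true message itself carries a delta is not specified here, so the sketch accepts both cases.

```javascript
// Hypothetical helper: reassembles streamed llm messages into complete texts.
const FULL_TEXT_KEY = '__post_flush__'; // bucket for messages without segment_id

function createLlmAssembler() {
  const buffers = new Map();
  return {
    // Returns null while a segment is still streaming,
    // or { segmentId, text } once its is_final=true message arrives.
    push(message) {
      if (message.type !== 'llm') return null;
      const key = message.segment_id ?? FULL_TEXT_KEY;
      const text = (buffers.get(key) ?? '') + (message.delta ?? '');
      if (!message.is_final) {
        buffers.set(key, text);
        return null;
      }
      buffers.delete(key);
      return { segmentId: message.segment_id ?? null, text };
    },
  };
}

const assembler = createLlmAssembler();
assembler.push({ type: 'llm', segment_id: 'seg-1', delta: 'Hello,', is_final: false });
const done = assembler.push({ type: 'llm', segment_id: 'seg-1', delta: ' world.', is_final: true });
console.log(done);
```

A segmentId of null in the result marks the full-text post-flush pass.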
Flush Done Message
{
  "type": "flush_done",
  "request_id": "req_1234567890"
}
Error Message
{
  "type": "error",
  "error": "Invalid audio format",
  "code": "UNSUPPORTED_FORMAT",
  "request_id": "req_1234567890"
}

Complete Example

// Connect with token as query parameter
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime?token=YOUR_API_KEY');

let audioContext;

ws.onopen = () => {
  // Initialize recognition
  ws.send(JSON.stringify({
    type: 'init',
    model_id: 'vn-stt-ultra',
    format: 'pcm',
    sample_rate: 16000,
    enable_timestamps: true
  }));
  
  // Start audio capture
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      audioContext = new AudioContext({ sampleRate: 16000 });
      const source = audioContext.createMediaStreamSource(stream);
      const processor = audioContext.createScriptProcessor(4096, 1, 1);
      
      processor.onaudioprocess = (e) => {
        const audioData = e.inputBuffer.getChannelData(0);
        const pcm16 = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          pcm16[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }
        
        const base64 = btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer)));
        ws.send(JSON.stringify({
          type: 'audio',
          data: base64
        }));
      };
      
      source.connect(processor);
      processor.connect(audioContext.destination);
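    })
    .catch((err) => {
      // getUserMedia rejects if the user denies access or no microphone is available
      console.error('Microphone access failed:', err);
    });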
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  switch (message.type) {
    case 'ready':
      console.log('Ready:', message.request_id);
      break;
      
    case 'transcript':
      if (message.is_final) {
        console.log('Final:', message.text);
        if (message.language) {
          console.log('Detected language:', message.language);
        }
        if (message.confidence !== undefined) {
          console.log('Confidence:', message.confidence);
        }
      } else {
        console.log('Partial:', message.text);
      }
      break;
      
    case 'flush_done':
      console.log('Flush done:', message.request_id);
      break;
      
    case 'error':
      console.error('Error:', message.error);
      break;
  }
};

Best Practices

Audio Format Selection

  • PCM: Best for real-time WebSocket streaming, requires exact sample rate specification
  • WAV: Good for REST API, includes format headers
  • MP3: Compressed format, good for file uploads, requires decoding

Sample Rate Guidelines

  • 8kHz: Telephony quality, sufficient for phone recordings
  • 16kHz: Standard quality, good balance of quality and file size
  • 22.05kHz: Radio quality
  • 44.1kHz/48kHz: High-quality audio, use for professional recordings
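The size trade-off can be made concrete with a short calculation. For uncompressed PCM, size in bytes is sample rate × bytes per sample × channels × seconds. The sketch below assumes 16-bit mono audio by default, and the function name pcmByteSize is hypothetical.

```javascript
// Hypothetical helper: size of uncompressed PCM audio in bytes.
// Assumption: 16-bit (2-byte) mono samples unless stated otherwise.
function pcmByteSize(sampleRate, seconds, channels = 1, bytesPerSample = 2) {
  return sampleRate * bytesPerSample * channels * seconds;
}

// One minute of audio at each rate listed above
for (const rate of [8000, 16000, 22050, 44100]) {
  const megabytes = pcmByteSize(rate, 60) / 1e6;
  console.log(`${rate} Hz: ${megabytes.toFixed(2)} MB per minute`);
}
```

Under these assumptions, one minute at 16 kHz is 1.92 MB, while 44.1 kHz is about 5.29 MB.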

Language Specification

Always specify the language when known:
# Better accuracy with language specified
curl -X POST "https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=16000&language=zh-CN" \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

Timestamps

Enable timestamps for word-level timing information:
const response = await fetch('https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=16000&enable_timestamps=true', {
  method: 'POST',
  headers: {
    'X-Api-Key': 'YOUR_API_KEY',
    'Content-Type': 'audio/wav'
  },
  body: audioFile
});

const result = await response.json();
// Use word-level timestamps for subtitles or annotations
result.words.forEach(word => {
  const endTime = word.offset + word.duration;
  console.log(`${word.word} (${word.offset}-${endTime}ms, confidence: ${word.confidence})`);
});
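As one possible subtitle workflow, the sketch below groups the words array into WebVTT cues. The helper names and the fixed wordsPerCue grouping are illustrative choices, not part of the API. Joining words with spaces suits languages such as English; it would need adjusting for languages written without spaces, such as Chinese.

```javascript
// Formats milliseconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatVttTime(ms) {
  const pad = (n, width) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const milli = Math.floor(ms % 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(milli, 3)}`;
}

// Hypothetical helper: turns the words array into a WebVTT document,
// placing a fixed number of words in each cue.
function wordsToVtt(words, wordsPerCue = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const last = group[group.length - 1];
    const start = formatVttTime(group[0].offset);
    const end = formatVttTime(last.offset + last.duration);
    cues.push(`${start} --> ${end}\n${group.map((w) => w.word).join(' ')}`);
  }
  return ['WEBVTT', ...cues].join('\n\n') + '\n';
}

console.log(wordsToVtt([
  { word: 'hello', offset: 0, duration: 500 },
  { word: 'world', offset: 500, duration: 400 },
]));
```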

Speaker Diarization

Use speaker diarization for multi-speaker scenarios:
curl -X POST "https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=16000&enable_speaker_diarization=true" \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @meeting.wav

Error Handling

Implement robust error handling:
async function transcribeAudio(audioFile) {
  try {
    const response = await fetch('https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=16000', {
      method: 'POST',
      headers: {
        'X-Api-Key': 'YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    });
    
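    if (!response.ok) {
      // The error body may not be JSON (e.g. a proxy error page), so parse it defensively
      const error = await response.json().catch(() => ({}));
      throw new Error(error.error || `HTTP ${response.status}`);
    }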
    
    const result = await response.json();
    return result.text;
  } catch (error) {
    console.error('STT Error:', error);
    throw error;
  }
}

Common Use Cases

Meeting Transcription

Transcribe meeting recordings with speaker identification:
async function transcribeMeeting(audioFile) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=44100&enable_speaker_diarization=true&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'X-Api-Key': 'YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );
  
  const result = await response.json();
  
  // Format as meeting transcript
  // speaker_id values already look like "speaker_1", so print them as-is
  result.speakers.forEach(speaker => {
    console.log(`${speaker.speaker_id}: ${speaker.text}`);
  });
  
  return result;
}

Live Captioning

Use WebSocket API for real-time captioning:
// Stream audio and display captions in real-time
ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  if (message.type === 'transcript') {
    if (!message.is_final) {
      // Update caption display with interim results
      updateCaptionDisplay(message.text, true);
    } else {
      // Finalize caption
      updateCaptionDisplay(message.text, false);
    }
  }
};
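The updateCaptionDisplay function above is left to the application. One way to back it is a small state object that keeps the most recent final sentences plus the current interim text. This sketch assumes each interim message carries the full text of the sentence in progress (so it replaces the previous interim text rather than appending to it), which is not stated explicitly here. The name createCaptionState is hypothetical.

```javascript
// Hypothetical caption buffer: keeps the last few final sentences and the current interim text.
function createCaptionState(maxLines = 2) {
  const finalized = [];
  let interim = '';

  // Finalized lines first, then the in-progress line if there is one
  const render = () => [...finalized, interim].filter(Boolean).join('\n');

  const apply = (message) => {
    if (message.type === 'transcript') {
      if (message.is_final) {
        finalized.push(message.text);
        while (finalized.length > maxLines) finalized.shift(); // drop the oldest lines
        interim = '';
      } else {
        // Assumption: an interim result replaces the previous interim text
        interim = message.text;
      }
    }
    return render();
  };

  return { apply, render };
}

const captions = createCaptionState();
captions.apply({ type: 'transcript', is_final: false, text: 'Hel' });
captions.apply({ type: 'transcript', is_final: true, text: 'Hello.' });
console.log(captions.apply({ type: 'transcript', is_final: false, text: 'How are' }));
```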

Voice Commands

Process voice commands with real-time recognition:
ws.send(JSON.stringify({
  type: 'init',
  model_id: 'vn-stt-ultra',
  format: 'pcm',
  sample_rate: 16000,
  enable_timestamps: false
}));

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  if (message.type === 'transcript' && message.is_final) {
    const text = message.text.toLowerCase();
    // Check "deactivate" first: it contains "activate" as a substring
    if (text.includes('deactivate')) {
      handleDeactivateCommand();
    } else if (text.includes('activate')) {
      handleActivateCommand();
    }
  }
};

Audio Content Indexing

Index audio content for search:
async function indexAudioContent(audioFile, metadata) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?model_id=vn-stt-ultra&sample_rate=16000&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'X-Api-Key': 'YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );
  
  const result = await response.json();
  
  // Store transcription with timestamps for search
  await storeIndexedContent({
    ...metadata,
    transcription: result.text,
    words: result.words,
    duration: result.duration_ms
  });
  
  return result;
}

Rate Limits and Quotas

  • Implement retry logic with exponential backoff for 429 responses
  • Consider WebSocket API for continuous streaming scenarios
  • Batch process large audio files during off-peak hours
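The retry advice in the first bullet can be sketched as follows. The function names, the default delays, and the use of a Retry-After header are assumptions: this page does not say whether the API sends Retry-After, so the code falls back to exponential backoff with jitter when it is absent.

```javascript
// Exponential backoff: base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: doFetch must create a fresh request on every call,
// because a request body stream cannot be reused after it has been sent.
async function fetchWithRetry(doFetch, options = {}) {
  const {
    maxRetries = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  for (let attempt = 0; ; attempt++) {
    const response = await doFetch();
    // Return on any non-429 status, or once the retry budget is used up
    if (response.status !== 429 || attempt >= maxRetries) return response;

    // Assumption: honor Retry-After (in seconds) if the server provides it
    const retryAfter = Number(response.headers.get('Retry-After'));
    const delay = retryAfter > 0
      ? retryAfter * 1000
      : backoffDelayMs(attempt) * (0.5 + Math.random() / 2); // jitter: 50-100% of the delay
    await sleep(delay);
  }
}

console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n)));
```

A call would look like fetchWithRetry(() => fetch(url, { method: 'POST', headers, body: audioFile })), with the request constructed inside the callback so that each retry sends a new one.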