Overview
The Speech-to-Text (STT) API converts audio files into text transcriptions. VoxNexus offers both a REST API and a WebSocket API for STT operations, with features such as word-level timestamps, confidence scores, and speaker diarization.
REST API
The REST API endpoint /v1/stt processes complete audio files and returns full transcription results.
Basic Usage
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @audio.wav
Query Parameters
language
Language code (e.g., zh-CN, en-US). Optional but recommended for better recognition accuracy. If not provided, the service will auto-detect the language.
sample_rate
Sample rate in Hz. Required. Supported values: 8000, 16000, 22050, 44100, 48000. Common values: 16000 for telephony, 44100 for high-quality audio.
enable_timestamps
Whether to return word-level timestamps. Default: false.
enable_confidence
Whether to return confidence scores for recognition results. Default: false.
enable_speaker_diarization
Whether to enable speaker diarization (identify different speakers). Default: false.
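Putting the parameters above together, a request URL can be assembled with URLSearchParams. This is a sketch; the parameter values are illustrative, only the parameter names come from the table above.

```javascript
// Build the /v1/stt request URL from the documented query parameters.
function buildSttUrl(params) {
  const query = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    query.set(key, String(value));
  }
  return `https://api.voxnexus.ai/v1/stt?${query.toString()}`;
}

const url = buildSttUrl({
  sample_rate: 16000,    // required
  language: 'zh-CN',     // optional, improves accuracy
  enable_timestamps: true,
  enable_confidence: true
});
console.log(url);
```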
Request Body
The request body should contain the audio file in one of the supported formats:
audio/wav
audio/mpeg (MP3)
audio/pcm
application/octet-stream
Response
{
  "request_id": "req_1234567890",
  "language": "zh-CN",
  "text": "Hello, this is a test message.",
  "confidence": 0.95,
  "duration_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "start_time_ms": 0,
      "end_time_ms": 2500
    }
  ],
  "created_at": "2024-01-01T12:00:00Z"
}
request_id
Unique identifier for this request.
language
Detected or specified language code.
text
Complete transcribed text.
confidence
Overall confidence score (0.0-1.0). Only present if enable_confidence is true.
duration_ms
Audio duration in milliseconds.
words
Word-level information array. Only present if enable_timestamps is true. Each item contains:
- word: The recognized word
- start_time_ms: Start time in milliseconds
- end_time_ms: End time in milliseconds
- confidence: Confidence score (if enabled)
speakers
Speaker information array. Only present if enable_speaker_diarization is true. Each item contains:
- speaker_id: Unique speaker identifier
- text: Text spoken by this speaker
- start_time_ms: Start time in milliseconds
- end_time_ms: End time in milliseconds
created_at
Timestamp when the transcription was created (ISO 8601 format).
Response Headers
X-Request-ID: Request identifier
X-Language: Detected language code
X-Duration-Ms: Audio duration in milliseconds
X-Confidence: Overall confidence score (if enabled)
X-RateLimit-Remaining: Remaining requests
X-Quota-Used: Credits consumed
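A small helper can pull these headers off a fetch response for logging or quota tracking. This is a sketch; it assumes only the header names listed above, and returns null for any header the service omits.

```javascript
// Extract the documented STT response headers from a Headers-like object.
function readSttHeaders(headers) {
  // Numeric headers: return null when absent instead of coercing null to 0.
  const num = (name) => {
    const value = headers.get(name);
    return value === null ? null : Number(value);
  };
  return {
    requestId: headers.get('X-Request-ID'),
    language: headers.get('X-Language'),
    durationMs: num('X-Duration-Ms'),
    confidence: num('X-Confidence'),
    rateLimitRemaining: num('X-RateLimit-Remaining'),
    quotaUsed: num('X-Quota-Used')
  };
}
```

Typical usage after a REST call: `const meta = readSttHeaders(response.headers);`.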
WebSocket API
The WebSocket API provides real-time speech recognition, ideal for live transcription scenarios.
Connection
Connect to wss://api.voxnexus.ai/v1/stt/realtime with authentication:
// Node.js example using a WebSocket client that supports custom headers
// (e.g. the `ws` package); the browser WebSocket constructor does not.
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime', {
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});
Message Flow
- Initialize: Send an init message with recognition parameters
- Send Audio: Continuously send audio messages with Base64-encoded audio chunks
- Receive Results: Receive partial (interim) and final recognition results
- Handle Errors: Monitor for error messages
Initialization Message
{
  "type": "init",
  "language": "zh-CN",
  "format": "pcm",
  "sample_rate": 16000,
  "enable_timestamps": true,
  "enable_confidence": true,
  "enable_speaker_diarization": false,
  "keywords": ["keyword1", "keyword2"],
  "custom_vocabulary": ["custom_word"]
}
Audio Message
{
  "type": "audio",
  "data": "base64-encoded-audio-chunk"
}
Partial Result Message
{
  "type": "partial",
  "text": "Hello, this is"
}
Final Result Message
{
  "type": "final",
  "text": "Hello, this is a test message.",
  "confidence": 0.95,
  "start_time_ms": 0,
  "end_time_ms": 2500,
  "words": [
    {
      "word": "hello",
      "start_time_ms": 0,
      "end_time_ms": 500,
      "confidence": 0.98
    }
  ]
}
Complete Example
// Note: the headers option requires a Node-style WebSocket client (e.g. the
// `ws` package), while getUserMedia below is browser-only; adapt the
// authentication or capture side to your runtime.
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime', {
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  }
});

let audioContext;

ws.onopen = () => {
  // Initialize recognition
  ws.send(JSON.stringify({
    type: 'init',
    format: 'pcm',
    sample_rate: 16000,
    enable_timestamps: true,
    enable_confidence: true
  }));

  // Start audio capture
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      audioContext = new AudioContext({ sampleRate: 16000 });
      const source = audioContext.createMediaStreamSource(stream);
      // ScriptProcessorNode is deprecated; AudioWorklet is the modern replacement.
      const processor = audioContext.createScriptProcessor(4096, 1, 1);

      processor.onaudioprocess = (e) => {
        // Convert float samples in [-1, 1] to 16-bit signed PCM
        const audioData = e.inputBuffer.getChannelData(0);
        const pcm16 = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          pcm16[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }
        const base64 = btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer)));
        ws.send(JSON.stringify({
          type: 'audio',
          data: base64
        }));
      };

      source.connect(processor);
      processor.connect(audioContext.destination);
    });
};
ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  switch (message.type) {
    case 'ready':
      console.log('Ready:', message.request_id);
      break;
    case 'partial':
      // Update UI with interim results
      console.log('Partial:', message.text);
      break;
    case 'final':
      // Handle final transcription
      console.log('Final:', message.text);
      console.log('Confidence:', message.confidence);
      break;
    case 'error':
      console.error('Error:', message.error);
      break;
  }
};
Best Practices
Audio Format Selection
- PCM: Best for real-time WebSocket streaming, requires exact sample rate specification
- WAV: Good for REST API, includes format headers
- MP3: Compressed format, good for file uploads, requires decoding
Sample Rate Guidelines
- 8kHz: Telephony quality, sufficient for phone recordings
- 16kHz: Standard quality, good balance of quality and file size
- 22.05kHz: Radio quality
- 44.1kHz/48kHz: High-quality audio, use for professional recordings
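The sample rate also determines how much raw data a stream produces. For 16-bit mono PCM, these back-of-the-envelope helpers (a sketch, not part of the API) show the tradeoff between quality and bandwidth:

```javascript
// Bytes per second of raw PCM: sampleRate * channels * bytesPerSample.
function pcmBytesPerSecond(sampleRate, channels = 1, bitsPerSample = 16) {
  return sampleRate * channels * (bitsPerSample / 8);
}

// Duration in milliseconds of a PCM chunk of the given byte length.
function pcmChunkDurationMs(byteLength, sampleRate, channels = 1, bitsPerSample = 16) {
  return (byteLength / pcmBytesPerSecond(sampleRate, channels, bitsPerSample)) * 1000;
}

console.log(pcmBytesPerSecond(16000));        // 32000 bytes/s at 16 kHz mono
console.log(pcmChunkDurationMs(8192, 16000)); // a 4096-sample chunk spans 256 ms
```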
Language Specification
Always specify the language when known:
# Better accuracy with language specified
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @audio.wav
Timestamps and Confidence
Enable timestamps and confidence scores when you need word-level detail:
const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true&enable_confidence=true', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'audio/wav'
  },
  body: audioFile
});

const result = await response.json();

// Use word-level timestamps for subtitles or annotations
result.words.forEach(word => {
  console.log(`${word.word} (${word.start_time_ms}-${word.end_time_ms}ms)`);
});
Speaker Diarization
Use speaker diarization for multi-speaker scenarios:
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_speaker_diarization=true" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @meeting.wav
Keywords and Custom Vocabulary
Improve recognition accuracy for domain-specific terms:
{
  "type": "init",
  "format": "pcm",
  "sample_rate": 16000,
  "keywords": ["VoxNexus", "API", "WebSocket"],
  "custom_vocabulary": ["technical_term_1", "technical_term_2"]
}
Error Handling
Implement robust error handling:
async function transcribeAudio(audioFile) {
  try {
    const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    });

    if (!response.ok) {
      const error = await response.json();
      throw new Error(error.error || `HTTP ${response.status}`);
    }

    const result = await response.json();
    return result.text;
  } catch (error) {
    console.error('STT Error:', error);
    throw error;
  }
}
Common Use Cases
Meeting Transcription
Transcribe meeting recordings with speaker identification:
async function transcribeMeeting(audioFile) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?sample_rate=44100&enable_speaker_diarization=true&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );

  const result = await response.json();

  // Format as meeting transcript
  result.speakers.forEach(speaker => {
    console.log(`Speaker ${speaker.speaker_id}: ${speaker.text}`);
  });

  return result;
}
Live Captioning
Use WebSocket API for real-time captioning:
// Stream audio and display captions in real-time
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
if (message.type === 'partial') {
// Update caption display with interim results
updateCaptionDisplay(message.text, true);
} else if (message.type === 'final') {
// Finalize caption
updateCaptionDisplay(message.text, false);
}
};
Voice Commands
Process voice commands with keyword detection:
ws.send(JSON.stringify({
  type: 'init',
  format: 'pcm',
  sample_rate: 16000,
  keywords: ['activate', 'deactivate', 'start', 'stop']
}));

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  if (message.type === 'final') {
    const text = message.text.toLowerCase();
    // Check 'deactivate' first: 'deactivate'.includes('activate') is also true.
    if (text.includes('deactivate')) {
      handleDeactivateCommand();
    } else if (text.includes('activate')) {
      handleActivateCommand();
    }
  }
};
Audio Content Indexing
Index audio content for search:
async function indexAudioContent(audioFile, metadata) {
  const response = await fetch(
    'https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'audio/wav'
      },
      body: audioFile
    }
  );

  const result = await response.json();

  // Store transcription with timestamps for search
  await storeIndexedContent({
    ...metadata,
    transcription: result.text,
    words: result.words,
    duration: result.duration_ms
  });

  return result;
}
Rate Limits and Quotas
- Monitor response headers for rate limit information
- Implement retry logic with exponential backoff for 429 responses
- Consider WebSocket API for continuous streaming scenarios
- Batch process large audio files during off-peak hours
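A retry wrapper with exponential backoff for 429 responses might look like this sketch; the retry count and delay values are illustrative defaults, not values prescribed by the API.

```javascript
// Retry an async request on HTTP 429, doubling the delay after each attempt.
async function retryWithBackoff(makeRequest, { maxRetries = 3, baseDelayMs = 1000 } = {}) {
  let attempt = 0;
  for (;;) {
    const response = await makeRequest();
    // Return anything that is not rate-limited, or give up after maxRetries.
    if (response.status !== 429 || attempt >= maxRetries) {
      return response;
    }
    const delay = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, ...
    await new Promise((resolve) => setTimeout(resolve, delay));
    attempt++;
  }
}
```

Usage: `const response = await retryWithBackoff(() => fetch(url, options));`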